[Bioc-devel] how to minimally serialize a FilterRules object

Martin Morgan martin.morgan at roswellpark.org
Wed Jul 5 23:17:00 CEST 2017


On 07/05/2017 05:12 PM, Robert Castelo wrote:
> On 05/07/2017 20:39, Martin Morgan wrote:
>> On 07/05/2017 12:59 PM, Robert Castelo wrote:
>>> dear developers,
>>>
>>> in the framework of a package i maintain, VariantFiltering, i'm using 
>>> the 'FilterRules' class defined in the S4Vectors package and i'm 
>>> interested in serializing (e.g., saving to disk via 'saveRDS()') 
>>> 'FilterRules' objects where some rules may be defined using functions.
>>>
>>> my problem is that the resulting RDS files take much more space than 
>>> expected because apparently the environment of the functions is also 
>>> serialized.
>>>
>>> a toy example reproducing the situation could be the following:
>>>
>>> library(S4Vectors)
>>>
>>> ## define a function that creates a ~7Mb numerical vector
>>> ## and returns a FilterRules object on a function that has
>>> ## nothing to do with this vector, except for sharing its
>>> ## environment. this tries to reproduce the situation in which
>>> ## a 'FilterRules' object is defined within the package
>>> ## 'VariantFiltering' where the environment is full of stuff
>>> ## unrelated to the 'FilterRules' object being created.
>>>
>>> f <- function() {
>>>    z <- rnorm(1000000)
>>>    g <- function(x) 2*x
>>
>> I guess
>>
>>     g <- function(x) 2 * x > 10
>>
>> or similar would satisfy the requirement that a FilterRules function 
>> return a logical vector of the same length as its input
>>
>>
> oops, yes of course.
> 
>>>    fr <- FilterRules(list(g=g))
>>>    fr
>>> }
>>>
>>>
>>> ## call the previous function to get the FilterRules object
>>>
>>> fr <- f()
>>>
>>>
>>> ## while the 'FilterRules' object takes 3.3 Kb ...
>>>
>>> print(object.size(fr), units="Kb")
>>> 3.3 Kb
>>>
>>>
>>> ## ... serializing it takes ~7Mb
>>>
>>> print(object.size(serialize(fr, NULL)), units="Mb")
>>> 7.6 Mb
>>>
>>
>> I added the test case
>>
>>   testthat::expect_equal(eval(fr, 1:10), rep(c(FALSE, TRUE), each=5))
>>
> but then
> 
> g <- function(x) x > 5
> 
> which is good for simplicity
> 
>>> i guess this is the expected behavior behind functions and 
>>> environments, but after reading about this subject (e.g., 
>>> http://adv-r.had.co.nz/Environments.html) i still haven't been able 
>>> to figure out how to serialize the 'FilterRules' object without the 
>>> associated environment or with a minimal one without unnecessary 
>>> objects around.
>>>
>>> i'm sure many of you will have an easy workaround for this. any help 
>>> will be highly appreciated.
>>
>> One possibility is to set the environment of g() to something that 
>> resolves appropriate symbols, e.g.,
>>
>> f <- function() {
>>     z <- rnorm(1000000)
>>     g <- function(x) 2 * x > 10
>>     environment(g) <- baseenv()
>>     FilterRules(list(g=g))
>> }
>>
>> the serialized size is then 11 kb and the test continues to pass. The 
>> environment needs to be baseenv to resolve `*` and `>`; emptyenv() is 
>> too restrictive. A package namespace might often be appropriate 
>> (though maybe large).
>>
>> Maybe that's a Hack, and Michael or others will chime in with 
>> something better...
>>
> thanks!! indeed this reduces the size down to 1 kb:
> 
> f <- function() {
>    z <- rnorm(1000000)
>    g <- function(x) x > 5
>    environment(g) <- baseenv()
>    fr <- FilterRules(list(g=g))
>    fr
> }
> 
> fr <- f()
> testthat::expect_equal(eval(fr, 1:10), rep(c(FALSE, TRUE), each=5))
> 
> print(object.size(fr), units="Kb")
> 1Kb
> print(object.size(serialize(fr, NULL)), units="Kb")
> 1Kb
> 
> how would i set the environment of the function to a package namespace?
> 
> wouldn't it make more sense to leave it with baseenv() and call 
> 'require(pkg)' within the function to load whatever the function needs 
> from package 'pkg'?

environment(g) = getNamespace("S4Vectors")

but yes, it is probably better to set the environment to baseenv() and 
fully qualify symbols as foo::bar() rather than calling require() etc.
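
A minimal sketch of both options, using base R only (plain closures rather 
than FilterRules, so the sizes will differ from the numbers quoted above). 
make_rule() and size() are hypothetical helper names, and stats::median() 
merely stands in for whatever foo::bar() a real rule would need:

```r
## Hypothetical helper: build a rule inside a function whose frame also
## holds a large vector that the rule never uses.
make_rule <- function(env = NULL) {
    z <- rnorm(1000000)                     # unrelated data in the same frame
    g <- function(x) x > stats::median(x)   # fully qualified, no require()
    if (!is.null(env))
        environment(g) <- env               # replace the enclosing frame
    g
}

## Serialized size in bytes
size <- function(f) length(serialize(f, NULL))

g_big  <- make_rule()                        # keeps the frame, including 'z'
g_base <- make_rule(baseenv())               # resolves `>` and `::` from base
g_ns   <- make_rule(getNamespace("stats"))   # resolves symbols via a namespace

size(g_big)    # dominated by the 1e6 doubles in 'z'
size(g_base)   # only the function itself
size(g_ns)     # the namespace is stored as a reference, not its contents

g_base(1:10)   # the rule still evaluates after the environment swap
```

With a namespace as the environment the rule could call median() 
unqualified; with baseenv() the stats:: prefix is what makes it resolve.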

Martin

> 
> robert.
> 
>> Martin
>>
>>>
>>>
>>> thanks!!
>>>
>>> robert.
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> This email message may contain legally privileged and/or confidential 
>> information.  If you are not the intended recipient(s), or the 
>> employee or agent responsible for the delivery of this message to the 
>> intended recipient(s), you are hereby notified that any disclosure, 
>> copying, distribution, or use of this email message is prohibited.  If 
>> you have received this message in error, please notify the sender 
>> immediately by e-mail and delete this email message from your 
>> computer. Thank you.
> 
> 

