[R] Large file size while persisting rpart model to disk
Duncan Murdoch
murdoch at stats.uwo.ca
Wed Feb 4 17:13:43 CET 2009
On 2/4/2009 10:57 AM, luke at stat.uiowa.edu wrote:
> On Wed, 4 Feb 2009, Duncan Murdoch wrote:
>
>> One correction below, and a suggested alternative approach.
>>
>> On 2/4/2009 9:31 AM, Terry Therneau wrote:
>>> In R, functions remember their entire calling chain. The good thing
>>> about this is that they can find variables further up in the nested
>>> context, i.e.,
>>> mfun <- function(x) { x+y}
>>> will look for 'y' in the function that called mfun, then in the function
>>> that called that function, ... on up and then through the search() list.
>>> This makes life easier for certain things such as minimizers.
>>
>> This description is not right: it's not the caller, it's the environment
>> where mfun was created. So it applies to nested functions (as you said), but
>> the caller is irrelevant.
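To make the distinction concrete, here is a minimal sketch (make_mfun and caller are names made up for illustration, not from the earlier messages). The free variable y is found in the environment where mfun was created, and the y defined in the caller is never consulted:

```r
make_mfun <- function() {
    y <- "creator"
    function(x) paste(x, y)     # y is a free variable of the inner function
}
mfun <- make_mfun()

caller <- function() {
    y <- "caller"               # local to caller; mfun never sees it
    mfun("y comes from the")
}

result <- caller()
print(result)                   # [1] "y comes from the creator"
```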
>>
>>>
>>> The bad thing is that to make this work R has to remember all of the
>>> variables that were available up the entire chain, and 99-100% of them
>>> aren't necessary. (Because of constructs like get(varname) a parser can't
>>> read the code to decide what might be needed).
>>
>> I'm not sure what you mean by "chain" here, but the real issue is that all
>> the variables in the function that creates mfun will be kept as long as mfun
>> exists.
>>
>>>
>>> This is an issue with embedded functions. I recently noticed an extreme
>>> case of it in the pspline routine and made changes to fix it. The short
>>> version:
>>> pspline <- function(x, ...) {   # ... = other args
>>>     # some computations to define an X matrix, which can be large
>>>     # define a print function
>>>     # ...
>>>     return(list(X, printfun))   # plus other stuff
>>> }
>>
>> So here printfun captures all the local variables in pspline, even if it
>> doesn't need them.
>>
>>> It's even worse in the frailty functions, where X can be VERY large.
>>> The print function's environment wanted to 'remember' all of the temporary
>>> work that went into defining X, plus X itself and so would be huge. My
>>> solution was to add the line
>>> environment(printfun) <- new.env(parent=baseenv())
>>> which marks the function as not needing anything from the local
>>> environment, only the base R definitions. This would probably be a good
>>> addition to rpart, but I need to look closer.
>>> My first cut was to use emptyenv(), but that wasn't so smart. It leaves
>>> everything undefined, like "+" for instance. :-)
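A small sketch of the environment reset described above, assuming a toy builder of my own (make_printer and its body are hypothetical, not the pspline code). After the reset the returned function carries an empty environment, yet base functions such as mean() and format() are still found:

```r
make_printer <- function(n) {
    X <- matrix(0, n, n)                       # large temporary, not needed later
    printfun <- function(z) format(mean(z))
    # Replace the captured local frame with a fresh, empty environment
    # whose parent is the base environment.
    environment(printfun) <- new.env(parent = baseenv())
    printfun
}

p <- make_printer(200)
captured <- ls(environment(p))
print(captured)                 # character(0)
print(p(c(1, 2, 3)))            # [1] "2"
```

Note that this only works if printfun really uses nothing from the local frame; anything it did need would have to be assigned into the new environment explicitly.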
>>
>> Another approach is simply to rm() the variables that aren't needed before
>> returning a function. For example, this function has locals x and y, but
>> only needs y for the returned function to work:
>>
>>> fnbuilder <- function(n) {
>> + x <- numeric(n)
>> + y <- numeric(n)
>> + noneedforx <- function() sum(y)
>> + rm(x)
>> + return(noneedforx)
>> + }
>>> f <- fnbuilder(10000)
>>> f()
>> [1] 0
>
> I would discourage the use of rm() here as it changes at runtime the
> variables that are defined for subsequent expressions. It isn't a
> problem here since nothing much happens after the rm but in general it
> can complicate reading the code for humans or analyzing the code
> programmatically. It is possible that using rm inside a function may
> not be fully supported under all circumstances in the future. (E.g. it
> might signal an error in compiled code or might inhibit useful
> compilation or something along those lines.)
An alternative could be x <- NULL. This keeps the disadvantage of
possibly messing up subsequent expressions, but it shrinks the allocation.
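As a rough sketch of that alternative (fnbuilder_null is just an illustrative name for a variant of the example above): the name x is still captured, but it now refers to NULL rather than to the large vector.

```r
fnbuilder_null <- function(n) {
    x <- numeric(n)
    y <- numeric(n)
    noneedforx <- function() sum(y)
    x <- NULL                   # binding stays, the big vector is released
    noneedforx
}

f <- fnbuilder_null(10000)
vars <- ls(environment(f))
print(vars)                             # [1] "n" "noneedforx" "x" "y"
print(is.null(environment(f)$x))        # [1] TRUE
print(f())                              # [1] 0
```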
>
> My preference in situations where I need to control the captured
> environment is to lift the code constructing the closure to the top
> level of the package, so continuing with this example that would mean
> defining an auxiliary function that creates the closure, something
> like
>
> fnbuilder_y_only <- function(y)
> function() sum(y)
>
> fnbuilder <- function(n) {
> x <- numeric(n)
> y <- numeric(n)
> noneedforx <- fnbuilder_y_only(y)
> return(noneedforx)
> }
>
> This approach also has the advantage that the environment only
> captures what you explicitly provide, whereas with rm you risk
> forgetting to take out something large in more complicated code.
Of course that's a good way to control what gets captured, but I find it
makes things harder to understand. I think you've been working with
closures and lexical scope for quite a few years more than me so this
error wouldn't happen to you, but I'd be worried that I'd end up with
two differing copies of y in some later revision of the code.
For example, my version could put the noneedforx definition at the top
of the function and it would still work; yours needs to put it after the
last change to y.
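Here is a sketch of the kind of mismatch I have in mind. The function names are made up, and the force(y) call is my own addition: R evaluates arguments lazily, so without it the value of y would only be taken when the closure is first used.

```r
make_sum <- function(y) {
    force(y)                    # assumption: evaluate the argument right away
    function() sum(y)
}

lifted_early <- function() {
    y <- c(1, 2)
    f <- make_sum(y)            # f gets the value of y as of this line
    y <- c(10, 20)              # later change is not seen by f
    f
}

nested_early <- function() {
    f <- function() sum(y)      # y is looked up only when f is called
    y <- c(1, 2)
    y <- c(10, 20)
    f
}

print(lifted_early()())         # [1] 3
print(nested_early()())         # [1] 30
```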
Duncan Murdoch
> In principle it is possible to analyze the code of the closure
> function and only capture bindings that might be needed, but with R's
> semantics allowing functions to look into callers and such pretty much
> anything 'might be needed' unless we provide some sort of declaration
> mechanism for saying, for example, only explicitly referenced variables
> are to be considered needed.
>
> Best,
>
> luke
>
>>
>> To see what actually got carried along with f, use ls():
>>
>>> ls(environment(f))
>> [1] "n" "noneedforx" "y"
>>
>> So we've picked up the arg n, and our local copy of noneedforx, but we did
>> manage to get rid of x. (The local copy costs almost nothing: R will just
>> have another reference to the same object as f refers to. The arg could have
>> been rm'd too, if it was big enough to matter.)
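One way to see the effect on what gets written to disk is to compare the serialized sizes of the two closures. This is only a sketch: keep_x and drop_x are hypothetical variants of fnbuilder, and the exact byte counts will depend on the R version and options, so only the comparison is shown.

```r
keep_x <- function(n) {
    x <- numeric(n)
    y <- numeric(n)
    function() sum(y)           # x stays in the captured environment
}

drop_x <- function(n) {
    x <- numeric(n)
    y <- numeric(n)
    f <- function() sum(y)
    rm(x)                       # x is removed before the closure is returned
    f
}

# serialize(obj, NULL) returns the raw bytes that save() or saveRDS()
# would have to store (before any compression).
size_keep <- length(serialize(keep_x(1e5), NULL))
size_drop <- length(serialize(drop_x(1e5), NULL))
print(size_keep > size_drop)    # the closure that kept x is larger
```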
>>
>> Duncan Murdoch
>>
>>> Terry Therneau
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>