[R] Large file size while persisting rpart model to disk

Duncan Murdoch murdoch at stats.uwo.ca
Wed Feb 4 17:13:43 CET 2009


On 2/4/2009 10:57 AM, luke at stat.uiowa.edu wrote:
> On Wed, 4 Feb 2009, Duncan Murdoch wrote:
> 
>> One correction below, and a suggested alternative approach.
>>
>> On 2/4/2009 9:31 AM, Terry Therneau wrote:
>>>   In R, functions remember their entire calling chain.  The good thing 
>>> about this is that they can find variables further up in the nested 
>>> context, i.e.,
>>>     mfun <- function(x) { x+y}
>>> will look for 'y' in the function that called mfun, then in the function 
>>> that
>>> called the function, .... on up and then through the search() list.  This 
>>> makes
>>> life easier for certain things such as minimizers.
>>
>> This description is not right: it's not the caller, it's the environment 
>> where mfun was created.  So it applies to nested functions (as you said), but 
>> the caller is irrelevant.
>>
>>>
>>>   The bad thing is that to make this work R has to remember all of the 
>>> variables that were available up the entire chain, and 99-100% of them 
>>> aren't necessary.  (Because of constructs like get(varname) a parser can't 
>>> read the code to decide what might be needed). 
>>
>> I'm not sure what you mean by "chain" here, but the real issue is that all 
>> the variables in the function that creates mfun will be kept as long as mfun 
>> exists.
>>
>>>
>>>   This is an issue with embedded functions.  I recently noticed an extreme 
>>> case of it in the pspline routine and made changes to fix it.  The short 
>>> version
>>>   	pspline(x, ...other args) {
>>>   		some computations to define an X matrix, which can be large
>>>   		define a print function
>>>   		...
>>>   		return(list(X, printfun, other stuff))
>>>   		}
>>
>> So here printfun captures all the local variables in pspline, even if it 
>> doesn't need them.
>>
>>> It's even worse in the frailty functions, where X can be VERY large.
>>> The print function's environment wanted to 'remember' all of the temporary 
>>> work that went into defining X, plus X itself and so would be huge.  My 
>>> solution was add the line
>>> 	environment(printfun) <- new.env(parent=baseenv())
>>> which marks the function as not needing anything from the local 
>>> environment, only the base R definitions.  This would probably be a good 
>>> addition to rpart, but I need to look closer.
>>>    My first cut was to use emptyenv(), but that wasn't so smart.  It leaves 
>>> everything undefined, like "+" for instance. :-)
>>
>> Another approach is simply to rm() the variables that aren't needed before 
>> returning a function.  For example, this function has locals x and y, but 
>> only needs y for the returned function to work:
>>
>>> fnbuilder <- function(n) {
>> +    x <- numeric(n)
>> +    y <- numeric(n)
>> +    noneedforx <- function() sum(y)
>> +    rm(x)
>> +    return(noneedforx)
>> + }
>>> f <- fnbuilder(10000)
>>> f()
>> [1] 0
> 
> I would discourage the use of rm() here as it changes at runtime the
> variables that are defined for subsequent expressions.  It isn't a
> problem here since nothing much happens after the rm but in general it
> can complicate reading the code for humans or analyzing the code
> programmatically.  It is possible that using rm inside a function may
> not be fully supported under all circumstances in the future. (E.g. it
> might signal an error in compiled code or might inhibit useful
> compilation or something along those lines.)

An alternative could be x <- NULL.  This keeps the disadvantage of 
possibly messing up subsequent expressions, but it shrinks the allocation.
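
A minimal sketch of the x <- NULL idea, reusing the fnbuilder example quoted above. The names fnbuilder_keep and fnbuilder_null are made up for this comparison, and the serialize() call is only a rough proxy for the uncompressed size that would be written to disk:

```r
# Hypothetical variants of the fnbuilder example quoted above.
fnbuilder_keep <- function(n) {
    x <- numeric(n)
    y <- numeric(n)
    function() sum(y)        # x is captured although it is never used
}

fnbuilder_null <- function(n) {
    x <- numeric(n)
    y <- numeric(n)
    noneedforx <- function() sum(y)
    x <- NULL                # x stays defined, but the large vector is dropped
    noneedforx
}

f_keep <- fnbuilder_keep(10000)
f_null <- fnbuilder_null(10000)

f_null()                         # still works, it only needs y
ls(environment(f_null))          # x is still listed ...
is.null(environment(f_null)$x)   # ... but it is now NULL

# Rough proxy for the uncompressed on-disk size of each closure,
# including the environment it carries along.
length(serialize(f_keep, NULL))
length(serialize(f_null, NULL))
```

Unlike rm(x), the name x remains bound in the captured environment, so later code that refers to x sees NULL rather than failing with "object not found".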

> 
> My preference in situations where I need to control the captured
> environment is to lift the code constructing the closure to the top
> level of the package, so continuing with this example that would mean
> defining an auxiliary function that creates the closure, something
> like
> 
>      fnbuilder_y_only <- function(y)
>  	function() sum(y)
> 
>      fnbuilder <- function(n) {
>  	x <- numeric(n)
>  	y <- numeric(n)
>  	noneedforx <- fnbuilder_y_only(y)
>  	return(noneedforx)
>      }
> 
> This approach also has the advantage that the environment only
> captures what you explicitly provide, whereas with rm you risk
> forgetting to take out something large in more complicated code.

Of course that's a good way to control what gets captured, but I find it 
makes things harder to understand.  I think you've been working with 
closures and lexical scope for quite a few years more than me so this 
error wouldn't happen to you, but I'd be worried that I'd end up with 
two differing copies of y in some later revision of the code.

For example, my version could put the noneedforx definition at the top 
of the function and it would still work; yours needs to put it after the 
last change to y.
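
A small sketch of the ordering concern, under some assumptions: make_sum, early, late and nested are hypothetical names, and the force(y) call is an addition that is not in the helper quoted above. R evaluates arguments lazily, so without force(y) the helper would not look up y until the returned function is first called; with it, the value is fixed when the closure is built, which is the case where the two-copies problem shows up.

```r
# Helper in the style of fnbuilder_y_only; force() is an assumption added
# here so that y is evaluated at the time the closure is created.
make_sum <- function(y) {
    force(y)
    function() sum(y)
}

# Closure built BEFORE the last change to y: it keeps the old value.
early <- function(n) {
    y <- numeric(n)
    f <- make_sum(y)
    y <- y + 1
    f
}

# Closure built AFTER the last change to y: it sees the final value.
late <- function(n) {
    y <- numeric(n)
    y <- y + 1
    make_sum(y)
}

# Nested definition placed at the top: y is looked up in the enclosing
# environment when f is called, so the final value is used.
nested <- function(n) {
    f <- function() sum(y)
    y <- numeric(n)
    y <- y + 1
    f
}

early(5)()    # 0
late(5)()     # 5
nested(5)()   # 5
```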

Duncan Murdoch

> In principle it is possible to analyze the code of the closure
> function and only capture bindings that might be needed, but with R's
> semantics allowing functions to look into callers and such pretty much
> anything 'might be needed' unless we provide some sort of declaration
> mechanism for saying, for example, only explicitly referenced variables
> are to be considered needed.
> 
> Best,
> 
> luke
> 
>>
>> To see what actually got carried along with f, use ls():
>>
>>> ls(environment(f))
>> [1] "n"          "noneedforx" "y"
>>
>> So we've picked up the arg n, and our local copy of noneedforx, but we did 
>> manage to get rid of x.  (The local copy costs almost nothing:  R will just 
>> have another reference to the same object as f refers to.  The arg could have 
>> been rm'd too, if it was big enough to matter.)
>>
>> Duncan Murdoch
>>
>>>       	Terry Therneau
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>



