[Rd] Doing the right amount of copy for large data frames.

Mon Apr 14 16:18:47 CEST 2008

Gopi Goswami wrote:
> Hi there,
>
>
> Problem ::
> When one tries to change one or more columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list; tracemem( ) on R-2.6.2 says so).
>
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside of an environment slot
> of an S4 class, and define the '[', '[<-' etc. operators using setMethod( )
> and setReplaceMethod( ).
>
>
> Question ::
> This implementation will violate the copy-on-modify principle of R (since
> environments are not copied), but will save a lot of memory. Do you see any
> other obvious problem(s) with the idea?
Well, because it violates the copy-on-modify principle, it can 
potentially break code that depends on this principle.  I don't know how 
much such code there is -- did you try to see if R and the recommended 
packages will pass checks with this change in place?
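
To make the concern concrete, here is a minimal sketch of how the two 
kinds of object behave when a function modifies its argument.  The 
function names set_in_list and set_in_env are made up for illustration.

```r
# A list argument is copied on modification: the caller's object is untouched.
set_in_list <- function(l) {
  l$x <- 99
  invisible(NULL)
}

# An environment argument is a reference: the caller sees the change.
set_in_env <- function(e) {
  assign("x", 99, envir = e)
  invisible(NULL)
}

l <- list(x = 1)
e <- new.env()
assign("x", 1, envir = e)

set_in_list(l)
set_in_env(e)

l$x                   # unchanged by set_in_list
get("x", envir = e)   # changed by set_in_env
```

Code written for ordinary data frames assumes the first behaviour, so an 
environment-backed replacement could change results silently.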
>  Have you seen a related setup
> implemented / considered before (apart from the packages like filehash, ff,
> and database related ones for saving memory)?
>   
I've frequently used a personal package that stores array data in a file 
(like ff).  It works fine, and I partially get around the problem of 
violating the copy-on-modify principle by having a readonly flag in the 
object -- when the flag is set to allow modification I have to be 
careful, but after I set it to readonly I can use it more freely with 
the knowledge that if some function does attempt to modify the object, 
it will stop with an error.
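
A rough in-memory sketch of the read-only flag idea follows.  It is not 
the file-backed package mentioned above; make_store, set_value and 
lock_store are hypothetical names chosen for this example.

```r
# Create an environment holding the data and a 'readonly' flag (initially FALSE).
make_store <- function(values) {
  e <- new.env()
  assign("values", values, envir = e)
  assign("readonly", FALSE, envir = e)
  e
}

# Modify in place, but refuse once the flag has been set.
set_value <- function(store, i, value) {
  if (get("readonly", envir = store)) {
    stop("object is read-only")
  }
  v <- get("values", envir = store)
  v[i] <- value
  assign("values", v, envir = store)
  invisible(store)
}

# Switch the store to read-only.
lock_store <- function(store) {
  assign("readonly", TRUE, envir = store)
  invisible(store)
}

s <- make_store(c(1, 2, 3))
set_value(s, 1, 10)      # allowed while the flag is FALSE
lock_store(s)
res <- tryCatch(set_value(s, 2, 20), error = function(err) "blocked")
```

After lock_store(s), any function that tries to modify the store through 
set_value stops with an error instead of changing shared data.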

In this particular case, why not just track down why data frame 
modification is copying the entire object and suggest a change so that 
it just copies the column being changed?  (This should be possible if 
list modification doesn't copy all components.)
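
As a starting point for tracking this down, something like the following 
could be used to compare the two cases.  It is only a sketch: tracemem( ) 
needs an R build with memory profiling enabled (hence the capabilities( ) 
check), and which copies get reported depends on the R version.

```r
nn  <- 10
lst <- list(xx = rnorm(nn), yy = rnorm(nn))
dfr <- data.frame(xx = rnorm(nn), yy = rnorm(nn))

# Only trace if this build of R supports memory profiling.
can_trace <- capabilities("profmem")
if (can_trace) {
  invisible(tracemem(lst))
  invisible(tracemem(dfr))
}

# Replace one component / column; any duplication of a traced object
# is reported as a 'tracemem[...]' line on the console.
lst$xx <- rnorm(nn)
dfr$xx <- rnorm(nn)

if (can_trace) {
  untracemem(lst)
  untracemem(dfr)
}
```

Comparing the messages for lst and dfr shows whether the data frame 
assignment duplicates more than the list assignment does.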

-- Tony Plate
>
> Implementation code snippet ::
> ### The S4 class.
> setClass('DataFrame',
>               representation(data = 'data.frame', nrow = 'numeric',
>                              ncol = 'numeric', store = 'environment'),
>               prototype(data = data.frame( ), nrow = 0, ncol = 0))
>
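> setMethod('initialize', 'DataFrame', function(.Object, ...) {
>     ## Pass '...' on so that new('DataFrame', data = dd1) can fill the slots.
>     .Object <- callNextMethod(.Object, ...)
>     ## Keep the columns as a list inside a fresh environment.
>     .Object@store <- new.env(hash = TRUE)
>     assign('data', as.list(.Object@data), envir = .Object@store)
>     .Object@nrow <- nrow(.Object@data)
>     .Object@ncol <- ncol(.Object@data)
>     ## Drop the original data.frame so only the environment holds the data.
>     .Object@data <- data.frame( )
>     .Object
> })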
>
>
> ### Usage:
> nn  <- 10
> ## dd1 below could possibly be created by read.table or scan and data.frame
> dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
> dd2 <- new('DataFrame', data = dd1)
> rm(dd1)
> ## Now work with dd2
>
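
The '[' and '[<-' methods mentioned in the suggested solution are not 
shown in the snippet.  One possible shape for them is sketched below, 
under assumptions that are not in the original post: the class is called 
EnvFrame to avoid clashing with DataFrame, the constructor envFrame is a 
made-up helper, each column is stored as its own variable in the 
environment (rather than as one list called 'data'), and only column 
access by name is handled.

```r
library(methods)

# Hypothetical class: just an environment holding one variable per column.
setClass("EnvFrame", representation(store = "environment"))

# Hypothetical constructor: copy the columns once into a fresh environment.
envFrame <- function(df) {
  store <- new.env(hash = TRUE)
  for (nm in names(df)) {
    assign(nm, df[[nm]], envir = store)
  }
  new("EnvFrame", store = store)
}

# x[, "name"]: return one column; the row index i is ignored in this sketch.
setMethod("[", "EnvFrame", function(x, i, j, ..., drop = TRUE) {
  get(j, envir = x@store)
})

# x[, "name"] <- value: replace one column in place, leaving the others alone.
setReplaceMethod("[", "EnvFrame", function(x, i, j, ..., value) {
  assign(j, value, envir = x@store)
  x
})

dd1 <- data.frame(xx = c(1, 2, 3), yy = c(4, 5, 6))
dd2 <- envFrame(dd1)
dd3 <- dd2                     # shares the environment, so not an independent copy
dd2[, "xx"] <- c(10, 20, 30)
dd3[, "xx"]                    # reflects the change made through dd2
```

The last two lines show the trade-off discussed in this thread: no column 
is duplicated, but dd3 changes along with dd2.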
>
> Thanks a lot,
> Gopi Goswami.
> PhD, Statistics, 2005
> http://gopi-goswami.net/index.html
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>