[Rd] Doing the right amount of copy for large data frames.
Tony Plate
tplate at acm.org
Mon Apr 14 16:18:47 CEST 2008
Gopi Goswami wrote:
> Hi there,
>
>
> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list, tracemem( ) on R-2.6.2 says so).
>
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside of an environment slot
> of an S4 class, and define the '[', '[<-' etc. operators using setMethod( )
> and setReplaceMethod( ).
>
>
> Question ::
> This implementation will violate the copy-on-modify principle of R (since
> environments are not copied), but will save a lot of memory. Do you see any
> other obvious problem(s) with the idea?
Well, because it violates the copy-on-modify principle, it can
potentially break code that depends on that principle. I don't know how
much such code there is -- did you check whether R and the recommended
packages still pass their checks with this change in place?
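
A minimal sketch of the kind of breakage meant here, assuming the columns
live in an environment as proposed. The helper names make_store and set_col
are hypothetical and are not part of the posted implementation:

```r
# Hypothetical helpers for illustration only.
make_store <- function(df) {
  e <- new.env(hash = TRUE)
  assign("data", as.list(df), envir = e)   # columns kept as a plain list
  e
}

set_col <- function(store, name, value) {
  cols <- get("data", envir = store)
  cols[[name]] <- value
  assign("data", cols, envir = store)      # modifies the shared environment
  invisible(store)
}

a <- make_store(data.frame(xx = 1:3))
b <- a                                     # copies the reference, not the columns
set_col(b, "xx", c(10L, 20L, 30L))
a_xx <- get("data", envir = a)$xx          # changed too: a and b share one environment

# An ordinary data frame keeps value semantics.
d1 <- data.frame(xx = 1:3)
d2 <- d1
d2$xx <- c(10L, 20L, 30L)
d1_xx <- d1$xx                             # d1 is unaffected by the change to d2
```

Any function that modifies its argument and relies on the caller's copy
staying intact would behave like the first case rather than the second.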
> Have you seen a related setup
> implemented / considered before (apart from the packages like filehash, ff,
> and database related ones for saving memory)?
>
I've frequently used a personal package that stores array data in a file
(like ff). It works fine, and I partially get around the problem of
violating the copy-on-modify principle by having a read-only flag in the
object. While the flag is set to allow modification I have to be
careful, but once I set it to read-only I can use the object more freely,
knowing that if some function does attempt to modify it,
it will stop with an error.
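
A rough sketch of the read-only flag idea, using an in-memory environment
rather than a file. The names new_store, lock_store and store_set are
hypothetical and only illustrate the pattern; this is not the code of the
package described above:

```r
# Hypothetical constructor: an environment holding the columns and a flag.
new_store <- function(cols) {
  e <- new.env(hash = TRUE)
  e$cols <- cols
  e$readonly <- FALSE
  e
}

# Switch the store to read-only.
lock_store <- function(store) {
  store$readonly <- TRUE
  invisible(store)
}

# Every modification goes through this setter, which checks the flag first.
store_set <- function(store, name, value) {
  if (isTRUE(store$readonly)) {
    stop("store is read-only; modification refused")
  }
  store$cols[[name]] <- value
  invisible(store)
}

s <- new_store(list(xx = c(1, 2, 3)))
store_set(s, "xx", c(4, 5, 6))   # allowed while the flag is off
lock_store(s)
res <- tryCatch(store_set(s, "xx", c(7, 8, 9)),
                error = function(e) conditionMessage(e))
```

The flag only protects against writes that go through the setter; code that
reaches into the environment directly can still change it.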
In this particular case, why not just track down why data frame
modification is copying the entire object and suggest a change so that
it just copies the column being changed? (That should be possible, given
that list modification doesn't copy all components.)
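
One way to start tracking this down is to compare what tracemem() reports
for a list and for a data frame with the same columns. This is only a
sketch: tracemem() needs an R build with memory profiling enabled, and what
it reports depends on the R version, so no output is shown here.

```r
nn <- 10
ll <- list(xx = rnorm(nn), yy = rnorm(nn))
dd <- data.frame(xx = rnorm(nn), yy = rnorm(nn))

# tracemem() is only usable when R was built with memory profiling.
can_trace <- capabilities("profmem")

if (can_trace) {
  invisible(tracemem(ll))
  invisible(tracemem(dd))
}

ll$xx <- rnorm(nn)   # replace one component of the list
dd$xx <- rnorm(nn)   # replace one column of the data frame

if (can_trace) {
  untracemem(ll)
  untracemem(dd)
}
```

Any "tracemem[...]" lines printed during the two assignments show where a
duplicate was made, which points at the code path worth inspecting.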
-- Tony Plate
>
> Implementation code snippet ::
> ### The S4 class.
> setClass('DataFrame',
>          representation(data = 'data.frame', nrow = 'numeric',
>                         ncol = 'numeric', store = 'environment'),
>          prototype(data = data.frame( ), nrow = 0, ncol = 0))
>
> setMethod('initialize', 'DataFrame', function(.Object, ...) {
>     ## '...' carries the slot values given to new( ), e.g. data = dd1
>     .Object <- callNextMethod(.Object, ...)
>     .Object@store <- new.env(hash = TRUE)
>     ## keep the columns as a list inside the environment
>     assign('data', as.list(.Object@data), .Object@store)
>     .Object@nrow <- nrow(.Object@data)
>     .Object@ncol <- ncol(.Object@data)
>     ## drop the data.frame copy held in the slot
>     .Object@data <- data.frame( )
>     .Object
> })
>
>
> ### Usage:
> nn <- 10
> ## dd1 below could possibly be created by read.table or scan and data.frame
> dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
> dd2 <- new('DataFrame', data = dd1)
> rm(dd1)
> ## Now work with dd2
>
>
> Thanks a lot,
> Gopi Goswami.
> PhD, Statistics, 2005
> http://gopi-goswami.net/index.html
>