[Rd] Doing the right amount of copy for large data frames.
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue Apr 15 09:03:51 CEST 2008
On Mon, 14 Apr 2008, Gopi Goswami wrote:
> Dear All,
>
>
> Thanks a lot for your helpful comments (e.g., NAMED, ExpressionSet,
> DNAStringSet).
>
>
> Observations and questions ::
>
> ooo For a data.frame dd and a list ll with the same contents to begin with,
> the following operations show a significant difference in the maximum memory
> usage column of the gc( ) output on R-2.6.2 (the detailed code is in the PS
> section below).
>
> ll$xx <- zz
> dd$xx <- zz
>
> My understanding is that the '$<-.data.frame' S3 method above makes a copy
> of the whole dd first (using '*tmp*'). But for a list this is avoided due to
> the use of SET_VECTOR_ELT at the C-level. Is this a valid explanation, or
> is something deeper happening behind the scenes?
Something deeper -- see the 'R Internals' manual. '$<-' is primitive --
its methods are not. For the list the copy *may* be avoided, if 'll' is
the only reference to that object and R has never thought there might be
another.
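
A minimal sketch of how one might check this on a given installation,
assuming an R build with memory profiling enabled (see
capabilities("profmem")). Whether tracemem( ) reports a copy for either
assignment depends on the R version, so no expected output is shown. The
object names follow the PS code quoted below; prim_generic and prim_method
are illustrative names only.

```r
zz <- seq_len(1000000)
ll <- list(xx = zz)
dd <- data.frame(xx = zz)

# '$<-' itself is a primitive; the data.frame method is an ordinary closure.
prim_generic <- is.primitive(`$<-`)
prim_method  <- is.primitive(`$<-.data.frame`)

# tracemem() only works if R was built with memory profiling.
if (capabilities("profmem")) {
    invisible(tracemem(ll))
    invisible(tracemem(dd))
}

ll$yy <- zz   # handled by the primitive
dd$yy <- zz   # dispatched to the closure `$<-.data.frame`

if (capabilities("profmem")) {
    untracemem(ll)
    untracemem(dd)
}
```

Comparing what tracemem( ) prints for the two assignments, together with
the gc( ) figures, shows how much duplication each path incurs.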
> ooo I'll look into the read-only flag idea to avoid unhappy circumstances
> that might arise while bypassing the copy-on-modify principle. Any pointers
> or code snippets as to how to implement this idea?
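
One possible way to get a read-only flag, sketched under the assumption
that the columns are kept as a list bound to 'data' in an environment (as
in the DataFrame proposal quoted below). The helpers new_store,
set_readonly and set_column are hypothetical names; lockBinding( ),
unlockBinding( ) and bindingIsLocked( ) are base R.

```r
# Hypothetical helper: put a list of columns into a fresh environment.
new_store <- function(columns) {
    store <- new.env(hash = TRUE)
    assign("data", columns, envir = store)
    store
}

# Hypothetical helper: switch the read-only flag on or off.
set_readonly <- function(store, readonly = TRUE) {
    if (readonly) lockBinding("data", store) else unlockBinding("data", store)
    invisible(store)
}

# Hypothetical helper: add or replace one column.
set_column <- function(store, name, value) {
    cols <- get("data", envir = store)
    cols[[name]] <- value
    assign("data", cols, envir = store)   # signals an error if locked
    invisible(store)
}

st <- new_store(list(xx = 1:3))
set_column(st, "yy", 4:6)                 # allowed: binding is not locked
set_readonly(st)
res <- tryCatch(set_column(st, "zz", 7:9),
                error = function(e) "blocked")
```

Locking the binding means that any code path that tries to reassign 'data'
stops with an error, which is the behaviour described for the readonly
flag further down in the thread.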
>
>
>
> ooo The main reason I want to bypass copy-on-modify is that I want to
> emulate Python-like behavior for lists (and data.frames): I want to take
> the responsibility of making a deep copy if need be, but most of the time
> I want to knowingly change 'things in place' using the proposed S4 class
> DataFrame.
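
The behaviour asked for here is what environments already provide, so a
rough sketch may help. It assumes each column is a separate binding in a
plain environment (a simplification of the proposed class); add_column and
deep_copy are hypothetical helper names.

```r
# Hypothetical helper: changes the environment the caller passed in.
add_column <- function(store, name, value) {
    assign(name, value, envir = store)
    invisible(store)
}

# Hypothetical helper: an explicit copy, made only when asked for.
deep_copy <- function(store) {
    list2env(as.list(store, all.names = TRUE),
             envir = new.env(hash = TRUE))
}

s1 <- new.env(hash = TRUE)
add_column(s1, "xx", 1:3)

s2 <- s1                    # same environment, not a copy
add_column(s2, "yy", 4:6)   # also visible through s1

s3 <- deep_copy(s1)         # independent from here on
add_column(s3, "zz", 7:9)   # s1 is unaffected
```

Note that the copy is only as deep as the values themselves: a column that
is itself an environment would still be shared.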
>
>
> Regards,
> Gopi Goswami.
> PhD, Statistics, 2005
> http://gopi-goswami.net/index.html
>
>
>
> PS:
>
> zz <- seq_len(1000000)
> gc( )
> dd <- data.frame(xx = zz)
> dd$yy <- zz
> gc( )
> object.size(dd)
>
> ######################################################################
>
> zz <- seq_len(1000000)
> gc( )
> ll <- list(xx = zz)
> ll$yy <- zz
> gc( )
> object.size(ll)
>
>
>
>
> On Mon, Apr 14, 2008 at 10:18 AM, Tony Plate <tplate at acm.org> wrote:
>
>> Gopi Goswami wrote:
>>
>>> Hi there,
>>>
>>>
>>> Problem ::
>>> When one tries to change one or more of the columns of a data.frame, R
>>> makes a copy of the whole data.frame using the '*tmp*' mechanism (this
>>> does not happen for components of a list, tracemem( ) on R-2.6.2 says so).
>>>
>>>
>>> Suggested solution ::
>>> Store the columns of the data.frame as a list inside of an environment
>>> slot of an S4 class, and define the '[', '[<-' etc. operators using
>>> setMethod( ) and setReplaceMethod( ).
>>>
>>>
>>> Question ::
>>> This implementation will violate the copy-on-modify principle of R (since
>>> environments are not copied), but will save a lot of memory. Do you see
>>> any other obvious problem(s) with the idea?
>>>
>> Well, because it violates the copy-on-modify principle it can potentially
>> break code that depends on this principle. I don't know how much there is
>> -- did you try to see if R and recommended packages will pass checks with
>> this change in place?
>>
>>> Have you seen a related setup implemented / considered before (apart
>>> from packages like filehash, ff, and database-related ones for saving
>>> memory)?
>>>
>>>
>> I've frequently used a personal package that stores array data in a file
>> (like ff). It works fine, and I partially get around the problem of
>> violating the copy-on-modify principle by having a readonly flag in the
>> object -- when the flag is set to allow modification I have to be careful,
>> but after I set it to readonly I can use it more freely with the knowledge
>> that if some function does attempt to modify the object, it will stop with
>> an error.
>>
>> In this particular case, why not just track down why data frame
>> modification is copying the entire object and suggest a change so that it
>> just copies the column being changed? (should be possible if list
>> modification doesn't copy all components).
>>
>> -- Tony Plate
>>
>>>
>>> Implementation code snippet ::
>>> ### The S4 class.
>>> setClass('DataFrame',
>>>          representation(data = 'data.frame', nrow = 'numeric',
>>>                         ncol = 'numeric', store = 'environment'),
>>>          prototype(data = data.frame( ), nrow = 0, ncol = 0))
>>>
>>> ## '...' lets new('DataFrame', data = dd1) pass 'data' through to the
>>> ## default initializer.
>>> setMethod('initialize', 'DataFrame', function(.Object, ...) {
>>>     .Object <- callNextMethod(.Object, ...)
>>>     ## Fresh environment per object, holding the columns as a list.
>>>     .Object@store <- new.env(hash = TRUE)
>>>     assign('data', as.list(.Object@data), .Object@store)
>>>     .Object@nrow <- nrow(.Object@data)
>>>     .Object@ncol <- ncol(.Object@data)
>>>     ## Drop the data.frame copy; the columns now live in 'store'.
>>>     .Object@data <- data.frame( )
>>>     .Object
>>> })
>>>
>>>
>>> ### Usage:
>>> nn <- 10
>>> ## dd1 below could possibly be created by read.table or scan and
>>> ## data.frame
>>> dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
>>> dd2 <- new('DataFrame', data = dd1)
>>> rm(dd1)
>>> ## Now work with dd2
>>>
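
A sketch of how accessors for such a class might look, shown for '$' and
'$<-' only. It uses a simplified, hypothetical class DataFrameSketch with
just the environment slot (dimensions are not cached, so they cannot go
stale in other references to the same object); the constructor function of
the same name is also hypothetical. This is one possible shape, not the
implementation proposed in the thread.

```r
library(methods)

# Simplified variant of the proposed class: only the environment slot.
setClass("DataFrameSketch", representation(store = "environment"))

# Hypothetical constructor: columns are stored as a list bound to 'data'.
DataFrameSketch <- function(data) {
    store <- new.env(hash = TRUE)
    assign("data", as.list(data), envir = store)
    new("DataFrameSketch", store = store)
}

# Read one column by name.
setMethod("$", "DataFrameSketch", function(x, name) {
    get("data", envir = x@store)[[name]]
})

# Replace or add one column; the change is made inside the shared
# environment, so every reference to the object sees it.
setReplaceMethod("$", "DataFrameSketch", function(x, name, value) {
    cols <- get("data", envir = x@store)
    cols[[name]] <- value
    assign("data", cols, envir = x@store)
    x
})

dd2 <- DataFrameSketch(data.frame(xx = 1:3))
alias <- dd2          # shares the same environment
dd2$yy <- 4:6         # visible through 'alias' as well
```

The last two lines illustrate the departure from copy-on-modify: code that
holds 'alias' sees a column it never added.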
>>>
>>> Thanks a lot,
>>> Gopi Goswami.
>>> PhD, Statistics, 2005
>>> http://gopi-goswami.net/index.html
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>>>
>>
>>
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595