[Rd] Doing the right amount of copy for large data frames.

Mon Apr 14 17:59:15 CEST 2008

Hi Gopi

"Gopi Goswami" <grgoswami at gmail.com> writes:

> Hi there,
>
>
> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list, tracemem( ) on R-2.6.2 says so).
>
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside of an environment slot
> of an S4 class, and define the '[', '[<-' etc. operators using setMethod( )
> and setReplaceMethod( ).

The Bioconductor package Biobase has a class 'ExpressionSet' with slot
assayData. By default assayData is an environment that is 'locked', so
it can't be modified casually. The interface to ExpressionSet unlocks the
environment, and copies and modifies it when necessary. This is not
quite the same as you propose, but has some similar characteristics.
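
As a rough sketch of the 'locked environment, copy when modifying' idea,
using only base R: the names store, store2 and copyAndSet below are made
up for illustration, and this is not how Biobase itself is implemented.

```r
## Data lives in an environment that is locked against casual changes.
store <- new.env()
assign("exprs", matrix(0, 2, 2), envir = store)
lockEnvironment(store, bindings = TRUE)  # no new bindings, no changed ones

## A casual modification now signals an error instead of succeeding.
res <- tryCatch({
    assign("exprs", matrix(1, 2, 2), envir = store)
    "modified"
}, error = function(e) "locked")

## Hypothetical helper: copy the environment, modify the copy, lock it.
copyAndSet <- function(env, name, value) {
    out <- list2env(as.list(env, all.names = TRUE), envir = new.env())
    assign(name, value, envir = out)
    lockEnvironment(out, bindings = TRUE)
    out
}

store2 <- copyAndSet(store, "exprs", matrix(1, 2, 2))
```

The original store is left untouched, and the class author (not R) is the
one deciding when a copy happens.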

I've spent a lot of time with this data structure, and think this
borders on one of those ideas that 'seemed like a good idea at the
time'. You end up using R-level tools to manage memory. Copy-on-change
is better than you might naively think at not making unnecessary
copies. S4 carries significant overhead, including copies during method
dispatch, that work against you (subsetting an expression set in an
OOP way, no behind-the-scenes tricks, makes *5* copies of the S4
instance, though perhaps these are light-weight because the big data
is in an environment). And in the meantime computers have gotten
faster and bigger, and the 'big' data of ExpressionSets are now only
modestly sized or even small.
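
To make 'using R-level tools to manage memory' concrete, here is a small
sketch with a made-up class, EnvBacked (not part of Biobase). Copying the
S4 instance does not copy the environment in its slot, so both names see
the same data; an ordinary list keeps the usual value semantics.

```r
library(methods)

## Hypothetical class whose data lives in an environment slot.
setClass("EnvBacked", representation(store = "environment"))

a <- new("EnvBacked", store = new.env())
assign("x", 1:3, envir = a@store)

b <- a                              # copies the instance, not the environment
assign("x", 4:6, envir = b@store)   # change made through b ...
seenViaA <- get("x", envir = a@store)   # ... is visible through a

## Ordinary list for comparison: changing the copy leaves l1 alone.
l1 <- list(x = 1:3)
l2 <- l1
l2$x <- 4:6
```

If a and b are meant to behave like independent R objects, the class
author has to copy the environment explicitly before modifying it.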

A somewhat different approach is in the Biostrings package, for
instance DNAStringSet, where the original object is 'read-only'. The
user is presented with a 'view' into the object; changing the view
(subsetting) changes the indices in the view but not the original
data. This is both fast and memory efficient. This is a read-only
solution, though.
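
A toy version of the 'view' idea, in plain R. The functions makeView,
subsetView and materialize are invented for this sketch and are not the
Biostrings API; the point is only that subsetting touches the indices and
never the data.

```r
## A view is the (never modified) data plus the indices currently in view.
makeView <- function(data, idx = seq_along(data)) {
    list(data = data, idx = idx)
}

## Subsetting a view only re-indexes the index vector.
subsetView <- function(view, i) {
    view$idx <- view$idx[i]
    view
}

## Extract the elements currently in view.
materialize <- function(view) view$data[view$idx]

seqs <- c("ACGT", "GGCC", "TTAA", "CATG")
v  <- makeView(seqs)
v2 <- subsetView(v, c(2, 4))   # elements 2 and 4 of the original
v3 <- subsetView(v2, 2)        # second element of v2, i.e. original 4
```

Since the data component is never assigned to, R's copy-on-change should
let the views share it rather than duplicate it.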

Hope that helps, Martin

> Question ::
> This implementation will violate the copy-on-modify principle of R (since
> environments are not copied), but will save a lot of memory. Do you see any
> other obvious problem(s) with the idea? Have you seen a related setup
> implemented / considered before (apart from the packages like filehash, ff,
> and database related ones for saving memory)?
>
>
> Implementation code snippet ::
> ### The S4 class.
> setClass('DataFrame',
>          representation(data = 'data.frame', nrow = 'numeric',
>                         ncol = 'numeric', store = 'environment'),
>          prototype(data = data.frame(), nrow = 0, ncol = 0))
>
> ## '...' lets new('DataFrame', data = dd1) pass slot values through.
> setMethod('initialize', 'DataFrame', function(.Object, ...) {
>     .Object <- callNextMethod()
>     ## Move the columns into an environment, then drop the original.
>     .Object@store <- new.env(hash = TRUE)
>     assign('data', as.list(.Object@data), .Object@store)
>     .Object@nrow <- nrow(.Object@data)
>     .Object@ncol <- ncol(.Object@data)
>     .Object@data <- data.frame()
>     .Object
> })
>
>
> ### Usage:
> nn  <- 10
> ## dd1 below could possibly be created by read.table or scan and data.frame
> dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
> dd2 <- new('DataFrame', data = dd1)
> rm(dd1)
> ## Now work with dd2
>
>
> Thanks a lot,
> Gopi Goswami.
> PhD, Statistics, 2005
> http://gopi-goswami.net/index.html

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793