[Rd] Light-weight data.frame class: was: how to add method to .Primitive function

Vadim Ogranovich vograno at evafunds.com
Tue May 10 20:27:50 CEST 2005


Thanks again! BTW, how did you find the code for "[.default"? I tried:
> get("[.default")
Error in get(x, envir, mode, inherits) : variable "[.default" was not
found 

> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com] 
> Sent: Monday, May 09, 2005 9:46 PM
> To: Vadim Ogranovich
> Cc: r-devel at stat.math.ethz.ch; simon.urbanek at r-project.org
> Subject: Re: [Rd] Light-weight data.frame class: was: how to 
> add method to .Primitive function
> 
> "[.default" is implemented in R as .subset.  See ?.subset and 
> note that it begins with a dot.  e.g. for the case where i 
> and j are not missing:
> 
> "[.lwdf" <- function(x, i, j) lapply(.subset(x,j), "[", i)
> 
> 
> 
> On 5/8/05, Vadim Ogranovich <vograno at evafunds.com> wrote:
> > Hi,
> > 
> > Encouraged by a tip from Simon Urbanek I tried to use the 
> S3 machinery 
> > to write a faster version of the data.frame class.
> > This quickly hits a snag: the "[.default"(x, i) for some 
> reason cares 
> > about the dimensionality of x.
> > In the end there is a full transcript of my R session. It 
> includes the 
> > motivation for writing the class and the problems I have 
> encountered.
> > 
> > As a result I see three issues here:
> > * why "[.default"(x, i) doesn't work if dim(x) is 2? After all a 
> > single subscript into a vector works regardless of whether it's a 
> > matrix or not. Is there an alternative way to access "[.default"?
> > * why does unclass() make deep copy? This is a facet of the global 
> > over-conservatism of R with respect to copying.
> > * is it possible to add some sort copy profiling to R? 
> Something like 
> > copyProfiling(TRUE), which should cause R to log sizes of 
> each copied 
> > object (just raw sizes w/o any attempt to identify the 
> object). This 
> > feature should at least help assess the magnitude of the problem.
> > 
> > Thanks,
> > Vadim
> > 
> > Now the transcript itself:
> > > # the motivation: subscription of a data.frame is *much* 
> (almost 20
> > times) slower than that of a list
> > > # compare
> > > n = 1e6
> > > i = seq(n)
> > >
> > > x = data.frame(a=seq(n), b=seq(n))
> > > system.time(x[i,], gcFirst=TRUE)
> > [1] 1.01 0.14 1.14 0.00 0.00
> > >
> > > x = list(a=seq(n), b=seq(n))
> > > system.time(lapply(x, function(col) col[i]), gcFirst=TRUE)
> > [1] 0.06 0.00 0.06 0.00 0.00
> > >
> > >
> > > # the solution: define methods for the light-weight 
> data.frame class 
> > > lwdf = function(...) structure(list(...), class = "lwdf")
> > >
> > > # dim
> > > dim.lwdf = function(x) c(length(x[[1]]), length(x))
> > >
> > > # for pretty printing we define print.lwdf via a conversion to
> > data.frame
> > > # as.data.frame.lwdf
> > > as.data.frame.lwdf = function(x) structure(unclass(x),
> > class="data.frame", row.names=as.character(seq(nrow(x))))
> > >
> > > # print
> > > print.lwdf = function(x) print.data.frame(as.data.frame.lwdf(x))
> > >
> > > # now the real stuff
> > >
> > > # "["
> > > # the naive "[.lwdf" = function (x, i, j) lapply(x[j], 
> function(col)
> > col[i])
> > > # won't work because evaluation of x[j] calls "[.lwdf" 
> again and not
> > "[.default"
> > > # so we switch by the number of arguments "[.lwdf" = 
> function (x, i, 
> > > j) {
> > +   if (nargs() == 2)
> > +     NextMethod("[", x, i)
> > +   else
> > +     structure(lapply(x[j], function(col) col[i]),  class 
> = "lwdf") }
> > >
> > > x = lwdf(a=seq(3), b=letters[seq(3)], 
> c=as.factor(letters[seq(3)])) 
> > > i = c(1,3); j = c(1,3)
> > >
> > > # unfortunately, for some reasons "[.default" cares about
> > dimensionality of its argument
> > > x[i,j]
> > Error in "[.default"(x, j) : incorrect number of dimensions
> > >
> > >
> > > # we could use unclass to get it right "[.lwdf" = 
> function (x, i, j) 
> > > {
> > +   structure(lapply(unclass(x)[j], function(col) col[i]),  class =
> > "lwdf")
> > + }
> > >
> > > x[i,j]
> >  a c
> > 1 1 a
> > 2 3 c
> > >
> > > # *but* unclass creates a deep copy of its argument as indirectly
> > evidenced by the following timing
> > > x = lwdf(a=seq(1e6)); system.time(unclass(x))
> > [1] 0.01 0.00 0.01 0.00 0.00
> > > x = lwdf(a=seq(1e8)); system.time(unclass(x))
> > [1] 0.44 0.39 0.82 0.00 0.00
> > 
> > > version
> >         _
> > platform x86_64-unknown-linux-gnu
> > arch     x86_64
> > os       linux-gnu
> > system   x86_64, linux-gnu
> > status
> > major    2
> > minor    0.1
> > year     2004
> > month    11
> > day      15
> > language R
> > 
> > ______________________________________________
> > R-devel at stat.math.ethz.ch mailing list 
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>



More information about the R-devel mailing list