[Rd] [datatable-help] speeding up perception

Matthew Dowle mdowle at mdowle.plus.com
Tue Jul 5 20:08:23 CEST 2011


Simon (and all),

I've tried to make assignment with the usual syntax as fast as calling
`[<-.data.table` directly, for user convenience. Profiling suggests (IIUC)
that the overhead isn't method dispatch, but x being copied. Is there a
way to prevent '[<-' from copying x?  Small reproducible example in
vanilla R 2.13.0 :

> x = list(a=1:10000,b=1:10000)
> class(x) = "newclass"
> "[<-.newclass" = function(x,i,j,value) x      # i.e. do nothing
> tracemem(x)
[1] "<0xa1ec758>"
> x[1,2] = 42L
tracemem[0xa1ec758 -> 0xa1ec558]:    # but, x is still copied, why?
> 

I've tried returning NULL from [<-.newclass but then x gets assigned
NULL :

> "[<-.newclass" = function(x,i,j,value) NULL
> x[1,2] = 42L
tracemem[0xa1ec558 -> 0x9c5f318]: 
> x
NULL
> 
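For what it's worth, here is a minimal sketch of why x ends up NULL. It
assumes the expansion described in the R Language Definition (subset
assignment), where x[1,2] = 42L is evaluated roughly as a call to `[<-`
whose return value is bound back to x. The class and method are the toy
ones from above; the explicit `*tmp*` steps only illustrate that
expansion and are not literally what the interpreter runs.

```r
# Toy class and method from the example above (names are illustrative).
x <- list(a = 1:10000, b = 1:10000)
class(x) <- "newclass"
`[<-.newclass` <- function(x, i, j, value) NULL   # returns NULL on purpose

# Roughly what `x[1, 2] <- 42L` expands to (assumption: R Language
# Definition, subset assignment):
`*tmp*` <- x
x <- `[<-`(`*tmp*`, 1, 2, value = 42L)   # dispatches to `[<-.newclass`
rm(`*tmp*`)

# Whatever the method returned is now bound to x.
is.null(x)
```

So under that reading the method has to return the (modified) object,
since its return value is what gets assigned to x.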

Any pointers much appreciated. If that copy is preventable, it would
save users from needing the `[<-.data.table`(...) syntax to get the
best speed (about 20 times faster on the small example used so far).
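To compare the two calling styles on the toy class, a sketch along these
lines could be used. The loop count R here is an arbitrary assumption,
the method is the do-nothing one from above, and no timings are claimed
since they will vary by machine and R version.

```r
# Toy class with a do-nothing replacement method (illustrative only).
x <- list(a = 1:10000, b = 1:10000)
class(x) <- "newclass"
`[<-.newclass` <- function(x, i, j, value) x

R <- 1000L   # hypothetical number of iterations

# Usual replacement syntax: goes through R's complex-assignment machinery.
t_syntax <- system.time(for (r in seq_len(R)) x[r, 2] <- 42L)

# Direct call to the method: no assignment back to x takes place.
t_direct <- system.time(for (r in seq_len(R)) `[<-.newclass`(x, r, 2, 42L))

print(t_syntax)
print(t_direct)
```

Any difference between the two timings is overhead outside the method
body itself, since the method does no work in either case.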

Matthew


On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:
> Simon,
> 
> Thanks for the great suggestion. I've written a skeleton assignment
> function for data.table which incurs no copies, which works for this
> case. For completeness, if I understand correctly, this is for : 
>   i) convenience of new users who don't know how to vectorize yet
>   ii) more complex examples which can't be vectorized.
> 
> Before:
> 
> > system.time(for (r in 1:R) DT[r,20] <- 1.0)
>    user  system elapsed 
>  12.792   0.488  13.340 
> 
> After :
> 
> > system.time(for (r in 1:R) DT[r,20] <- 1.0)
>    user  system elapsed 
>   2.908   0.020   2.935
> 
> This can be reduced further by calling the method directly :
> 
> > system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
>    user  system elapsed 
>   0.132   0.000   0.131 
> > 
> 
> Still working on it. When it doesn't break other data.table tests, I'll
> commit to R-Forge ...
> 
> Matthew
> 
> 
> On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:
> > Timothée,
> > 
> > On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:
> > 
> > > Hi --
> > > 
> > > It's my first post on this list; as a relatively new user with little
> > > knowledge of R internals, I am a bit intimidated by the depth of some
> > > of the discussions here, so please spare me if I say something
> > > incredibly silly.
> > > 
> > > I feel that someone at this point should mention Matthew Dowle's
> > > excellent data.table package
> > > (http://cran.r-project.org/web/packages/data.table/index.html) which
> > > seems to me to address many of the inefficiencies of data.frame.
> > > data.tables have no row names; and operations that only need data from
> > > one or two columns are (I believe) just as quick whether the total
> > > number of columns is 5 or 1000. This results in very quick operations
> > > (and, often, elegant code as well).
> > > 
> > 
> > I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand, since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment, which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). I would propose that it should not do that, but instead handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.
> > 
> > Cheers,
> > Simon
> > 
> > 
> > > 
> > > On Mon, Jul 4, 2011 at 6:19 AM, ivo welch <ivo.welch at gmail.com> wrote:
> > >> thank you, simon.  this was very interesting indeed.  I also now
> > >> understand how far out of my depth I am here.
> > >> 
> > >> fortunately, as an end user, obviously, *I* now know how to avoid the
> > >> problem.  I particularly like the as.list() transformation and back to
> > >> as.data.frame() to speed things up without loss of (much)
> > >> functionality.
> > >> 
> > >> 
> > >> more broadly, I view the avoidance of individual access through the
> > >> use of apply and vector operations as a mixed "IQ test" and "knowledge
> > >> test" (which I often fail).  However, even for the most clever, there
> > >> are also situations where the KISS programming principle makes
> > >> explicit loops still preferable.  Personally, I would have preferred
> > >> it if R had, in its standard "statistical data set" data structure,
> > >> foregone the row names feature in exchange for retaining fast direct
> > >> access.  R could have reserved its current implementation "with row
> > >> names but slow access" for a less common (possibly pseudo-inheriting)
> > >> data structure.
> > >> 
> > >> 
> > >> If end users commonly do iterations over a data frame, which I would
> > >> guess to be the case, then the impression of R by (novice) end users
> > >> could be greatly enhanced if the extreme penalties could be eliminated
> > >> or at least flagged.  For example, I wonder if modest special internal
> > >> code could store data frames internally and transparently as lists of
> > >> vectors UNTIL a row name is assigned to.  Easier and uglier, a simple
> > >> but specific warning message could be issued with a suggestion if
> > >> there is an individual read/write into a data frame ("Warning: data
> > >> frames are much slower than lists of vectors for individual element
> > >> access").
> > >> 
> > >> 
> > >> I would also suggest changing the "Introduction to R" 6.3  from "A
> > >> data frame may for many purposes be regarded as a matrix with columns
> > >> possibly of differing modes and attributes. It may be displayed in
> > >> matrix form, and its rows and columns extracted using matrix indexing
> > >> conventions." to "A data frame may for many purposes be regarded as a
> > >> matrix with columns possibly of differing modes and attributes. It may
> > >> be displayed in matrix form, and its rows and columns extracted using
> > >> matrix indexing conventions.  However, data frames can be much slower
> > >> than matrices or even lists of vectors (which, like data frames, can
> > >> contain different types of columns) when individual elements need to
> > >> be accessed."  Reading about it immediately upon introduction could
> > >> flag the problem in a more visible manner.
> > >> 
> > >> 
> > >> regards,
> > >> 
> > >> /iaw
> > >> 
> > >> ______________________________________________
> > >> R-devel at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >> 
> > > 
> > 
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 


