[Rd] [datatable-help] speeding up perception

luke-tierney at uiowa.edu luke-tierney at uiowa.edu
Wed Jul 6 01:18:58 CEST 2011


On Tue, 5 Jul 2011, Simon Urbanek wrote:

>
> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>
>> Simon (and all),
>>
>> I've tried to make assignment as fast as calling `[<-.data.table`
>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>> copying x?
>
> Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to
>
> `*tmp*` <- x
> x <- `[<-`(`*tmp*`, i, j, value)
> rm(`*tmp*`)
>
> so there is always a copy involved.
>
> Now, a conceptual copy doesn't mean real copy in R since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on.
>
> Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain so there will always be two references so duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around (unless we handle subassignment methods is some special way).

I don't believe dispatch is bumping NAMED (and a quick experiment
seems to confirm this though I don't guarantee I did that right). The
issue is that a replacement function implemented as a closure, which
is the only option for a package, will always see NAMED on the object
to be modified as 2 (because the value is obtained by forcing the
argument promise) and so any R level assignments will duplicate.  This
also isn't really an issue of imprecise reference counting -- there
really are (at least) two legitimate references -- one though the
argument and one through the caller's environment.

It would be good it we could come up with a way for packages to be
able to define replacement functions that do not duplicate in cases
where we really don't want them to, but this would require coming up
with some sort of protocol, minimally involving an efficient way to
detect whether a replacement funciton is bing called in a replacement
context or directly.

There are some replacement functions that use C code to cheat, but
these may create problems if called directly, so I won't advertise
them.

Best,

luke

>
> Cheers,
> Simon
>
>
>
>>  Small reproducible example in vanilla R 2.13.0 :
>>
>>> x = list(a=1:10000,b=1:10000)
>>> class(x) = "newclass"
>>> "[<-.newclass" = function(x,i,j,value) x      # i.e. do nothing
>>> tracemem(x)
>> [1] "<0xa1ec758>"
>>> x[1,2] = 42L
>> tracemem[0xa1ec758 -> 0xa1ec558]:    # but, x is still copied, why?
>>>
>>
>> I've tried returning NULL from [<-.newclass but then x gets assigned
>> NULL :
>>
>>> "[<-.newclass" = function(x,i,j,value) NULL
>>> x[1,2] = 42L
>> tracemem[0xa1ec558 -> 0x9c5f318]:
>>> x
>> NULL
>>>
>>
>> Any pointers much appreciated. If that copy is preventable it should
>> save the user needing to use `[<-.data.table`(...) syntax to get the
>> best speed (20 times faster on the small example used so far).
>>
>> Matthew
>>
>>
>> On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:
>>> Simon,
>>>
>>> Thanks for the great suggestion. I've written a skeleton assignment
>>> function for data.table which incurs no copies, which works for this
>>> case. For completeness, if I understand correctly, this is for :
>>>  i) convenience of new users who don't know how to vectorize yet
>>>  ii) more complex examples which can't be vectorized.
>>>
>>> Before:
>>>
>>>> system.time(for (r in 1:R) DT[r,20] <- 1.0)
>>>   user  system elapsed
>>> 12.792   0.488  13.340
>>>
>>> After :
>>>
>>>> system.time(for (r in 1:R) DT[r,20] <- 1.0)
>>>   user  system elapsed
>>>  2.908   0.020   2.935
>>>
>>> Where this can be reduced further as follows :
>>>
>>>> system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
>>>   user  system elapsed
>>>  0.132   0.000   0.131
>>>>
>>>
>>> Still working on it. When it doesn't break other data.table tests, I'll
>>> commit to R-Forge ...
>>>
>>> Matthew
>>>
>>>
>>> On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:
>>>> Timothée,
>>>>
>>>> On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:
>>>>
>>>>> Hi --
>>>>>
>>>>> It's my first post on this list; as a relatively new user with little
>>>>> knowledge of R internals, I am a bit intimidated by the depth of some
>>>>> of the discussions here, so please spare me if I say something
>>>>> incredibly silly.
>>>>>
>>>>> I feel that someone at this point should mention Matthew Dowle's
>>>>> excellent data.table package
>>>>> (http://cran.r-project.org/web/packages/data.table/index.html) which
>>>>> seems to me to address many of the inefficiencies of data.frame.
>>>>> data.tables have no row names; and operations that only need data from
>>>>> one or two columns are (I believe) just as quick whether the total
>>>>> number of columns is 5 or 1000. This results in very quick operations
>>>>> (and, often, elegant code as well).
>>>>>
>>>>
>>>> I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand since it simply does a pass-though for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.
>>>>
>>>> Cheers,
>>>> Simon
>>>>
>>>>
>>>>>
>>>>> On Mon, Jul 4, 2011 at 6:19 AM, ivo welch <ivo.welch at gmail.com> wrote:
>>>>>> thank you, simon.  this was very interesting indeed.  I also now
>>>>>> understand how far out of my depth I am here.
>>>>>>
>>>>>> fortunately, as an end user, obviously, *I* now know how to avoid the
>>>>>> problem.  I particularly like the as.list() transformation and back to
>>>>>> as.data.frame() to speed things up without loss of (much)
>>>>>> functionality.
>>>>>>
>>>>>>
>>>>>> more broadly, I view the avoidance of individual access through the
>>>>>> use of apply and vector operations as a mixed "IQ test" and "knowledge
>>>>>> test" (which I often fail).  However, even for the most clever, there
>>>>>> are also situations where the KISS programming principle makes
>>>>>> explicit loops still preferable.  Personally, I would have preferred
>>>>>> it if R had, in its standard "statistical data set" data structure,
>>>>>> foregone the row names feature in exchange for retaining fast direct
>>>>>> access.  R could have reserved its current implementation "with row
>>>>>> names but slow access" for a less common (possibly pseudo-inheriting)
>>>>>> data structure.
>>>>>>
>>>>>>
>>>>>> If end users commonly do iterations over a data frame, which I would
>>>>>> guess to be the case, then the impression of R by (novice) end users
>>>>>> could be greatly enhanced if the extreme penalties could be eliminated
>>>>>> or at least flagged.  For example, I wonder if modest special internal
>>>>>> code could store data frames internally and transparently as lists of
>>>>>> vectors UNTIL a row name is assigned to.  Easier and uglier, a simple
>>>>>> but specific warning message could be issued with a suggestion if
>>>>>> there is an individual read/write into a data frame ("Warning: data
>>>>>> frames are much slower than lists of vectors for individual element
>>>>>> access").
>>>>>>
>>>>>>
>>>>>> I would also suggest changing the "Introduction to R" 6.3  from "A
>>>>>> data frame may for many purposes be regarded as a matrix with columns
>>>>>> possibly of differing modes and attributes. It may be displayed in
>>>>>> matrix form, and its rows and columns extracted using matrix indexing
>>>>>> conventions." to "A data frame may for many purposes be regarded as a
>>>>>> matrix with columns possibly of differing modes and attributes. It may
>>>>>> be displayed in matrix form, and its rows and columns extracted using
>>>>>> matrix indexing conventions.  However, data frames can be much slower
>>>>>> than matrices or even lists of vectors (which, like data frames, can
>>>>>> contain different types of columns) when individual elements need to
>>>>>> be accessed."  Reading about it immediately upon introduction could
>>>>>> flag the problem in a more visible manner.
>>>>>>
>>>>>>
>>>>>> regards,
>>>>>>
>>>>>> /iaw
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>
>>>>>
>>>>> ______________________________________________
>>>>> R-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Luke Tierney
Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


More information about the R-devel mailing list