[Rd] [datatable-help] speeding up perception
Simon Urbanek
simon.urbanek at r-project.org
Tue Jul 12 15:59:41 CEST 2011
On Jul 12, 2011, at 6:24 AM, Matthew Dowle wrote:
>> Matthew,
>>
>> I was hoping I misunderstood you first proposal, but I suspect I did not
>> ;).
>>
>> Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to
>> evaluate to
>> { V1 <- 3; DT[1, V1] }
>> thus returning the first element of the third column.
>
> Please see FAQ 1.1, since further below it seems to be an expectation
> issue about 'with' syntax, too.
>
Just to clarify - the NEWS has led me to believe that the destructive DT[i, x <- y] syntax is new. That is what my objection is about. I'm fine with subsetting operators working on expressions, but I'm not happy with subsetting operators modifying the object they are subsetting - it's subsetting, not subassignment - and that's what I was referring to.
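To spell out the two readings (an illustrative sketch only, not run against 1.6.3):

# (1) what I expected: an ordinary assignment followed by a subset,
#     returning the first element of the third column
V1 <- 3
DT[1, V1]        # i.e. DT[1, 3] under data.frame-style semantics

# (2) what the NEWS describes: the assignment in j updates DT itself,
#     writing 3 into column V1 of row 1, by reference
DT[1, V1 <- 3]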
>> That said, I don't think it works, either. Taking you example and
>> data.table form r-forge:
> [ snip ]
>> as you can see, DT is not modified.
>
> Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce
> the non-working state I'll need some more environment information please.
>
The issue persists on several machines I tested - including R 2.13.0:
> sessionInfo()
R version 2.13.0 Patched (2011-05-15 r55914)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.6.3
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-unknown-linux-gnu/amd64 (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.6.3
> DT = as.data.table(m)
> for (i in 1:1000) DT[1,V1 <- 3]
> DT[1,]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
[1,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>> Also I suspect there is something quite amiss because even trivial things
>> don't work:
>>
>>> DF[1:4,1:4]
>> V1 V2 V3 V4
>> 1 3 1 1 1
>> 2 1 1 1 1
>> 3 1 1 1 1
>> 4 1 1 1 1
>>> DT[1:4,1:4]
>> [1] 1 2 3 4
>
> That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9
> and 1.10.
>
Fair enough, I expected data.table to be a drop-in replacement for data.frames - I just wanted to check the values. Apparently it's not, by design, hence my assumption was wrong.
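For the record, the selection I was after can still be expressed - a sketch assuming the j-expression and with= semantics described in the FAQ (untested here against 1.6.3):

DT[1:4, list(V1,V2,V3,V4)]     # j is evaluated as an expression within DT
DT[1:4, 1:4, with=FALSE]       # column positions, if with= is available in 1.6.3
as.data.frame(DT)[1:4, 1:4]    # or fall back to data.frame semantics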
>>
>> When I first saw your proposal, I thought you had rather something like
>> within(DT, V1[1] <- 3)
>> in mind, which looks innocent enough but performs terribly (note that I had
>> to scale down the loop by a factor of 100!!!):
>>
>>> system.time(for (i in 1:10) within(DT, V1[1] <- 3))
>> user system elapsed
>> 2.701 4.437 7.138
>
> No, since 'with' is already built into data.table, I was thinking of
> building 'within' in, too. I'll take a look at within(). Might as well
> provide as many options as possible to the user to use as they wish.
>
>> With the for loop, something like within(DF, for (i in 1:1000) V1[i] <- 3)
>> performs reasonably:
>>
>>> system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
>> user system elapsed
>> 0.392 0.613 1.003
>>
>> (Note: system.time() can be misleading when within() is involved, because
>> the expression is evaluated in a different environment so within() won't
>> actually change the object in the global environment - it also interacts
>> with the possible duplication)
>
> Noted, thanks. That's pretty fast. Does within() on data.frame fix the
> original issue Ivo raised, then? If so, job done.
>
I don't think so - at least not in the strict sense of no copies (more digging may be needed, though, since it does copy inside system.time, possibly due to the NAMED value of the forced promise, but I did not check). However, it allows expressing the modification inside the expression, which saves the global copy and is thus faster than the outside loop.
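A minimal sketch of what I mean (using the objects from the earlier timings; note that within() returns the modified object, so it has to be assigned back for the change to survive in the global environment):

# one conceptual copy up front, then cheap element assignments inside the expression
DF <- within(DF, for (i in 1:1000) V1[i] <- 3)

# versus the outside loop, which goes through [<-.data.frame (and its copy) on each iteration
for (i in 1:1000) DF[i, 1] <- 3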
Cheers,
Simon
>>
>> Cheers,
>> Simon
>>
>> On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:
>>
>>> Thanks for the replies and info. An attempt at fast
>>> assign is now committed to data.table v1.6.3 on
>>> R-Forge. From NEWS :
>>>
>>> o Fast update is now implemented, FR#200.
>>> DT[i,j]<-value is now handled by data.table in C rather
>>> than falling through to data.frame methods.
>>>
>>> Thanks to Ivo Welch for raising speed issues on r-devel,
>>> to Simon Urbanek for the suggestion, and Luke Tierney and
>>> Simon for information on R internals.
>>>
>>> [<- syntax still incurs one working copy of the whole
>>> table (as of R 2.13.0) due to R's [<- dispatch mechanism
>>> copying to `*tmp*`, so, for ultimate speed and brevity,
>>> 'within' syntax is now available as follows.
>>>
>>> o A new 'within' argument has been added to [.data.table,
>>> by default TRUE. It is very similar to the within()
>>> function in base R. If an assignment appears in j, it
>>> assigns to the column of DT, by reference; e.g.,
>>>
>>> DT[i,colname<-value]
>>>
>>> This syntax makes no copies of any part of memory at all.
>>>
>>>> m = matrix(1,nrow=100000,ncol=100)
>>>> DF = as.data.frame(m)
>>>> DT = as.data.table(m)
>>>> system.time(for (i in 1:1000) DF[1,1] <- 3)
>>> user system elapsed
>>> 287.730 323.196 613.453
>>>> system.time(for (i in 1:1000) DT[1,V1 <- 3])
>>> user system elapsed
>>> 1.152 0.004 1.161 # 528 times faster
>>>
>>> Please note :
>>>
>>> *******************************************************
>>> ** Within syntax is presently highly experimental. **
>>> *******************************************************
>>>
>>> http://datatable.r-forge.r-project.org/
>>>
>>>
>>> On Wed, 2011-07-06 at 09:08 -0500, luke-tierney at uiowa.edu wrote:
>>>> On Wed, 6 Jul 2011, Simon Urbanek wrote:
>>>>
>>>>> Interesting, and I stand corrected:
>>>>>
>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>> .Internal(inspect(x))
>>>>> @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>> @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>> @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>> x[1,1]=42L
>>>>>> .Internal(inspect(x))
>>>>> @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>> @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>> @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>> x[[1]][1]=42L
>>>>>> .Internal(inspect(x))
>>>>> @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
>>>>> @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>> @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>> x[[1]][1]=42L
>>>>>> .Internal(inspect(x))
>>>>> @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>> @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>> @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>
>>>>> I have R to release ;) so I won't be looking into this right now, but
>>>>> it's something worth investigating ... Since all the inner contents
>>>>> have NAMED=0 I would not expect any duplication to be needed, but
>>>>> apparently it becomes so at some point ...
>>>>
>>>>
>>>> The internals assume in various places that deep copies are made (one
>>>> of the reasons NAMED settings are not propagated to sub-structures).
>>>> The main issues are avoiding cycles and that there is no easy way to
>>>> check for sharing. There may be some circumstances in which a shallow
>>>> copy would be OK but making sure it would be in all cases is probably
>>>> more trouble than it is worth at this point. (I've tried this in the
>>>> past in a few cases and always had to back off.)
>>>>
>>>>
>>>> Best,
>>>>
>>>> luke
>>>>
>>>>>
>>>>> Cheers,
>>>>> Simon
>>>>>
>>>>>
>>>>> On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
>>>>>
>>>>>>
>>>>>> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>>>>>>> No subassignment function satisfies that condition, because you can
>>>>>>> always call them directly. However, that doesn't stop the default
>>>>>>> method from making that assumption, so I'm not sure it's an issue.
>>>>>>>
>>>>>>> David, Just to clarify - the data frame content is not copied, we
>>>>>>> are talking about the vector holding columns.
>>>>>>
>>>>>> If it is just the vector holding the columns that is copied (and not
>>>>>> the columns themselves), why does n make a difference in this test
>>>>>> (on R 2.13.0)?
>>>>>>
>>>>>>> n = 1000
>>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>> user system elapsed
>>>>>> 0.628 0.000 0.628
>>>>>>> n = 100000
>>>>>>> x = data.frame(a=1:n,b=1:n) # still 2 columns, but longer columns
>>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>> user system elapsed
>>>>>> 20.145 1.232 21.455
>>>>>>>
>>>>>>
>>>>>> With $<- :
>>>>>>
>>>>>>> n = 1000
>>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>> user system elapsed
>>>>>> 0.304 0.000 0.307
>>>>>>> n = 100000
>>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>> user system elapsed
>>>>>> 37.586 0.388 38.161
>>>>>>>
>>>>>>
>>>>>> If it's because the 1st column needs to be copied (only) because
>>>>>> that's the one being assigned to (in this test), that magnitude of
>>>>>> slow down doesn't seem consistent with the time of a vector copy of
>>>>>> the 1st column :
>>>>>>
>>>>>>> n=100000
>>>>>>> v = 1:n
>>>>>>> system.time(for (i in 1:1000) v[1] <- 42L)
>>>>>> user system elapsed
>>>>>> 0.016 0.000 0.017
>>>>>>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>>>>>> user system elapsed
>>>>>> 1.816 1.076 2.900
>>>>>>
>>>>>> Finally, increasing the number of columns, again only the 1st is
>>>>>> assigned to :
>>>>>>
>>>>>>> n=100000
>>>>>>> x = data.frame(rep(list(1:n),100))
>>>>>>> dim(x)
>>>>>> [1] 100000 100
>>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>> user system elapsed
>>>>>> 167.974 50.903 219.711
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Simon
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsemius at comcast.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 5, 2011, at 7:18 PM, <luke-tierney at uiowa.edu>
>>>>>>>> <luke-tierney at uiowa.edu> wrote:
>>>>>>>>
>>>>>>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>>>>>>
>>>>>>>>>>> Simon (and all),
>>>>>>>>>>>
>>>>>>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>>>>>>>>>> directly, for user convenience. Profiling shows (IIUC) that it
>>>>>>>>>>> isn't dispatch, but x being copied. Is there a way to prevent
>>>>>>>>>>> '[<-' from copying x?
>>>>>>>>>>
>>>>>>>>>> Good point, and conceptually, no. It's a subassignment after all
>>>>>>>>>> - see R-lang 3.4.4 - it is equivalent to
>>>>>>>>>>
>>>>>>>>>> `*tmp*` <- x
>>>>>>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>>>>>>> rm(`*tmp*`)
>>>>>>>>>>
>>>>>>>>>> so there is always a copy involved.
>>>>>>>>>>
>>>>>>>>>> Now, a conceptual copy doesn't mean a real copy in R since R tries
>>>>>>>>>> to keep the pass-by-value illusion while passing references in
>>>>>>>>>> cases where it knows that modifications cannot occur and/or they
>>>>>>>>>> are safe. The default subassign method uses that feature which
>>>>>>>>>> means it can afford to not duplicate if there is only one
>>>>>>>>>> reference -- then it's safe to not duplicate as we are replacing
>>>>>>>>>> that only existing reference. And in the case of a matrix, that
>>>>>>>>>> will be true at the latest from the second subassignment on.
>>>>>>>>>>
>>>>>>>>>> Unfortunately the method dispatch (AFAICS) introduces one more
>>>>>>>>>> reference in the dispatch chain so there will always be two
>>>>>>>>>> references so duplication is necessary. Since we have only 0 / 1
>>>>>>>>>> / 2+ information on the references, we can't distinguish whether
>>>>>>>>>> the second reference is due to the dispatch or due to the passed
>>>>>>>>>> object having more than one reference, so we have to duplicate in
>>>>>>>>>> any case. That is unfortunate, and I don't see a way around it
>>>>>>>>>> (unless we handle subassignment methods in some special way).
>>>>>>>>>
>>>>>>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>>>>>>> seems to confirm this, though I don't guarantee I did that right).
>>>>>>>>> The issue is that a replacement function implemented as a closure,
>>>>>>>>> which is the only option for a package, will always see NAMED on
>>>>>>>>> the object to be modified as 2 (because the value is obtained by
>>>>>>>>> forcing the argument promise) and so any R-level assignments will
>>>>>>>>> duplicate. This also isn't really an issue of imprecise reference
>>>>>>>>> counting -- there really are (at least) two legitimate references
>>>>>>>>> -- one through the argument and one through the caller's
>>>>>>>>> environment.
>>>>>>>>>
>>>>>>>>> It would be good if we could come up with a way for packages to be
>>>>>>>>> able to define replacement functions that do not duplicate in cases
>>>>>>>>> where we really don't want them to, but this would require coming up
>>>>>>>>> with some sort of protocol, minimally involving an efficient way to
>>>>>>>>> detect whether a replacement function is being called in a
>>>>>>>>> replacement context or directly.
>>>>>>>>
>>>>>>>> Would "$<-" always satisfy that condition? It would be a big help to
>>>>>>>> me if it could be designed to avoid duplicating the rest of the
>>>>>>>> data.frame.
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>>>
>>>>>>>>> There are some replacement functions that use C code to cheat, but
>>>>>>>>> these may create problems if called directly, so I won't advertise
>>>>>>>>> them.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> luke
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Simon
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Luke Tierney
>>>>>>>>> Statistics and Actuarial Science
>>>>>>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>>>>>>> University of Iowa Phone: 319-335-3386
>>>>>>>>> Department of Statistics and Fax: 319-335-3017
>>>>>>>>> Actuarial Science
>>>>>>>>> 241 Schaeffer Hall email: luke at stat.uiowa.edu
>>>>>>>>> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
>>>>>>>>>
>>>>>>>>> ______________________________________________
>>>>>>>>> R-devel at r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>>>
>>>>>>>> David Winsemius, MD
>>>>>>>> West Hartford, CT
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Luke Tierney
>>>> Statistics and Actuarial Science
>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>> University of Iowa Phone: 319-335-3386
>>>> Department of Statistics and Fax: 319-335-3017
>>>> Actuarial Science
>>>> 241 Schaeffer Hall email: luke at stat.uiowa.edu
>>>> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
>>>
>>>
>>>
>>
>>
>
>
>