[Rd] [datatable-help] speeding up perception
Matthew Dowle
mdowle at mdowle.plus.com
Tue Jul 12 15:12:06 CEST 2011
Simon,
If you didn't install.packages() with type="source" from R-Forge, that
would explain (some of) it. R-Forge builds binaries once each night. This
commit was long after the cutoff.
Matthew
>> Matthew,
>>
>> I was hoping I misunderstood your first proposal, but I suspect I did not ;).
>>
>> Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to
>> evaluate to
>> { V1 <- 3; DT[1, V1] }
>> thus returning the first element of the third column.
>
> Please see FAQ 1.1, since further below it seems to be an expectation
> issue about 'with' syntax, too.
>
>>
>> That said, I don't think it works, either. Taking your example and
>> data.table from R-Forge:
> [ snip ]
>> as you can see, DT is not modified.
>
> Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce
> the non-working state I'll need some more environment information please.
>
>> Also I suspect there is something quite amiss because even trivial things
>> don't work:
>>
>>> DF[1:4,1:4]
>>   V1 V2 V3 V4
>> 1  3  1  1  1
>> 2  1  1  1  1
>> 3  1  1  1  1
>> 4  1  1  1  1
>>> DT[1:4,1:4]
>> [1] 1 2 3 4
>
> That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9
> and 1.10.
>
>>
>> When I first saw your proposal, I thought you rather had something like
>> within(DT, V1[1] <- 3)
>> in mind, which looks innocent enough but performs terribly (note that I had
>> to scale down the loop by a factor of 100!!!):
>>
>>> system.time(for (i in 1:10) within(DT, V1[1] <- 3))
>> user system elapsed
>> 2.701 4.437 7.138
>
> No, since 'with' is already built into data.table, I was thinking of
> building 'within' in, too. I'll take a look at within(). Might as well
> provide as many options as possible to the user to use as they wish.
>
>> With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3)
>> performs reasonably:
>>
>>> system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
>> user system elapsed
>> 0.392 0.613 1.003
>>
>> (Note: system.time() can be misleading when within() is involved, because
>> the expression is evaluated in a different environment so within() won't
>> actually change the object in the global environment - it also interacts
>> with the possible duplication)
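
A minimal sketch of the point in Simon's note, assuming a small hypothetical data.frame (not the 100000 x 100 table from the benchmark): within() returns a modified copy, so the object in the calling environment stays as it was unless the result is assigned back.

```r
# Hypothetical small table; column names follow the thread's V1, V2 convention.
DF <- data.frame(V1 = c(1, 1, 1), V2 = c(1, 1, 1))

# within() evaluates the expression against the data and returns a modified copy.
res <- within(DF, V1[1] <- 3)

unchanged <- DF$V1[1]   # the original DF has not been touched
updated   <- res$V1[1]  # the returned copy carries the new value

# To keep the change, the result has to be assigned back.
DF <- within(DF, V1[1] <- 3)
```

So a loop that calls within() without assigning the result measures the cost of building the modified copy each time, but leaves the original table as it was.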
>
> Noted, thanks. That's pretty fast. Does within() on data.frame fix the
> original issue Ivo raised, then? If so, job done.
>
>>
>> Cheers,
>> Simon
>>
>> On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:
>>
>>> Thanks for the replies and info. An attempt at fast
>>> assign is now committed to data.table v1.6.3 on
>>> R-Forge. From NEWS :
>>>
>>> o Fast update is now implemented, FR#200.
>>> DT[i,j]<-value is now handled by data.table in C rather
>>> than falling through to data.frame methods.
>>>
>>> Thanks to Ivo Welch for raising speed issues on r-devel,
>>> to Simon Urbanek for the suggestion, and Luke Tierney and
>>> Simon for information on R internals.
>>>
>>> [<- syntax still incurs one working copy of the whole
>>> table (as of R 2.13.0) due to R's [<- dispatch mechanism
>>> copying to `*tmp*`, so, for ultimate speed and brevity,
>>> 'within' syntax is now available as follows.
>>>
>>> o A new 'within' argument has been added to [.data.table,
>>> by default TRUE. It is very similar to the within()
>>> function in base R. If an assignment appears in j, it
>>> assigns to the column of DT, by reference; e.g.,
>>>
>>> DT[i,colname<-value]
>>>
>>> This syntax makes no copies of any part of memory at all.
>>>
>>>> m = matrix(1,nrow=100000,ncol=100)
>>>> DF = as.data.frame(m)
>>>> DT = as.data.table(m)
>>>> system.time(for (i in 1:1000) DF[1,1] <- 3)
>>> user system elapsed
>>> 287.730 323.196 613.453
>>>> system.time(for (i in 1:1000) DT[1,V1 <- 3])
>>> user system elapsed
>>> 1.152 0.004 1.161 # 528 times faster
>>>
>>> Please note :
>>>
>>> *******************************************************
>>> ** Within syntax is presently highly experimental. **
>>> *******************************************************
>>>
>>> http://datatable.r-forge.r-project.org/
>>>
>>>
>>> On Wed, 2011-07-06 at 09:08 -0500, luke-tierney at uiowa.edu wrote:
>>>> On Wed, 6 Jul 2011, Simon Urbanek wrote:
>>>>
>>>>> Interesting, and I stand corrected:
>>>>>
>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>> .Internal(inspect(x))
>>>>> @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>> @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>> @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>> x[1,1]=42L
>>>>>> .Internal(inspect(x))
>>>>> @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>> @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>> @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>> x[[1]][1]=42L
>>>>>> .Internal(inspect(x))
>>>>> @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
>>>>> @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>> @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>> x[[1]][1]=42L
>>>>>> .Internal(inspect(x))
>>>>> @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>> @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>> @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>
>>>>> I have R to release ;) so I won't be looking into this right now, but
>>>>> it's something worth investigating ... Since all the inner contents
>>>>> have NAMED=0 I would not expect any duplication to be needed, but
>>>>> apparently it becomes so at some point ...
>>>>
>>>>
>>>> The internals assume in various places that deep copies are made (one
>>>> of the reasons NAMED settings are not propagated to sub-structure).
>>>> The main issues are avoiding cycles and that there is no easy way to
>>>> check for sharing. There may be some circumstances in which a shallow
>>>> copy would be OK but making sure it would be in all cases is probably
>>>> more trouble than it is worth at this point. (I've tried this in the
>>>> past in a few cases and always had to back off.)
>>>>
>>>>
>>>> Best,
>>>>
>>>> luke
>>>>
>>>>>
>>>>> Cheers,
>>>>> Simon
>>>>>
>>>>>
>>>>> On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
>>>>>
>>>>>>
>>>>>> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>>>>>>> No subassignment function satisfies that condition, because you can
>>>>>>> always call them directly. However, that doesn't stop the default
>>>>>>> method from making that assumption, so I'm not sure it's an issue.
>>>>>>>
>>>>>>> David, Just to clarify - the data frame content is not copied, we
>>>>>>> are talking about the vector holding columns.
>>>>>>
>>>>>> If it is just the vector holding the columns that is copied (and not the
>>>>>> columns themselves), why does n make a difference in this test (on R
>>>>>> 2.13.0)?
>>>>>>
>>>>>>> n = 1000
>>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>> user system elapsed
>>>>>> 0.628 0.000 0.628
>>>>>>> n = 100000
>>>>>>> x = data.frame(a=1:n,b=1:n) # still 2 columns, but longer columns
>>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>> user system elapsed
>>>>>> 20.145 1.232 21.455
>>>>>>>
>>>>>>
>>>>>> With $<- :
>>>>>>
>>>>>>> n = 1000
>>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>> user system elapsed
>>>>>> 0.304 0.000 0.307
>>>>>>> n = 100000
>>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>> user system elapsed
>>>>>> 37.586 0.388 38.161
>>>>>>>
>>>>>>
>>>>>> If it's because the 1st column needs to be copied (only) because that's
>>>>>> the one being assigned to (in this test), that magnitude of slowdown
>>>>>> doesn't seem consistent with the time of a vector copy of the 1st
>>>>>> column :
>>>>>>
>>>>>>> n=100000
>>>>>>> v = 1:n
>>>>>>> system.time(for (i in 1:1000) v[1] <- 42L)
>>>>>> user system elapsed
>>>>>> 0.016 0.000 0.017
>>>>>>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>>>>>> user system elapsed
>>>>>> 1.816 1.076 2.900
>>>>>>
>>>>>> Finally, increasing the number of columns, again only the 1st is
>>>>>> assigned to :
>>>>>>
>>>>>>> n=100000
>>>>>>> x = data.frame(rep(list(1:n),100))
>>>>>>> dim(x)
>>>>>> [1] 100000 100
>>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>> user system elapsed
>>>>>> 167.974 50.903 219.711
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Simon
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 5, 2011, at 7:18 PM, <luke-tierney at uiowa.edu> wrote:
>>>>>>>>
>>>>>>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>>>>>>
>>>>>>>>>>> Simon (and all),
>>>>>>>>>>>
>>>>>>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>>>>>>>>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>>>>>>>>>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>>>>>>>>>>> copying x?
>>>>>>>>>>
>>>>>>>>>> Good point, and conceptually, no. It's a subassignment after all
>>>>>>>>>> - see R-lang 3.4.4 - it is equivalent to
>>>>>>>>>>
>>>>>>>>>> `*tmp*` <- x
>>>>>>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>>>>>>> rm(`*tmp*`)
>>>>>>>>>>
>>>>>>>>>> so there is always a copy involved.
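
The expansion above can be tried by hand. This is only an illustrative sketch on a small hypothetical data.frame; it shows that the two forms give the same result, not how R implements the assignment internally.

```r
x <- data.frame(a = 1:5, b = 1:5)
y <- x

# Ordinary subassignment syntax.
x[1, 1] <- 42L

# The same assignment written out as the three steps from R-lang 3.4.4.
`*tmp*` <- y
y <- `[<-`(`*tmp*`, 1, 1, value = 42L)
rm(`*tmp*`)

same <- identical(x, y)  # both routes produce the same data.frame
```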
>>>>>>>>>>
>>>>>>>>>> Now, a conceptual copy doesn't mean real copy in R since R tries
>>>>>>>>>> to keep the pass-by-value illusion while passing references in
>>>>>>>>>> cases where it knows that modifications cannot occur and/or they
>>>>>>>>>> are safe. The default subassign method uses that feature which
>>>>>>>>>> means it can afford to not duplicate if there is only one
>>>>>>>>>> reference -- then it's safe to not duplicate as we are replacing
>>>>>>>>>> that only existing reference. And in the case of a matrix, that
>>>>>>>>>> will be true at the latest from the second subassignment on.
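
The pass-by-value behaviour described above can be seen from the R level with a small sketch (the vector names are hypothetical). It only demonstrates the semantics; whether and when R physically copies depends on the R version and its reference tracking.

```r
v  <- 1:10
v2 <- v        # a second name for the same values
v2[1] <- 42L   # modifying v2 must not be visible through v

first_v  <- v[1]    # v keeps its original first element
first_v2 <- v2[1]   # v2 holds the new value
```

This is the same pattern as the {v2=v; v2[1] <- 42L} timing elsewhere in the thread, where the second reference is what forces the copy.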
>>>>>>>>>>
>>>>>>>>>> Unfortunately the method dispatch (AFAICS) introduces one more
>>>>>>>>>> reference in the dispatch chain so there will always be two
>>>>>>>>>> references so duplication is necessary. Since we have only 0 / 1
>>>>>>>>>> / 2+ information on the references, we can't distinguish whether
>>>>>>>>>> the second reference is due to the dispatch or due to the passed
>>>>>>>>>> object having more than one reference, so we have to duplicate in
>>>>>>>>>> any case. That is unfortunate, and I don't see a way around
>>>>>>>>>> (unless we handle subassignment methods in some special way).
>>>>>>>>>
>>>>>>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>>>>>>> seems to confirm this though I don't guarantee I did that right). The
>>>>>>>>> issue is that a replacement function implemented as a closure, which
>>>>>>>>> is the only option for a package, will always see NAMED on the object
>>>>>>>>> to be modified as 2 (because the value is obtained by forcing the
>>>>>>>>> argument promise) and so any R level assignments will duplicate. This
>>>>>>>>> also isn't really an issue of imprecise reference counting -- there
>>>>>>>>> really are (at least) two legitimate references -- one through the
>>>>>>>>> argument and one through the caller's environment.
>>>>>>>>>
>>>>>>>>> It would be good if we could come up with a way for packages to be
>>>>>>>>> able to define replacement functions that do not duplicate in cases
>>>>>>>>> where we really don't want them to, but this would require coming up
>>>>>>>>> with some sort of protocol, minimally involving an efficient way to
>>>>>>>>> detect whether a replacement function is being called in a replacement
>>>>>>>>> context or directly.
>>>>>>>>
>>>>>>>> Would "$<-" always satisfy that condition? It would be a big help to
>>>>>>>> me if it could be designed to avoid duplicating the rest of the
>>>>>>>> data.frame.
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>>>
>>>>>>>>> There are some replacement functions that use C code to cheat, but
>>>>>>>>> these may create problems if called directly, so I won't advertise
>>>>>>>>> them.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> luke
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Simon
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Luke Tierney
>>>>>>>>> Statistics and Actuarial Science
>>>>>>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>>>>>>> University of Iowa Phone: 319-335-3386
>>>>>>>>> Department of Statistics and Fax: 319-335-3017
>>>>>>>>> Actuarial Science
>>>>>>>>> 241 Schaeffer Hall email: luke at stat.uiowa.edu
>>>>>>>>> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
>>>>>>>>> ______________________________________________
>>>>>>>>> R-devel at r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>>>
>>>>>>>> David Winsemius, MD
>>>>>>>> West Hartford, CT
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Luke Tierney
>>>> Statistics and Actuarial Science
>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>> University of Iowa Phone: 319-335-3386
>>>> Department of Statistics and Fax: 319-335-3017
>>>> Actuarial Science
>>>> 241 Schaeffer Hall email: luke at stat.uiowa.edu
>>>> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
>>>
>>>
>>>
>>
>>
>
>