[Rd] [datatable-help] speeding up perception
David Winsemius
dwinsemius at comcast.net
Wed Jul 6 03:01:52 CEST 2011
On Jul 5, 2011, at 7:18 PM, <luke-tierney at uiowa.edu> wrote:
> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>
>>
>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>
>>> Simon (and all),
>>>
>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>>> copying x?
>>
>> Good point, and conceptually, no. It's a subassignment after all -
>> see R-lang 3.4.4 - it is equivalent to
>>
>> `*tmp*` <- x
>> x <- `[<-`(`*tmp*`, i, j, value)
>> rm(`*tmp*`)
>>
>> so there is always a copy involved.
>>
>> Now, a conceptual copy doesn't mean a real copy in R since R tries to
>> keep the pass-by-value illusion while passing references in cases
>> where it knows that modifications cannot occur and/or they are
>> safe. The default subassign method uses that feature which means it
>> can afford to not duplicate if there is only one reference -- then
>> it's safe to not duplicate as we are replacing that only existing
>> reference. And in the case of a matrix, that will be true at the
>> latest from the second subassignment on.
>>
>> Unfortunately the method dispatch (AFAICS) introduces one more
>> reference in the dispatch chain so there will always be two
>> references so duplication is necessary. Since we have only 0 / 1 /
>> 2+ information on the references, we can't distinguish whether the
>> second reference is due to the dispatch or due to the passed object
>> having more than one reference, so we have to duplicate in any
>> case. That is unfortunate, and I don't see a way around (unless we
>> handle subassignment methods in some special way).
>
> I don't believe dispatch is bumping NAMED (and a quick experiment
> seems to confirm this though I don't guarantee I did that right). The
> issue is that a replacement function implemented as a closure, which
> is the only option for a package, will always see NAMED on the object
> to be modified as 2 (because the value is obtained by forcing the
> argument promise) and so any R level assignments will duplicate. This
> also isn't really an issue of imprecise reference counting -- there
> really are (at least) two legitimate references -- one through the
> argument and one through the caller's environment.
>
> It would be good if we could come up with a way for packages to be
> able to define replacement functions that do not duplicate in cases
> where we really don't want them to, but this would require coming up
> with some sort of protocol, minimally involving an efficient way to
> detect whether a replacement function is being called in a replacement
> context or directly.
Would "$<-" always satisfy that condition? It would be a big help to me
if it could be designed to avoid duplicating the rest of the data.frame.
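
For anyone who wants to check this on their own build, here is a minimal
sketch. `second<-` is a hypothetical replacement function made up for
illustration; it is not part of R or data.table. The sketch shows that the
replacement form is sugar for a call to the closure, that a second
reference to the original object is left untouched, and that tracemem()
can be used to look for duplications in "$<-" on a data.frame. Whether
tracemem() reports a copy depends on the R version and build, so no
output is claimed here.

```r
# Hypothetical replacement function written as a closure, which is the
# only option available to package code.
`second<-` <- function(x, value) {
  x[2] <- value   # R-level assignment inside the closure
  x
}

x <- c(1, 2, 3)
y <- x                      # a second reference to the same vector

second(x) <- 10             # sugar for: x <- `second<-`(x, value = 10)

z <- `second<-`(c(1, 2, 3), value = 10)   # direct call, same result

# tracemem() reports duplications only when R is built with memory
# profiling, so guard it to keep the sketch runnable everywhere.
if (capabilities("profmem")) {
  df <- data.frame(a = 1:3, b = 4:6)
  tracemem(df)
  df$a <- 7:9               # any copies are printed as tracemem[...] lines
  untracemem(df)
}
```

After running this, x holds the modified vector while y still holds the
original values, which is the pass-by-value behaviour the replacement
mechanism has to preserve.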
--
>
> There are some replacement functions that use C code to cheat, but
> these may create problems if called directly, so I won't advertise
> them.
>
> Best,
>
> luke
>
>>
>> Cheers,
>> Simon
>>
>>
>>
>
> --
> Luke Tierney
> Statistics and Actuarial Science
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa Phone: 319-335-3386
> Department of Statistics and Fax: 319-335-3017
> Actuarial Science
> 241 Schaeffer Hall email: luke at stat.uiowa.edu
> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
David Winsemius, MD
West Hartford, CT