[Rd] [datatable-help] speeding up perception

David Winsemius dwinsemius at comcast.net
Wed Jul 6 03:01:52 CEST 2011


On Jul 5, 2011, at 7:18 PM, <luke-tierney at uiowa.edu> wrote:

> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>
>>
>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>
>>> Simon (and all),
>>>
>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>>> copying x?
>>
>> Good point, and conceptually, no. It's a subassignment after all -  
>> see R-lang 3.4.4 - it is equivalent to
>>
>> `*tmp*` <- x
>> x <- `[<-`(`*tmp*`, i, j, value)
>> rm(`*tmp*`)
>>
>> so there is always a copy involved.
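
A minimal sketch of the expansion quoted above, assuming a plain numeric matrix and the default (internal) `[<-` method. The object names x and y are illustrative only. It checks that the usual syntax and the hand-written `*tmp*` form give the same result; it does not measure whether a physical copy was made.

```r
# Two matrices with identical contents.
x <- matrix(0, nrow = 2, ncol = 2)
y <- x

# Usual subassignment syntax.
x[1, 2] <- 5

# The same update spelled out as in the R Language Definition.
`*tmp*` <- y
y <- `[<-`(`*tmp*`, 1, 2, value = 5)
rm(`*tmp*`)

identical(x, y)  # TRUE
```

Whether either form triggers an actual duplication depends on the reference bookkeeping discussed in this thread and on the R version in use.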
>>
>> Now, a conceptual copy doesn't mean real copy in R since R tries to  
>> keep the pass-by-value illusion while passing references in cases  
>> where it knows that modifications cannot occur and/or they are  
>> safe. The default subassign method uses that feature which means it  
>> can afford to not duplicate if there is only one reference -- then  
>> it's safe to not duplicate as we are replacing that only existing  
>> reference. And in the case of a matrix, that will be true at the  
>> latest from the second subassignment on.
>>
>> Unfortunately the method dispatch (AFAICS) introduces one more  
>> reference in the dispatch chain so there will always be two  
>> references so duplication is necessary. Since we have only 0 / 1 /  
>> 2+ information on the references, we can't distinguish whether the  
>> second reference is due to the dispatch or due to the passed object  
>> having more than one reference, so we have to duplicate in any  
>> case. That is unfortunate, and I don't see a way around (unless we  
>> handle subassignment methods in some special way).
>
> I don't believe dispatch is bumping NAMED (and a quick experiment
> seems to confirm this though I don't guarantee I did that right). The
> issue is that a replacement function implemented as a closure, which
> is the only option for a package, will always see NAMED on the object
> to be modified as 2 (because the value is obtained by forcing the
> argument promise) and so any R level assignments will duplicate.  This
> also isn't really an issue of imprecise reference counting -- there
> really are (at least) two legitimate references -- one through the
> argument and one through the caller's environment.
>
> It would be good if we could come up with a way for packages to be
> able to define replacement functions that do not duplicate in cases
> where we really don't want them to, but this would require coming up
> with some sort of protocol, minimally involving an efficient way to
> detect whether a replacement function is being called in a replacement
> context or directly.
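
To make the two calling contexts concrete, here is a small sketch. The replacement function `second<-` and the objects a, b and a2 are hypothetical, not taken from any package. It is an ordinary closure, so the assignment inside it works on the value obtained from the argument, and a direct call must leave the caller's object untouched.

```r
# Hypothetical replacement function written as a closure,
# the only form available to package code.
`second<-` <- function(x, value) {
  x[2] <- value   # R-level assignment inside the closure
  x
}

a <- c(1, 2, 3)
b <- c(1, 2, 3)

# Replacement context: R rewrites this as a <- `second<-`(a, value = 10).
second(a) <- 10

# Direct call: b itself must stay unmodified.
a2 <- `second<-`(b, 20)

a   # 1 10  3
b   # 1  2  3
a2  # 1 20  3
```

The direct call is the case that makes in-place modification unsafe: if the function altered its argument without duplicating, b would change as well. Where R was built with memory profiling, tracemem() can be used to see whether a duplication is reported; what it shows will depend on the R version.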

Would "$<-" always satisfy that condition? It would be a big help to me
if it could be designed to avoid duplicating the rest of the data.frame.

-- 

>
> There are some replacement functions that use C code to cheat, but
> these may create problems if called directly, so I won't advertise
> them.
>
> Best,
>
> luke
>
>>
>> Cheers,
>> Simon
>>
>>
>>
>
> -- 
> Luke Tierney
> Statistics and Actuarial Science
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>   Actuarial Science
> 241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

David Winsemius, MD
West Hartford, CT


