[R] Efficiently sum 3d table

Bert Gunter gunter.berton at gene.com
Tue Apr 17 00:31:19 CEST 2012


That _is_ interesting. Reduce() calls the function it is given (here "+")
at the interpreted level, so I would not have expected it to win. Can you
check whether most of the time for my "vectorized" version is spent in the
do.call(cbind, ...) part? That would be my guess. Otherwise this sounds
strange, since rowSums() is specifically built for speed, or so its help
page says. I am also assuming that z is as I constructed it.
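
One way to check this is to time the two steps separately. What follows is
only a sketch, assuming z is built as in my example quoted below; the names
t_bind, t_sum and t_red are just illustrative, and the timings will vary
with machine and R version.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
z <- lapply(seq_len(10000), function(i) array(runif(24), dim = 2:4))

## Step 1: bind the 10000 arrays into one 24 x 10000 matrix
t_bind <- system.time(m <- do.call(cbind, z))
## Step 2: sum across the columns of that matrix
t_sum <- system.time(s <- rowSums(m))
## For comparison: the Reduce() approach
t_red <- system.time(r <- Reduce(`+`, z))

print(c(cbind   = t_bind[["elapsed"]],
        rowSums = t_sum[["elapsed"]],
        Reduce  = t_red[["elapsed"]]))

## Both routes should agree up to floating-point rounding
ans2 <- array(s, dim = 2:4)
all.equal(ans2, r)
```

If the cbind step accounts for most of the elapsed time, the cost is in
building the big matrix rather than in rowSums() itself.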

-- Bert



On Mon, Apr 16, 2012 at 3:01 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Apr 16, 2012, at 4:32 PM, Bert Gunter wrote:
>
>> David:
>>
>> Here is a comparison of the gains to be made by vectorization (again,
>> assuming I have interpreted your query correctly)
>>
>> ## create a list of arrays
>> > z <- lapply(seq_len(10000),function(i)array(runif(24),dim=2:4))
>>
>> ## Using an apply type approach
>> > system.time(ans1 <- array(do.call(mapply,c(sum,z)),dim=2:4))
>>    user  system elapsed
>>    0.62    0.00    0.62
>>
>> ## vectorizing via rowSums and cbind
>> > system.time(ans2 <- array(rowSums(do.call(cbind,z)),dim=2:4))
>>    user  system elapsed
>>    0.02    0.00    0.02
>>
>> > identical(ans1,ans2)
>> [1] TRUE
>>
>
> It's also an example of how different OSes may perform differently. My
> Mac (an early 2008 model) is nowhere near as efficient with the second
> solution, despite being in the same ballpark with the first:
>
> > system.time(ans1 <- array(do.call(mapply,c(sum,z)),dim=2:4))
>    user  system elapsed
>   0.841   0.007   0.851
> > system.time(ans2 <- array(rowSums(do.call(cbind,z)),dim=2:4))
>    user  system elapsed
>   0.132   0.003   0.145
>
> And on my system the Reduce strategy is fastest:
>
> > system.time(ans3 <- Reduce("+", z) )
>    user  system elapsed
>   0.129   0.001   0.134
>
> The Reduce() strategy would also preserve other object attributes, such
> as class and dimnames, something I'm quite sure the re-dimensioned result
> of rowSums(cbind(.)) could not do.
>
>  L <- list( table(a, sample(a)) ,
>            table(a, sample(a)),
>            table(a, sample(a)),
>            table(a, sample(a)),
>            table(a, sample(a)) )
>
>  str(Reduce("+", L) )
>  'table' int [1:3, 1:3] 1 1 3 4 0 1 0 4 1
>  - attr(*, "dimnames")=List of 2
>  ..$ a: chr [1:3] "a" "b" "c"
>  ..$  : chr [1:3] "a" "b" "c"
>
>  str( array(rowSums(do.call(cbind,L)),dim=c(3,3))  )
>  num [1:3, 1:3] 5 5 5 5 5 5 5 5 5
>
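
The attribute point above can be checked with a small self-contained
sketch. It rests on one assumption: `a` is not defined in the quoted code,
so it is taken here to be letters[1:3], which matches the printed dimnames.
The sketch also suggests why the second str() output is all 5s: cbind()
binds 2-d tables side by side rather than flattening them, so each table
has to be flattened first to get cell-wise sums.

```r
set.seed(1)              # arbitrary seed, only for reproducibility
a <- letters[1:3]        # assumption: `a` is not defined in the post
L <- replicate(5, table(a, sample(a)), simplify = FALSE)

## Reduce() keeps class, dim and dimnames
r1 <- Reduce(`+`, L)
class(r1)

## cbind() treats 2-d tables as matrices, giving a 3 x 15 matrix, so
## rowSums() returns only 3 sums, which array() then recycles
naive <- array(rowSums(do.call(cbind, L)), dim = c(3, 3))

## Flattening each table first gives cell-wise sums; the attributes
## are restored by hand
flat <- do.call(cbind, lapply(L, as.vector))
r2 <- array(rowSums(flat), dim = dim(L[[1]]), dimnames = dimnames(L[[1]]))

all(r1 == r2)
```

With 3-d arrays, as in z, cbind() already treats each array as a plain
vector, which is why the flattening step was not needed there.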
>
> -- David.
>
>
>> Cheers,
>> Bert
>>
>>
>>
>> On Mon, Apr 16, 2012 at 1:19 PM, David A Vavra <davavra at verizon.net> wrote:
>>>
>>> Thanks Bill,
>>>
>>> For reasons that aren't important here, I must start from a list.
>>> Computing the sum while generating the tables may be a solution but it
>>> means doing something in one piece of code that is unrelated to the
>>> surrounding code. Bad practice where I'm from. If it's needed it's
>>> needed but if I can avoid doing so, I will.
>>>
>>> I haven't done any timing but because of the extra operations of get and
>>> assign, the non-loop implementation will likely suffer. It seems you have
>>> shown this to be true.
>>>
>>> DAV
>>>
>>> -----Original Message-----
>>> From: William Dunlap [mailto:wdunlap at tibco.com]
>>> Sent: Monday, April 16, 2012 3:26 PM
>>> To: David A Vavra; 'Bert Gunter'
>>> Cc: r-help at r-project.org
>>> Subject: RE: [R] Effeciently sum 3d table
>>>
>>>
>>>
>>>> Example in partial code:
>>>>
>>>> Env <- CreatEnv() # my own function
>>>> Assign('final',T1-T1,envir=env)
>>>> L<-listOfTables
>>>>
>>>> lapply(L,function(t) {
>>>>    final <- get('final',envir=env) + t
>>>>    assign('final',final,envir=env)
>>>>    NULL
>>>> })
>>>
>>> First, finish writing that code so it runs and you can make sure its
>>> output is ok:
>>>
>>> # list of 50,000 2x2 matrices
>>> L <- lapply(1:50000, function(i) array(i:(i+3), c(2,2)))
>>> env <- new.env()
>>> assign('final', L[[1]] - L[[1]], envir=env)
>>> junk <- lapply(L, function(t) {
>>>    final <- get('final', envir=env) + t
>>>    assign('final', final, envir=env)
>>>    NULL
>>> })
>>> get('final', envir=env)
>>> #            [,1]       [,2]
>>> # [1,] 1250025000 1250125000
>>> # [2,] 1250075000 1250175000
>>> > sum( (2:50001) ) # should be final[2,1]
>>> # [1] 1250075000
>>>
>>> You asked for something less "clunky".
>>> You are fighting the system by using get() and assign(); just use
>>> ordinary expression syntax to get and set variables:
>>>
>>> final <- L[[1]]
>>> for(i in seq_along(L)[-1]) final <- final + L[[i]]
>>> final
>>> #           [,1]       [,2]
>>> # [1,] 1250025000 1250125000
>>> # [2,] 1250075000 1250175000
>>>
>>> The former took 0.22 seconds on my machine, the latter 0.06.
>>>
>>> You don't have to compute the whole list of matrices before
>>> doing the sum; just add each matrix to the current sum as soon as
>>> you have computed it and then forget about it.
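
The running-sum idea above might look like the following. This is only a
sketch: make_matrix() is a hypothetical stand-in for whatever code produces
each table, here reusing the 2x2 test matrices from the example above.

```r
## make_matrix() is a hypothetical generator, not part of the thread's code
make_matrix <- function(i) array(i:(i + 3), c(2, 2))

n <- 50000
final <- make_matrix(1)
for (i in seq_len(n)[-1]) {
  ## add the new matrix to the running sum; it is not kept afterwards
  final <- final + make_matrix(i)
}
final
```

Only one matrix and the running total exist at any time, so the list of
50,000 matrices is never held in memory.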
>>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of David A Vavra
>>>> Sent: Monday, April 16, 2012 11:35 AM
>>>> To: 'Bert Gunter'
>>>> Cc: r-help at r-project.org
>>>> Subject: Re: [R] Effeciently sum 3d table
>>>>
>>>> Thanks Gunter,
>>>>
>>>> I mean what I think is the normal definition of 'sum' as in:
>>>>   T1 + T2 + T3 + ...
>>>> It never occurred to me that there would be a question.
>>>>
>>>> I have gotten the impression that a for loop is very inefficient.
>>>> Whenever I change them to lapply calls there is a noticeable
>>>> improvement in run time for whatever reason. The problem with lapply
>>>> here is that I effectively need a global table to hold the final sum.
>>>> lapply also wants to return a value.
>>>>
>>>> You may be correct that in the long run, the loop is the best. There's
>>>> a lot of extraneous memory wastage holding all of the tables in a list
>>>> as well as the return 'values'.
>>>>
>>>> As an alternative, and given a pre-existing list of tables, I was
>>>> thinking of creating a temporary environment to hold the final result
>>>> so it could be passed globally to each lapply execution level, but that
>>>> seems clunky and wasteful as well.
>>>>
>>>> Example in partial code:
>>>>
>>>> Env <- CreatEnv() # my own function
>>>> Assign('final',T1-T1,envir=env)
>>>> L<-listOfTables
>>>>
>>>> lapply(L,function(t) {
>>>>    final <- get('final',envir=env) + t
>>>>    assign('final',final,envir=env)
>>>>    NULL
>>>> })
>>>>
>>>> But I was hoping for a more elegant and hopefully more efficient
>>>> solution. Greg's suggestion of using Reduce seems in order but as yet
>>>> I'm unfamiliar with the function.
>>>>
>>>> DAV
>>>>
>>>> -----Original Message-----
>>>> From: Bert Gunter [mailto:gunter.berton at gene.com]
>>>> Sent: Monday, April 16, 2012 12:42 PM
>>>> To: Greg Snow
>>>> Cc: David A Vavra; r-help at r-project.org
>>>> Subject: Re: [R] Effeciently sum 3d table
>>>>
>>>> Define "sum". Do you mean you want to get a single sum for each
>>>> array? -- get marginal sums for each array? -- get a single array in
>>>> which each value is the sum of all the individual values at that
>>>> position?
>>>>
>>>> Due thought and consideration for those trying to help by formulating
>>>> your query carefully and concisely vastly increases the chance of
>>>> getting a useful answer. See the posting guide -- this is a skill that
>>>> needs to be learned and the guide is quite helpful. And I must
>>>> acknowledge that it is a skill that I also have not yet mastered.
>>>>
>>>> Concerning your query, I would only note that the two responses from
>>>> Greg and Petr that you received are unlikely to be significantly
>>>> faster than just using loops, since both are still essentially looping
>>>> at the interpreted level. Whether either gives you what you want, I do
>>>> not know.
>>>>
>>>> -- Bert
>>>>
>>>> On Mon, Apr 16, 2012 at 8:53 AM, Greg Snow <538280 at gmail.com> wrote:
>>>>> Look at the Reduce function.
>>>>>
>>>>> On Mon, Apr 16, 2012 at 8:28 AM, David A Vavra <davavra at verizon.net> wrote:
>>>>>> I have a large number of 3d tables that I wish to sum.
>>>>>> Is there an efficient way to do this? Or perhaps a function I can
>>>>>> call?
>>>>>>
>>>>>> I tried using do.call("sum",listoftables) but that returns a single
>>>>>> value.
>>>>>>
>>>>>> So far, it seems only a loop will do the job.
>>>>>>
>>>>>> TIA,
>>>>>> DAV
>>>>
>>>> --
>>>>
>>>> Bert Gunter
>>>> Genentech Nonclinical Biostatistics
>>>>
>>>> Internal Contact Info:
>>>> Phone: 467-7374
>>>> Website:
>>>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>
>
> David Winsemius, MD
> West Hartford, CT
>



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm


