[R] loop over large dataset

Federico Calboli f.calboli at imperial.ac.uk
Mon Jul 4 12:23:12 CEST 2005


In my absentmindedness I'd forgotten to CC this to the list... and BTW,
calling gc() inside the loop actually increases the runtime.
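
A minimal illustration of that overhead (my own sketch, not part of the
original code; the loop body is just a placeholder): the same trivial loop
timed with and without an explicit gc() call on every iteration.

n <- 1000
x <- numeric(n)

t.plain <- system.time(
  for (i in 1:n) x[i] <- sqrt(i)              ## plain loop
)
t.gc <- system.time(
  for (i in 1:n) { x[i] <- sqrt(i); gc() }    ## same loop, full GC every iteration
)

t.plain["elapsed"]
t.gc["elapsed"]   ## typically far larger: each gc() call is a full collection pass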


>> My suggestion is that you try to vectorize the computation as much as
>> you can.
>>
>> From what you've shown, `new' and `ped' need to have the same number
>> of rows, right?
>>
>> Your `off' function seems to be randomly choosing between columns 1
>> and 2 from its two input matrices (one row each?).  You may want to
>> do the sampling all at once instead of looping over the rows.  E.g.,
>>
>>
>>
>>> (m <- matrix(1:10, ncol=2))
>>      [,1] [,2]
>> [1,]    1    6
>> [2,]    2    7
>> [3,]    3    8
>> [4,]    4    9
>> [5,]    5   10
>>
>>
>>> (colSample <- sample(1:2, nrow(m), replace=TRUE))
>> [1] 1 1 2 1 1
>>
>>
>>> (x <- m[cbind(1:nrow(m), colSample)])
>> [1] 1 2 8 4 5
>>
>> So you might want to do something like (obviously untested):
>>
>> todo <- ped[,3] * ped[,5] != 0  ## indicator of which rows to work on
>> n.todo <- sum(todo)  ## how many are there?
>> sire <- new[ped[todo, 3], ]
>> dam <- new[ped[todo, 5], ]
>> s.gam <- sire[cbind(1:nrow(sire), sample(1:2, nrow(sire), replace=TRUE))]
>> d.gam <- dam[cbind(1:nrow(dam), sample(1:2, nrow(dam), replace=TRUE))]
>> new[todo, 1:2] <- cbind(s.gam, d.gam)
>>
>>
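
For reference, here is a small self-contained toy version of the suggested
approach (my own reconstruction, using the cbind() indexing fix above; the
pedigree layout, column 3 = sire and column 5 = dam with 0 meaning unknown,
is only inferred from the snippet):

## toy pedigree: individuals 1-2 are founders, 3-5 have sire 1 and dam 2
ped <- cbind(1:5, 0, c(0, 0, 1, 1, 1), 0, c(0, 0, 2, 2, 2))
## two allele columns per individual; offspring rows start empty
new <- rbind(c(11, 12), c(21, 22), c(0, 0), c(0, 0), c(0, 0))

todo  <- ped[, 3] * ped[, 5] != 0            ## rows with both parents known
sire  <- new[ped[todo, 3], , drop = FALSE]   ## sires' allele rows
dam   <- new[ped[todo, 5], , drop = FALSE]   ## dams' allele rows
s.gam <- sire[cbind(1:nrow(sire), sample(1:2, nrow(sire), replace = TRUE))]
d.gam <- dam[cbind(1:nrow(dam), sample(1:2, nrow(dam), replace = TRUE))]
new[todo, 1:2] <- cbind(s.gam, d.gam)
new   ## rows 3-5 now hold one allele from the sire and one from the dam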
>
> Improving the efficiency of the code is obviously a plus, but the
> real thing I am mesmerised by is the sheer increase in runtime...
> why doesn't it grow linearly with the dataset size? (A timing sketch
> follows below the quoted message.)
>
> Cheers,
>
> Federico
>
> --
> Federico C. F. Calboli
> Department of Epidemiology and Public Health
> Imperial College, St. Mary's Campus
> Norfolk Place, London W2 1PG
>
> Tel +44 (0)20 75941602   Fax +44 (0)20 75943193
>
> f.calboli [.a.t] imperial.ac.uk
> f.calboli [.a.t] gmail.com
>
>
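
One way to pin that down (a hedged sketch of mine, not code from the thread;
the per-row work is just a placeholder for the real update): time the same
loop on nested subsets of the data and check whether the elapsed time doubles
when the number of rows doubles.

sizes <- c(1000, 2000, 4000, 8000)
times <- sapply(sizes, function(n) {
  new <- matrix(0, nrow = n, ncol = 2)
  system.time(
    for (i in 1:n) new[i, ] <- new[i, ] + 1   ## placeholder for the real per-row work
  )["elapsed"]
})
cbind(rows = sizes, seconds = times)

If the cost per row is constant, the seconds should roughly double as the
rows double; a faster-than-linear increase points at per-iteration costs that
grow with object size (for example, R copying a whole object each time it is
modified).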



