[R] loop over large dataset

Liaw, Andy andy_liaw at merck.com
Fri Jul 1 14:12:12 CEST 2005


My suggestion is that you try to vectorize the computation as much as you
can.

From what you've shown, `new' and `ped' need to have the same number of
rows, right?

Your `off' function seems to be randomly choosing between columns 1 and 2
from its two input matrices (one row each?).  You may want to do the
sampling all at once instead of looping over the rows.  E.g.,

> (m <- matrix(1:10, ncol=2))
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> (colSample <- sample(1:2, nrow(m), replace=TRUE))
[1] 1 1 2 1 1
> (x <- m[cbind(1:nrow(m), colSample)])
[1] 1 2 8 4 5
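
(Indexing a matrix with a two-column matrix, as in `m[cbind(i, j)]',
picks out the elements m[i[1], j[1]], m[i[2], j[2]], ..., so the whole
per-row column choice happens in one vectorized step instead of a loop.)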

So you might want to do something like (obviously untested):

todo <- ped[,3] * ped[,5] != 0  ## indicator of which rows to work on
n.todo <- sum(todo)  ## how many are there?
sire <- new[ped[todo, 3], ]  ## haplotype pairs of the sires
dam <- new[ped[todo, 5], ]   ## haplotype pairs of the dams
## pick one of the two columns per row via matrix indexing, as above
s.gam <- sire[cbind(1:nrow(sire), sample(1:2, nrow(sire), replace=TRUE))]
d.gam <- dam[cbind(1:nrow(dam), sample(1:2, nrow(dam), replace=TRUE))]
new[todo, 1:2] <- cbind(s.gam, d.gam)
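
Here's a quick sanity check on a made-up one-generation pedigree (I'm
assuming, from your loop, that column 3 holds the sire's row number and
column 5 the dam's, with 0 for founders; the founder haplotype values
are arbitrary):

set.seed(1)
## toy pedigree: rows 1-2 are founders, rows 3-5 are their offspring
ped <- cbind(1:5, 0, c(0, 0, 1, 1, 1), 0, c(0, 0, 2, 2, 2), 0)
new <- matrix(0, nrow(ped), 2)
new[1, ] <- c(101, 102)  ## founder haplotypes
new[2, ] <- c(201, 202)

todo <- ped[, 3] * ped[, 5] != 0
sire <- new[ped[todo, 3], , drop = FALSE]
dam <- new[ped[todo, 5], , drop = FALSE]
s.gam <- sire[cbind(1:nrow(sire), sample(1:2, nrow(sire), replace = TRUE))]
d.gam <- dam[cbind(1:nrow(dam), sample(1:2, nrow(dam), replace = TRUE))]
new[todo, 1:2] <- cbind(s.gam, d.gam)
new

One caveat: this fills all eligible rows from the parents' current
haplotypes in a single pass, whereas your loop updates `new' as it goes,
so later offspring can inherit from parents filled in earlier in the
same loop. If your pedigree is sorted by generation, you could apply the
vectorized step once per generation and still come out far ahead of the
row-by-row loop.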

Andy


> From: Federico Calboli
> 
> Hi All,
> 
> I'd like to ask for a few clarifications. I am doing some calculations
> over some biggish datasets. One has ~23000 rows and 6 columns, the
> other has ~620000 rows and 6 columns.
> 
> I am using these datasets to perform a simulation of haplotype
> coalescence over a pedigree (the datasets themselves are pedigree
> information). I created a new dataset (same number of rows as the
> pedigree dataset, 2 columns) and I use a looping function to assign
> haplotypes according to a standard biological reproductive process
> (i.e. meiosis, sexual reproduction).
> 
> My code is something like:
> 
>   off = function(sire, dam){ # simulation of reproduction, two inds
>     sch.toll = round(runif(1, min = 1, max = 2)) # pick haplotype 1 or 2
>     dch.toll = round(runif(1, min = 1, max = 2))
>     s.gam = sire[, sch.toll]
>     d.gam = dam[, dch.toll]
>     offspring = cbind(s.gam, d.gam)
>     offspring # return the simulated offspring pair
>   }
> 
> for (i in 1:dim(new)[1]){
>   if(ped[i,3] != 0 & ped[i,5] != 0){
>     zz = off(as.matrix(t(new[ped[i,3],])), as.matrix(t(new[ped[i,5],])))
>     new[i,1] = zz[1]
>     new[i,2] = zz[2]
>   }
> }
> 
> I am also attributing a generation index to each row with a trivial
> calculation:
> 
> for(i in atres){
>   genz[i] = (genz[ped[i,3]] + genz[ped[i,5]])/2 + 1
>   #print(genz[i])
> }
> 
> My question, then: on the 23000-row dataset the calculations take
> about 5 minutes. On the 620000-row one I killed the process after ~24
> hours, and the job was not finished. Why such an immense discrepancy
> in execution times (the code is the same, and the datasets are stored
> in two separate .RData files)?
> 
> Any light would be appreciated.
> 
> Federico
> 
> PS I am running R 2.1.0 on Debian Sarge, on a dual 3 GHz Xeon machine
> with 2 GB of RAM. The R process uses 99% of the CPU, but hardly any
> RAM, from what I gather from top.
> 
> 
> 
> -- 
> Federico C. F. Calboli
> Department of Epidemiology and Public Health
> Imperial College, St Mary's Campus
> Norfolk Place, London W2 1PG
> 
> Tel  +44 (0)20 7594 1602     Fax (+44) 020 7594 3193
> 
> f.calboli [.a.t] imperial.ac.uk
> f.calboli [.a.t] gmail.com
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html