[R] speed and looping issues; calculations on big datasets

Mon Jul 2 13:17:12 CEST 2007

I don't fully understand what your objective is here, but I would try a 
combination of cut and grep in a shell to see if it works. For example, 
if your data were saved as a tab-delimited file and you have some 
predefined patterns to look for, then try the untested code below

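  # data.txt is a placeholder name for your tab-delimited file
  cut -f3-6 data.txt | tr -d ' \t' > tmp
  grep "^00" tmp | wc -l >> rightA
  grep "^001" tmp | wc -l >> rightB
  grep -E "^0010|^0011" tmp | wc -l >> rightC

  cut -f1-3 data.txt | tr -d ' \t' > tmp
  # matched as stored (left to right), so "001" read outward from the core is "100" here
  grep "00$" tmp | wc -l > leftA
  grep -E "000$|100$" tmp | wc -l > leftB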

Then you have to write a loop and generalise the code. You can try this 
in bash or perl, or rewrite it in C.
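
As a rough sketch of what such a loop could look like in plain sh: the 
function name count_patterns, its arguments and the demo file are all made 
up for illustration, and it assumes one 0/1 value per tab-separated field. 
It tallies whatever patterns occur with sort and uniq -c instead of 
grepping for predefined ones.

```shell
#!/bin/sh
# count_patterns FILE CORE VALUE STEP   (hypothetical helper)
#   FILE  tab-delimited file, one 0/1 value per field
#   CORE  1-based index of the core column
#   VALUE keep only rows where the core column equals this (0 or 1)
#   STEP  1 to extend to the right, -1 to extend to the left
count_patterns() {
    file=$1; core=$2; value=$3; step=$4
    ncol=$(head -n 1 "$file" | awk -F'\t' '{ print NF }')
    end=$((core + step))
    while [ "$end" -ge 1 ] && [ "$end" -le "$ncol" ]; do
        echo "core $core to column $end:"
        # Glue the fields from the core outward into one string per row,
        # then count how often each distinct string occurs.
        awk -F'\t' -v c="$core" -v e="$end" -v v="$value" -v s="$step" '
            $c == v {
                p = ""
                for (i = c; i != e + s; i += s) p = p $i
                print p
            }' "$file" | sort | uniq -c | awk '{ print $2, $1 }'
        end=$((end + step))
    done
}

# Demo on the five example rows from the original post.
demo=$(mktemp)
{
    printf '0\t0\t0\t0\t1\t0\n'
    printf '1\t0\t1\t0\t0\t1\n'
    printf '1\t1\t1\t0\t1\t1\n'
    printf '1\t0\t0\t0\t1\t1\n'
    printf '1\t0\t0\t0\t1\t0\n'
} > "$demo"
count_patterns "$demo" 3 0 1     # column 3 = 0, extending right
count_patterns "$demo" 3 0 -1    # column 3 = 0, extending left
rm -f "$demo"
```

Each output line is a pattern followed by its count, with patterns read 
outward from the core in both directions. Note that this rereads the file 
once per extension step, so it only shows the idea; for 10^5 columns 
you would want to keep the data in memory (perl or C) rather than loop 
like this.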

If you want more help, then provide more explanation of the types of 
pattern you are looking for. You might want to check the 
BioConductor packages as well.

Regards, Adai

martin sikora wrote:
> dear r users,
> 
> i'm a little stuck with the following problem(s), hopefully somebody  
> can offer some help:
> 
> i have data organized in a binary matrix, which can become quite big  
> like 60 rows x 10^5 columns (they represent SNP genotypes, for some  
> background info). what i need to do is the following:
> 
> let's suppose i have a matrix of size n x m. for each of the m  
> columns, i want to know the counts of unique rows extended one by one  
> from the "core" column, for both values at the "core" separately and  
> in both directions. maybe better explained with a little example.
> 
> data:
> 
> 00 0 010
> 10 1 001
> 11 1 011
> 10 0 011
> 10 0 010
> 
> so the extended unique rows & counts taking e.g. column 3 as "core" are:
> 
> col 3 = 0:
> right:
> patterns / counts
> 00 / 3
> 001 / 3
> 0010, 0011 / 2,1
> 
> left:
> 00 / 3
> 000,001 / 1,2
> 
> and that for the other subset ( col3 = 1) as well, then doing the  
> whole thing again for the next "core" column. the reason i need these
> counts is that i want to calculate frequencies of the different
> extended sequences to calculate the probability of drawing two  
> identical sequences from the core up to an extended position from the  
> whole set of sequences.
> 
> my main problem is speed of the calculations. i tried different ways
> suggested here in the list of getting the counts of the unique rows,
> all of them using the "table" function. both
> table( do.call( paste, c( as.data.frame( mymatrix ) ) ) ) and
> table( apply( mymatrix, 1, paste, collapse = "" ) ) work fine, but are
> too slow for the bigger matrices that i want to calculate (at least in
> my not very sophisticated function). then i found a great suggestion
> here to do a matrix multiplication with a vector of 2^(0:(ncol-1)) to
> convert each row into a decimal number, and do table on those. this
> speeds up things quite nicely, although the problem is that it of
> course does not work as soon as i extend for more than 60 columns,
> because the decimal numbers get too large to accurately distinguish
> between a 0 and 1 at the smallest digit:
> 
>  > 2^60+2 == 2^60
> [1] TRUE
> 
> another thing is that so far i could not come up with an idea on how  
> or if it is possible to do this without the loops i am using, one  
> large loop for each column in turn as core, and then another loop  
> within that extends the rows by growing column numbers. since i am  
> not the best of programmers (and still quite new to R), i was hoping
> that somebody has some advice on doing these calculations in a more
> elegant and, more importantly, faster way.
> just to get the idea, the approach with the matrix multiplication  
> takes 20s for a 60 x 220 matrix on my macbook pro, which is obviously  
> not perfect, considering i would like to use this function for  
> matrices of size 10^2 x 10^5 or even more.
> 
> so i would be very thankful for any ideas, suggestions etc to improve  
> this
> 
> cheers
> martin