[R] speed and looping issues; calculations on big datasets
martin sikora
martin.sikora at upf.edu
Sun Jul 1 01:32:43 CEST 2007
dear r users,
i'm a little stuck with the following problem(s), hopefully somebody
can offer some help:
i have data organized in a binary matrix, which can become quite big
like 60 rows x 10^5 columns (they represent SNP genotypes, for some
background info). what i need to do is the following:
let's suppose i have a matrix of size n x m. for each of the m
columns, i want to know the counts of unique rows extended one by one
from the "core" column, for both values at the "core" separately and
in both directions. maybe better explained with a little example.
data:
00 0 010
10 1 001
11 1 011
10 0 011
10 0 010
so the extended unique rows & counts taking e.g. column 3 as "core" are
(patterns are read outward from the core):
col 3 = 0:
  right:
    patterns / counts
    00 / 3
    001 / 3
    0010, 0011 / 2, 1
  left:
    patterns / counts
    00 / 3
    000, 001 / 1, 2
and that for the other subset ( col3 = 1) as well, then doing the
whole thing again for the next "core" column. the reason i need these
counts is that i want to calculate frequencies of the different
extended sequences to calculate the probability of drawing two
identical sequences from the core up to an extended position from the
whole set of sequences.
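a minimal sketch of the counting described above, applied to the example matrix with column 3 as core and core value 0. the helper name `pattern_counts` is hypothetical and only for illustration; it uses the plain paste/table idea, so it shows what is being counted rather than a fast way of doing it.

```r
# example matrix from above (5 sequences x 6 sites)
x <- matrix(c(0,0,0,0,1,0,
              1,0,1,0,0,1,
              1,1,1,0,1,1,
              1,0,0,0,1,1,
              1,0,0,0,1,0), nrow = 5, byrow = TRUE)

# hypothetical helper: counts of the unique patterns over columns `cols`
# (given in core-outward order) among the rows in `rows`
pattern_counts <- function(x, rows, cols) {
  keys <- apply(x[rows, cols, drop = FALSE], 1, paste, collapse = "")
  table(keys)
}

core  <- 3
rows0 <- which(x[, core] == 0)   # sequences carrying 0 at the core

# extend one column at a time, to the right and to the left of the core
right <- lapply((core + 1):ncol(x), function(j) pattern_counts(x, rows0, core:j))
left  <- lapply((core - 1):1,       function(j) pattern_counts(x, rows0, core:j))

right[[3]]   # patterns 0010 and 0011
left[[2]]    # patterns 000 and 001
```

the last two tables correspond to the "0010, 0011" and "000, 001" lines of the example.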
my main problem is speed of the calculations. i tried different ways
suggested here in the list of getting the counts of the unique rows,
all of them using the "table" function. both table( do.call( paste,
c( as.data.frame( mymatrix ) ) ) ) and table( apply( mymatrix , 1 ,
paste , collapse = "" ) ) work fine, but are too slow
for bigger matrices that i want to calculate (at least in my not very
sophisticated function). then i found a great suggestion here to do a
matrix multiplication with a vector of 2^(0:(ncol-1)) to convert each
row into a decimal number, and do table on those. this speeds up
things quite nicely, although the problem is that it of course does
not work as soon as i extend over more than 53 columns, because
doubles store integers exactly only up to 2^53 and the numbers get too
large to distinguish between a 0 and 1 at the smallest digit:
> 2^60+2 == 2^60
[1] TRUE
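one possible way around the precision limit, sketched here as a different technique and not as the method from the list: instead of encoding the whole extended row as one number, keep a small integer group id per row and refine it with each newly added column. the ids never exceed the number of rows, so nothing overflows however far the extension goes. the function name `extend_counts` and its arguments are hypothetical.

```r
# example matrix from above (5 sequences x 6 sites)
x <- matrix(c(0,0,0,0,1,0,
              1,0,1,0,0,1,
              1,1,1,0,1,1,
              1,0,0,0,1,1,
              1,0,0,0,1,0), nrow = 5, byrow = TRUE)

# doubles hold integers exactly only up to 2^53
limit_hit <- (2^53 + 1 == 2^53)

# hypothetical helper: counts of unique extended rows for one core column,
# one core allele and one direction, using incremental group ids
extend_counts <- function(x, core, allele, direction = c("right", "left")) {
  direction <- match.arg(direction)
  sub <- x[x[, core] == allele, , drop = FALSE]
  if (direction == "right") {
    cols <- core + seq_len(ncol(x) - core)
  } else {
    cols <- core - seq_len(core - 1)
  }
  id  <- rep(1L, nrow(sub))            # all rows share the core allele
  out <- vector("list", length(cols))
  for (k in seq_along(cols)) {
    key <- 2L * id + sub[, cols[k]]    # distinct for each (group, new bit) pair
    id  <- match(key, unique(key))     # relabel groups as 1..number of groups
    out[[k]] <- tabulate(id)           # counts, in order of first appearance
  }
  out
}

right0 <- extend_counts(x, core = 3, allele = 0, direction = "right")
left0  <- extend_counts(x, core = 3, allele = 0, direction = "left")
```

each extension step only touches one column, so the work per step is proportional to the number of rows rather than to the length of the pattern. note that the counts come back in order of first appearance, not sorted by pattern.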
another thing is that so far i could not come up with an idea on how
or if it is possible to do this without the loops i am using, one
large loop for each column in turn as core, and then another loop
within that extends the rows by growing column numbers. since i am
not the best of programmers (and still quite new to R), i was hoping
that somebody has some advice on doing these calculations in a more
elegant and, more importantly, faster way.
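a sketch of how the two loops could be organised, under several assumptions: the probability of identity is taken as the sum of squared pattern frequencies (drawing with replacement) within the subset sharing the core value, and `max_ext` is a hypothetical cap on how far to extend. the names `identity_prob` and `scan_cores` are made up for illustration. the loops are still there, but each inner step is a cheap vectorised operation on one column.

```r
# example matrix from above (5 sequences x 6 sites)
x <- matrix(c(0,0,0,0,1,0,
              1,0,1,0,0,1,
              1,1,1,0,1,1,
              1,0,0,0,1,1,
              1,0,0,0,1,0), nrow = 5, byrow = TRUE)

# hypothetical helper: walk outward over `cols` and return, for each step,
# the sum of squared frequencies of the extended patterns (assumption:
# two sequences drawn with replacement from the rows of `sub`)
identity_prob <- function(sub, cols) {
  id  <- rep(1L, nrow(sub))
  out <- numeric(length(cols))
  for (k in seq_along(cols)) {
    key <- 2L * id + sub[, cols[k]]
    id  <- match(key, unique(key))
    out[k] <- sum((tabulate(id) / length(id))^2)
  }
  out
}

# hypothetical driver: every column as core, both core values, both directions
scan_cores <- function(x, max_ext = ncol(x)) {
  lapply(seq_len(ncol(x)), function(core) {
    right <- core + seq_len(min(ncol(x) - core, max_ext))
    left  <- core - seq_len(min(core - 1, max_ext))
    res <- list()
    for (allele in 0:1) {
      sub <- x[x[, core] == allele, , drop = FALSE]
      if (nrow(sub) == 0) next          # core value absent in this column
      res[[as.character(allele)]] <- list(
        right = identity_prob(sub, right),
        left  = identity_prob(sub, left))
    }
    res
  })
}

res <- scan_cores(x)
res[[3]][["0"]]$right   # core = column 3, core value 0, extending right
```

with a cap on the extension, the total work grows with rows x columns x `max_ext` instead of rows x columns^2, which matters for matrices with 10^5 columns.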
just to get the idea, the approach with the matrix multiplication
takes 20s for a 60 x 220 matrix on my macbook pro, which is obviously
not perfect, considering i would like to use this function for
matrices of size 10^2 x 10^5 or even more.
so i would be very thankful for any ideas, suggestions etc to improve
this
cheers
martin