[R] calculating DNA mismatch distributions for large populations
Allan Strand
stranda at cofc.edu
Sat Oct 6 19:45:28 CEST 2001
Hi all,
I am interested in calculating and displaying the distributions of
pairwise comparisons between DNA sequences in populations. The
comparisons are the number of nucleotide sites that differ between the
two sequences (mismatches). My sequences are stored in a vector of
strings. There is an additional vector of the same length that
provides the indices to the DNA sequences. Finally, I have output
from table() that gives the distribution of indices in the population.
It looks something like this:
seqindex:   1, 2, 3, 4, 5
sequences:  "ATG", "ATC", "AAT", "CTT", "GTC"

table() output of seqindex frequencies:
 1  2  3  4  5
 6 10  5  1  7
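In R terms, the toy example amounts to something like this (the object
names are just for illustration):

seqindex  <- 1:5
sequences <- c("ATG", "ATC", "AAT", "CTT", "GTC")
## frequencies of each sequence index in the population, as table() output
freqs <- table(rep(seqindex, c(6, 10, 5, 1, 7)))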
The sequences are usually around 100 characters long (rather than 3),
there can be several hundred different unique sequences, and the
frequencies of sequence indices may total many thousands (often
30,000).
Right now, I am trying the following approach:
1) I create a 'distance matrix' of mismatches, calculated with the
following approach (states() returns a list of two vectors: sl$aindex
holds the indices and sl$state holds the sequences):
sl <- states(lnum, Rland)      # sl$aindex = indices, sl$state = sequences
n  <- length(sl[[1]])
rmat <- matrix(0, nrow = n, ncol = n)
for (i in 1:n)
  for (j in i:n) {
    if (i != j) {
      vi <- strsplit(sl$state[[i]], NULL)[[1]]
      vj <- strsplit(sl$state[[j]], NULL)[[1]]
      rmat[j, i] <- length(vi) - sum(vi == vj)   # number of mismatching sites
      rmat[i, j] <- rmat[j, i]                   # fill in the symmetric entry
    }
  }
rmat
This works pretty well, although it is not blindingly fast.
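(Splitting each sequence only once, outside the loops, would presumably
cut down on the strsplit() calls; an untested sketch using the same sl
object as above is below. The real bottleneck is part 2, though.)

n     <- length(sl$state)
## one row per sequence, one column per nucleotide site
chmat <- t(sapply(sl$state, function(s) strsplit(s, NULL)[[1]]))
rmat  <- matrix(0, nrow = n, ncol = n)
for (i in 1:n)
  for (j in i:n)
    if (i != j)
      rmat[i, j] <- rmat[j, i] <- sum(chmat[i, ] != chmat[j, ])
rmat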
2) I have then tried the following approach to get the distribution of
mismatches for the entire population: I concatenate X copies of the row
of rmat corresponding to a particular sequence index, using
"rep(rmat[seqindex,],X)", where X is the frequency of that sequence
index (and sequence) in the population. I repeat this for every
sequence index present in the population, and my plan was then to run
table() on the resulting vector. As I suspected, trying this with real
data brought my 450 MHz / 128 MB box to its knees, and I never had the
patience to wait for the finished product; the memory usage is simply
too large.
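To make the expansion concrete, what I am doing is roughly this
(illustrative names; "freqs" is the table() of sequence-index
frequencies, assumed to be in the same order as the rows of rmat):

big <- numeric(0)
for (k in 1:nrow(rmat)) {
  ## append freqs[[k]] copies of row k, i.e. rep(rmat[seqindex,], X)
  big <- c(big, rep(rmat[k, ], freqs[[k]]))
}
mismatch.dist <- table(big)        # this is where the memory blows up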
My question is: what is a more efficient (in terms of memory and/or
speed) way to solve this problem? Clearly, this partially brain-dead
biologist has not found it. I think the question really comes down to
finding an efficient way to solve part 2. One possibility that occurred
to me is to use table() on each row of rmat and then multiply the
resulting table by the number of times that sequence is found in the
population. I am not clear on how to combine the resulting tables for
all sequences, however, nor on whether this would be an improvement on
my first approach, although it does seem that it could improve memory
usage.
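Something like the untested sketch below is roughly what I have in mind
(illustrative names; "freqs" as above, assumed to be in the same order
as the rows of rmat), with the combining step being the part I am least
sure about:

## accumulate the weighted per-row tables into one vector of counts,
## indexed by the number of mismatches (0 .. max(rmat))
counts <- numeric(max(rmat) + 1)
names(counts) <- 0:max(rmat)
for (k in 1:nrow(rmat)) {
  tk <- table(rmat[k, ]) * freqs[[k]]          # per-row table, weighted by frequency
  counts[names(tk)] <- counts[names(tk)] + tk  # add into the running totals by name
}
counts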
Thanks for any suggestions,
a.
BTW, version info:

platform i386-unknown-freebsd4.4
arch     i386
os       freebsd4.4
system   i386, freebsd4.4
status
major    1
minor    3.1
year     2001
month    08
day      31
language R
--
Allan Strand, Biology http://linum.cofc.edu
College of Charleston Ph. (843) 953-8085
Charleston, SC 29424 Fax (843) 953-5453