[R] fast way to compare two matrices of combinations

Mark W Kimpel mwkimpel at gmail.com
Fri Mar 14 01:49:28 CET 2008


Thanks to all for their suggestions. I apologize for not supplying a 
self-contained example; I should not post questions when I'm on the way 
out the door.

Martin's suggestion should work, but I need to put it on our 
high-performance system next week. On my local 64-bit Linux box with 4GB 
of RAM it blew up when a vector reached 2.6GB.
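
As a rough check on problem size, the figures from the quoted message below (750 vectors, up to about 40 ids each, median 15) give a loose bound on the number of triplets via choose(). This is back-of-envelope arithmetic only, with illustrative variable names; it says nothing about the memory use of any particular approach.

```r
n_sets  <- 750   # number of list elements, from the quoted message
max_len <- 40    # largest vector length (approximate)
med_len <- 15    # median vector length

choose(max_len, 3)            # triplets from one vector of 40 ids: 9880
choose(med_len, 3)            # triplets from one vector of 15 ids: 455
n_sets * choose(max_len, 3)   # loose upper bound on triplets overall: 7410000
```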

I may also get something to work using Charles' suggestion to use R's 
intrinsic table functions. I initially could not see how to do this with 
a vector of 3 elements, but I believe I can if I sort each vector, to 
remove the effect of order, and paste its elements together to make one 
unique string.
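
A minimal sketch of that sort-and-paste idea, for reference. The helper name triplet_keys, the toy list gene_sets, and the "|" separator are illustrative assumptions, not code from this thread; the separator is there so that adjacent ids cannot run together into an ambiguous key.

```r
# Hypothetical toy data standing in for sig.tf.pairs
gene_sets <- list(
  c("g3", "g1", "g2", "g4"),
  c("g2", "g1", "g3"),
  c("g5", "g1")              # fewer than M ids: contributes no triplets
)

M <- 3  # 3 for triplets, etc.

# Return one order-independent key per M-combination of the ids in x
triplet_keys <- function(x, m = M) {
  if (length(x) < m) return(character(0))
  # Sorting first means every combination comes out in sorted order,
  # so the same set of ids always yields the same key.
  combn(sort(x), m, FUN = paste, collapse = "|")
}

# Tabulate the keys across all list elements
triplet_counts <- table(unlist(lapply(gene_sets, triplet_keys)))
```

With this toy input, the triplet g1, g2, g3 occurs in the first two vectors and so is counted twice; every other key is counted once.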

Once I get something that works and is as optimized as I can make it, 
I'll post it for future reference and for suggestions on further optimization.

Mark

Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)

mwkimpel<at>gmail<dot>com

******************************************************************


Charles C. Berry wrote:
> On Thu, 13 Mar 2008, Mark W Kimpel wrote:
> 
>> I have a list (length 750), each element containing a vector of unique
>> strings (unique gene ids), with length up to ~40 (median 15). I want to
>> compile a matrix of all possible triplets and their frequency within
>> gene elements. Using combn and a lot of looping, I am accomplishing this
>> but it is VERY slow.
>>
>> I've tried to figure out a way to vectorize this, using "match" and
>> "%in%", but can't get my mind around it.
>>
>> Below is my code. sig.tf.pairs is the list. Suggestions?
> 
> First, be sure that your code does what you really intend for it to do.
> 
> Does this really do what you wanted?
> 
>       if (length(intersect(triplets[,m], all.triplets[,k] == M))){
> 
> If so, then why does the first line below never produce an error?
> 
>      count.vec <- count.vec[,-redundant.vec]
> 
>     is.null(dim(count.vec)) ## TRUE
> 
> You are basically tabulating. Use the functions that are built for that.
> 
> It looks like what you want is along these lines:
> 
>     tab.combns <- function(x) apply( combn( sort(x), M ),2,
>                                     function(x) paste(x,collapse=''))
> 
>     tab.all <- table( unlist( lapply(sig.tf.pairs,tab.combns) ) )
> 
> Chuck
>>
>> Mark
>>
>>
>> ############################################################
>> M <- 3 # 3 for triplets, etc.
>> ##########################################################
>> # count all triplets
>> all.triplets <- NULL
>> all.count.vec <- NULL
>> for (i in 1:length(sig.tf.pairs)){
>>   if (length(sig.tf.pairs[[i]] >= M)){
>>     triplets <- combn(sig.tf.pairs[[i]], M, simplify = TRUE)
>>     for (j in 1:ncol(triplets)){
>>       o <- order(triplets[,j])
>>       triplets[,j] <- triplets[o,j]
>>       count.vec <- rep(1, ncol(triplets))
>>     }
>>     if (is.null(all.count.vec)){
>>       all.count.vec <- count.vec
>>       all.triplets <- triplets
>>     } else {
>>       redundant.vec <- NULL
>>       for (k in 1:ncol(all.triplets)){
>>         for (m in 1:ncol(triplets)){
>>           if (length(intersect(triplets[,m], all.triplets[,k] == M))){
>>             all.count.vec[k] <- all.count.vec[k] + 1
>>             redundant.vec <- c(redundant.vec, m)
>>           }
>>         }
>>       }
>>       if(!is.null(redundant.vec)){
>>         triplets <- triplets[,-redundant.vec]
>>         count.vec <- count.vec[,-redundant.vec]
>>       }
>>       all.triplets <- cbind(all.triplets, triplets)
>>       all.count.vec <- c(all.count.vec, count.vec)
>>     }
>>   }
>> }
>> ###################################
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu            UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
> 
> 
>
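
The two conditions questioned in the quoted reply above can be checked on toy data. This is only an illustration with hypothetical ids (a, b, and x are made up here); the "likely intent" lines are a guess at what the original code meant, not a confirmed fix.

```r
M <- 3
a <- c("g1", "g2", "g3")   # hypothetical sorted triplet
b <- c("g1", "g2", "g9")   # shares only two ids with a

# As posted: the comparison sits inside intersect(), so b == M is a
# logical vector and its intersection with a is empty here.
length(intersect(a, b == M))   # 0, so the if() branch is skipped

# Likely intent: compare the size of the intersection with M.
length(intersect(a, b)) == M   # FALSE: only two ids in common
length(intersect(a, a)) == M   # TRUE: identical triplets

# Same pattern in the length check: length() of a logical vector is
# just the vector's length, so if() sees it as TRUE for any non-empty x.
x <- c("g1", "g2")
length(x >= M)                 # 2
length(x) >= M                 # FALSE: too short to form a triplet
```

Because the first condition evaluates to 0 in cases like this, redundant.vec stays NULL and the count.vec[,-redundant.vec] line is never reached, which is consistent with it never raising an error.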


