[R] fast way to compare two matrices of combinations
Mark W Kimpel
mwkimpel at gmail.com
Fri Mar 14 01:49:28 CET 2008
Thanks to all for their suggestions. I apologize for not supplying a
self-contained example; I should not post questions when I'm on my way
out the door.
Martin's suggestion should work, but I need to put it on our
high-performance system next week. On my local 64-bit Linux box with 4GB
of RAM it blew up when a vector reached 2.6GB.
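For a rough sense of scale, the sizes given in the original question quoted below (750 list elements, up to ~40 ids each, median 15) bound the number of triplets that would ever have to be held at once. This is only a back-of-the-envelope sketch: the upper bound assumes every element has 40 ids, which overstates the real data, and the variable names are illustrative.

```r
n.elements <- 750   # length of the list, as stated in the question
max.len    <- 40    # largest element length, as stated
median.len <- 15    # median element length, as stated

per.max    <- choose(max.len, 3)     # 9880 triplets from one 40-id element
per.median <- choose(median.len, 3)  # 455 triplets from one 15-id element

upper.bound <- n.elements * per.max     # 7410000, a loose upper bound
typical     <- n.elements * per.median  # 341250, if every element were median-sized
```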
I may also get something to work using Charles' suggestion to use R's
intrinsic table functions. I initially could not see how to do this with
a vector of 3 elements, but I believe I can if I sort each vector, to
obviate effects of order, and paste its elements together into one unique string.
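A minimal sketch of that sort-and-paste idea, run on toy data standing in for the real sig.tf.pairs. The helper name triplet.keys, the toy gene ids, and the "|" separator are assumptions made for illustration, not anything taken from the thread. A separator is used so that differently split ids cannot collapse to the same string; it should be a character that never occurs in the ids.

```r
# Toy stand-in for sig.tf.pairs (hypothetical data)
sig.tf.pairs <- list(
  c("g3", "g1", "g2", "g4"),
  c("g2", "g1", "g3"),
  c("g5", "g1"),              # fewer than M ids: contributes no triplets
  c("g4", "g2", "g1", "g3")
)
M <- 3  # 3 for triplets, etc.

# One string key per M-combination of a single element's ids.
triplet.keys <- function(ids, m = M, sep = "|") {
  if (length(ids) < m) return(character(0))
  # Sorting first makes each key independent of the original order.
  combn(sort(ids), m, FUN = paste, collapse = sep)
}

# Tabulate the keys over the whole list.
all.keys <- unlist(lapply(sig.tf.pairs, triplet.keys))
triplet.counts <- sort(table(all.keys), decreasing = TRUE)

# Recover the triplets as a matrix, one row per distinct triplet.
triplet.matrix <- do.call(rbind,
                          strsplit(names(triplet.counts), "|", fixed = TRUE))
```

With this toy list, "g1|g2|g3" is counted three times and each of the other three triplets twice. Because only one string per triplet is stored, the work grows with the total number of triplets rather than with the number of pairwise comparisons between them.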
Once I get something that works and is as optimized as I can make it,
I'll post it for future reference and for suggestions on further optimization.
Mark
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)
mwkimpel<at>gmail<dot>com
******************************************************************
Charles C. Berry wrote:
> On Thu, 13 Mar 2008, Mark W Kimpel wrote:
>
>> I have a list (length 750), each element containing a vector of unique
>> strings (unique gene ids), with length up to ~40 (median 15). I want to
>> compile a matrix of all possible triplets and their frequency within
>> gene elements. Using combn and a lot of looping, I am accomplishing this
>> but it is VERY slow.
>>
>> I've tried to figure out a way to vectorize this, using "match" and
>> "%in%", but can't get my mind around it.
>>
>> Below is my code. sig.tf.pairs is the list. Suggestions?
>
> First, be sure that your code does what you really intend for it to do.
>
> Does this really do what you wanted?
>
> if (length(intersect(triplets[,m], all.triplets[,k] == M))){
>
> If so, then why does the first line below never produce an error?
>
> count.vec <- count.vec[,-redundant.vec]
>
> is.null(dim(count.vec)) ## TRUE
>
> You are basically tabulating. Use the functions that are built for that.
>
> It looks like what you want is along these lines:
>
> tab.combns <- function(x) apply( combn( sort(x), M ),2,
> function(x) paste(x,collapse=''))
>
> tab.all <- table( unlist( lapply(sig.tf.pairs,tab.combns) ) )
>
> Chuck
>>
>> Mark
>>
>>
>> ############################################################
>> M <- 3 # 3 for triplets, etc.
>> ##########################################################
>> # count all triplets
>> all.triplets <- NULL
>> all.count.vec <- NULL
>> for (i in 1:length(sig.tf.pairs)){
>>   if (length(sig.tf.pairs[[i]] >= M)){
>>     triplets <- combn(sig.tf.pairs[[i]], M, simplify = TRUE)
>>     for (j in 1:ncol(triplets)){
>>       o <- order(triplets[,j])
>>       triplets[,j] <- triplets[o,j]
>>       count.vec <- rep(1, ncol(triplets))
>>     }
>>     if (is.null(all.count.vec)){
>>       all.count.vec <- count.vec
>>       all.triplets <- triplets
>>     } else {
>>       redundant.vec <- NULL
>>       for (k in 1:ncol(all.triplets)){
>>         for (m in 1:ncol(triplets)){
>>           if (length(intersect(triplets[,m], all.triplets[,k] == M))){
>>             all.count.vec[k] <- all.count.vec[k] + 1
>>             redundant.vec <- c(redundant.vec, m)
>>           }
>>         }
>>       }
>>       if(!is.null(redundant.vec)){
>>         triplets <- triplets[,-redundant.vec]
>>         count.vec <- count.vec[,-redundant.vec]
>>       }
>>       all.triplets <- cbind(all.triplets, triplets)
>>       all.count.vec <- c(all.count.vec, count.vec)
>>     }
>>   }
>> }
>> ###################################
>>
>
> Charles C. Berry (858) 534-2098
> Dept of Family/Preventive
> Medicine
> E mailto:cberry at tajo.ucsd.edu UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
>
>
>