[R] millions of comparisons, speed wanted

Liaw, Andy andy_liaw at merck.com
Thu Dec 15 19:57:37 CET 2005


Just some untested idea:

If the data are all 0/1, you could use dist(input, method="manhattan"), and
then check which entry equals 1.  This should be much faster than creating
all pairs of rows and check position-by-position.

HTH,
Andy

From: Adrian DUSA
> 
> Dear all,
> 
> I have a 10 columns matrix which has 2^10=1024 unique rows, 
> with values 0 and 
> 1.
> What I would like to do is a little complicated; in a simple 
> statement, for a 
> subset (say 1000 rows) I want to perform pairwise comparisons 
> between all 
> rows, find out which rows differ by only one  value, replace 
> that value by 
> "x", get rid of the comparisons that differ by more than one 
> value and repeat 
> the algorithm until no further minimization is possible. Any 
> row that hasn't 
> been minimized is kept for the next iteration.
> 
> For 1,000 rows, there are almost 500,000 pairs, but in the 
> next iterations I 
> could get some 5,000 rows which generates something like 12.5 
> millions pairs, 
> and that takes a _very_ long time.
> 
> The code I have created (10 lines, below) is super fast 
> (using vectorization) 
> but only for a reasonable number of rows. I am searching for:
> - ways to improve my code (if possible)
> - ideas: create a C code for the slow parts of the code? use 
> MySQL? other 
> ways?
> 
> As a toy example, having an input matrix called "input", my 
> algorithm looks 
> like this:
> 
> ## code start
> ncolumns <- 6
> input <- bincombinations(ncolumns) # from package e1071
> # subset, let's say 97% of rows
> input <- input[sample(2^ncolumns, round(2^ncolumns*0.97, 0), ]
> minimized <- 1
> 
> while (sum(minimized) > 0) {
> 
>    minimized <- logical(nrow(input))
> 
>    to.be.compared <- combn2(1:nrow(input)) # from package combinat
>    
>    # the following line takes _a lot_ of time, for millions 
> of comparisons
>    logical.result <- apply(to.be.compared, 1, function(idx) 
> input[idx[1], ] == 
> input[idx[2], ])
> 
>    compare.minimized <- which(colSums(!logical.result) == 1)
> 
>    logical.result <- logical.result[, compare.minimized]
> 
>    result <- sapply(compare.minimized, function(idx) 
> input[to.be.compared[idx, 
> 1], ])
> 
>    result[!logical.result] <- "x"
> 
>    
> minimized[unique(as.vector(to.be.compared[compare.minimized, 
> ]))] <- TRUE
> 
>    if (sum(minimized) > 0) {
>       input <- rbind(input[!minimized, ], unique(t(result)))
>    }
> }
> ## code end
> 
> Any suggestion is welcomed, thank you very much in advance.
> Adrian
> 
> -- 
> Adrian DUSA
> Romanian Social Data Archive
> 1, Schitu Magureanu Bd
> 050025 Bucharest sector 5
> Romania
> Tel./Fax: +40 21 3126618 \
>           +40 21 3120210 / int.101
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
> 
>




More information about the R-help mailing list