[R] Comparing matrices in R - matrixB %in% matrixA

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Fri Oct 31 16:27:50 CET 2014


Since both of you seem to have misinterpreted my response, consider the 
following for clarification:

> A <- matrix(1:1000, 1000, 10)
> B <- A[1:100, ]
> # my recommended solution
> t1 <- system.time({match(as.data.frame(t(B)), as.data.frame(t(A)))})
> # similar to John's recommended solution
> t2 <- system.time({
+   AA <- as.list(as.data.frame(t(A)))
+   BB <- as.list(as.data.frame(t(B)))
+   which( AA %in% BB )
+ })
> # my suggested incremental improvement to the original loops
> t3 <- system.time({
+   lresult <- rep( NA, nrow(A) )
+   for ( ia in seq.int( nrow( A ) ) ) {
+     lres <- FALSE
+     ib <- 0
+     while ( ib < nrow( B ) && !lres ) {
+       ib <- ib + 1
+       lres <- all( A[ ia, ] == B[ ib, ] )
+     }
+     lresult[ ia ] <- lres
+   }
+   which( lresult )
+ })
> # body of the original poster's nested-loop function
> t4 <- system.time({
+   res<-c()
+   rowsB = length(B[,1])
+   rowsA = length(A[,1])
+   colsB = length(B[1,])
+   colsA = length(A[1,])
+   for (i in 1:rowsB){
+     for (j in 1:colsB){
+       for (k in 1:rowsA){
+         for (l in 1:colsA){
+           if(A[k,l]==B[i,j]){res<-c(res,k)}
+         }
+       }
+     }
+   }
+   unique(sort(res))
+ })
> t1
    user  system elapsed
   0.022   0.000   0.020
> t2
    user  system elapsed
    0.02    0.00    0.02
> t3
    user  system elapsed
   0.748   0.000   0.746
> t4
    user  system elapsed
  16.612   0.016  16.636
> # data.frames are lists, but applying as.list seems to speed up the 
> # match for some reason
> t2[1]/t1[1]
user.self
0.9090909
> # intended comparison for learning purposes
> t4[1]/t3[1]
user.self
  22.20856
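
The t1 and t2 timings above differ by only a couple of milliseconds in a
single run, so their ratio may be mostly noise. One way to get a steadier
comparison, sketched here under that assumption, is to repeat each
expression several times and time the whole loop. The helper name
time_reps and the repetition count are made up for illustration and were
not part of the session above.

```r
A <- matrix(1:1000, 1000, 10)
B <- A[1:100, ]

# Hypothetical helper: elapsed seconds for n_rep calls of a
# zero-argument function f
time_reps <- function(f, n_rep = 20) {
  system.time(for (i in seq_len(n_rep)) f())[["elapsed"]]
}

# data frame columns passed to match() directly
elapsed_df <- time_reps(function() {
  match(as.data.frame(t(B)), as.data.frame(t(A)))
})

# the same columns converted to plain lists first
elapsed_list <- time_reps(function() {
  AA <- as.list(as.data.frame(t(A)))
  BB <- as.list(as.data.frame(t(B)))
  which(AA %in% BB)
})

print(c(data.frame = elapsed_df, list = elapsed_list))
```

The absolute numbers will vary by machine and R version, so only the
relative size of the two totals is worth reading.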

I recognize that the reference implementation does not need to be 
optimized, but the changes I suggested to it illustrate an incremental 
improvement toward "thinking in R" rather than the optimal solution.
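
As a small check that the two vectorized forms answer the original
question, the sketch below applies them to the toy matrices from the first
post. The matrix B2 and the variable names are added here for illustration
only. It also shows that the two forms answer slightly different
questions: match() reports, for each row of B, where that row sits in A,
while %in% followed by which() reports which rows of A occur somewhere
in B.

```r
# Toy matrices from the original question
A <- matrix(1:10, nrow = 5)
B <- A[-c(1, 2, 3), ]

# Transposing turns each row into a column, so every row becomes one
# list element that match() and %in% can compare as a whole.
AA <- as.list(as.data.frame(t(A)))
BB <- as.list(as.data.frame(t(B)))

# For each row of B: index of the first matching row of A (NA if none)
b_in_a <- match(BB, AA)
print(b_in_a)         # 4 5

# For each row of A: does it appear anywhere in B?
a_rows <- which(AA %in% BB)
print(a_rows)         # 4 5

# Illustrative extra case: the same rows of B in reverse order
B2 <- A[c(5, 4), ]
BB2 <- as.list(as.data.frame(t(B2)))
b2_in_a <- match(BB2, AA)
print(b2_in_a)        # 5 4, follows the row order of B2
a_rows2 <- which(AA %in% BB2)
print(a_rows2)        # 4 5, follows the row order of A
```

If B might have a single row, subsetting with drop = FALSE keeps it a
matrix, so that t() still yields one column per original row.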

On Fri, 31 Oct 2014, John Fox wrote:

> Dear Jeff,
>
> Out of curiosity, I compared your solution with the one I posted earlier this morning (when I was working on a slower computer, which accounts for the somewhat different timings for my solution):
>
> ------------ snip ----------
>
>> A <- matrix(1:10000, 10000, 10)
>> B <- A[1:1000, ]
>>
>> system.time({
> +    AA <- as.list(as.data.frame(t(A)))
> +    BB <- as.list(as.data.frame(t(B)))
> +    print(sum(AA %in% BB))
> +  })
> [1] 1000
>   user  system elapsed
>   0.14    0.01    0.16
>>
>>
>> system.time({
> +     lresult <- rep( NA, nrow(A) )
> +     for ( ia in seq.int( nrow( A ) ) ) {
> +         lres <- FALSE
> +         ib <- 0
> +         while ( ib < nrow( B ) && !lres ) {
> +             ib <- ib + 1
> +             lres <- all( A[ ia, ] == B[ ib, ] )
> +         }
> +         lresult[ ia ] <- lres
> +     }
> +     print(sum( lresult ))
> + })
> [1] 1000
>   user  system elapsed
>  45.76    0.01   45.77
>> 46/0.16
> [1] 287.5
>
> ------------ snip ----------
>
> So the solution using nested loops is more than 2 orders of magnitude slower for this problem. Of course, for a one-off problem, depending on its size, the difference may not matter.
>
> Best,
> John
>
> -----------------------------------------------
> John Fox, Professor
> McMaster University
> Hamilton, Ontario, Canada
> http://socserv.socsci.mcmaster.ca/jfox/
>
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On Behalf Of Jeff Newmiller
>> Sent: Friday, October 31, 2014 10:15 AM
>> To: Charles Novaes de Santana; r-help at r-project.org
>> Subject: Re: [R] Comparing matrices in R - matrixB %in% matrixA
>>
>> Thank you for the reproducible example, but posting in HTML can corrupt
>> your example code so please learn to set your email client mail format
>> appropriately when posting to this list.
>>
>> I think this [1] post, found with a quick Google search for "R match
>> matrix", fits your situation perfectly.
>>
>> match(data.frame(t(B)), data.frame(t(A)))
>>
>> Note that concatenating vectors in loops is bad news... a basic
>> optimization for your code would be to preallocate a logical result
>> vector and fill in each element with a TRUE/FALSE in the outer loop,
>> and use the which() function on that completed vector to identify the
>> index numbers (if you really need that). For example:
>>
>> lresult <- rep( NA, nrow(A) )
>> for ( ia in seq.int( nrow( A ) ) ) {
>>   lres <- FALSE
>>   ib <- 0
>>   while ( ib < nrow( B ) && !lres ) {
>>     ib <- ib + 1
>>     lres <- all( A[ ia, ] == B[ ib, ] )
>>   }
>>   lresult[ ia ] <- lres
>> }
>> result <- which( lresult )
>>
>> [1] http://stackoverflow.com/questions/12697122/in-r-match-function-for-rows-or-columns-of-matrix
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
>> Sent from my phone. Please excuse my brevity.
>>
>> On October 31, 2014 6:20:38 AM PDT, Charles Novaes de Santana
>> <charles.santana at gmail.com> wrote:
>>> My apologies; I sent the message before finishing it, and I am very
>>> sorry about this. Please find my message below (I tend to write my
>>> messages from the end to the beginning... sorry :)).
>>>
>>> Dear all,
>>>
>>> I am trying to compare two matrices in order to find which rows of a
>>> matrix A contain the same values as the rows of a matrix B. I am trying
>>> to do it for matrices with around 2500 elements, but please find below
>>> a toy example:
>>>
>>> A = matrix(1:10,nrow=5)
>>> B = A[-c(1,2,3),];
>>>
>>> So
>>>> A
>>>     [,1] [,2]
>>> [1,]    1    6
>>> [2,]    2    7
>>> [3,]    3    8
>>> [4,]    4    9
>>> [5,]    5   10
>>>
>>> and
>>>> B
>>>     [,1] [,2]
>>> [1,]    4    9
>>> [2,]    5   10
>>>
>>> I would like to compare A and B in order to find which rows of A match
>>> the rows of B, something similar to %in% with one-dimensional arrays.
>>> In the example above, the answer should be 4 and 5.
>>>
>>> I wrote a function to do it (see below). It gives me the correct answer
>>> for this toy example, but the excess of for-loops makes it extremely
>>> slow for larger matrices. I was wondering if there is a better way to do
>>> this kind of comparison. Any ideas? Sorry if it is a stupid question.
>>>
>>> matbinmata<-function(B,A){
>>>    res<-c();
>>>    rowsB = length(B[,1]);
>>>    rowsA = length(A[,1]);
>>>    colsB = length(B[1,]);
>>>    colsA = length(A[1,]);
>>>    for (i in 1:rowsB){
>>>        for (j in 1:colsB){
>>>            for (k in 1:rowsA){
>>>                for (l in 1:colsA){
>>>                    if(A[k,l]==B[i,j]){res<-c(res,k);}
>>>                }
>>>            }
>>>        }
>>>    }
>>>    return(unique(sort(res)));
>>> }
>>>
>>>
>>> Best,
>>>
>>> Charles
>>>
>>>
>>>
>>>
>>> --
>>> Um axé! :)
>>>
>>> --
>>> Charles Novaes de Santana, PhD
>>> http://www.imedea.uib-csic.es/~charles
>>>
>>> 	[[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k


