[R] Need a vectorized way to avoid two nested FOR loops

Thu Oct 8 19:41:38 CEST 2009

Bert, Jim, Dimitris and Joris,

Thank you all very much for your prompt help and suggestions.

After trying the ideas out, I have decided to go with Bert's approach  
since it is by far the fastest of the lot.

Thanks again!

Rama Ramakrishnan

On Oct 8, 2009, at 12:49 PM, Bert Gunter wrote:

>
> If I understand your intent, I believe you can get what you want much
> faster (no interpreted loops, and linear time) by looking at this
> slightly differently.
>
> First of all, the choice of columns is unimportant, as indexing can be
> used to create a data frame containing only the columns of interest. So
> I think you can abstract your request to: group the rows of a data frame
> so that all rows in a group "match." Now the problem here is exactly
> what you mean by "match." If the data are numeric, finite precision
> arithmetic requires one to ask whether you mean **exactly equal** or
> just equal within a tolerance. I shall assume the former, but the latter
> is often what one wants. It is a little more difficult to handle, but
> one way to do it with the present approach is to first round to a few
> digits that represent the tolerance and then proceed with the rounded
> values.
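
The rounding idea above might look like the following minimal sketch. The
toy data frame d and the choice of 3 decimal places are assumptions made
only for illustration; they are not from the original posts.

```r
# Toy data: rows 1 and 2 differ only by a tiny amount in column a
d <- data.frame(a = c(1.0000001, 1.0000002, 2), b = c(5, 5, 5))

# Exact matching: paste the columns into one key per row
key_exact <- do.call(paste, c(d, sep = "+"))
id_exact <- as.integer(factor(key_exact))

# Tolerant matching: round first (3 decimals is an assumed tolerance),
# then build the key from the rounded values
key_round <- do.call(paste, c(round(d, 3), sep = "+"))
id_round <- as.integer(factor(key_round))

id_exact  # rows 1 and 2 get different ids
id_round  # rows 1 and 2 share an id
```

Rounding is only an approximation to "equal within a tolerance": two
values that are very close but fall on opposite sides of a rounding
boundary will still end up in different groups.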
>
> As always (and as recommended by the posting guide!) a small
> reproducible example is helpful:
>
> ## Create a data frame with groups of identical rows.
>
> z <- data.frame(matrix(rnorm(60), ncol = 3))[sample(20, 50, replace = TRUE), ]
>
> ## Now create a factor of "id's" in which identical rows
> ## have identical id's (a hash)
>
> id <- factor(do.call(paste,c(z,sep="+")))
>
> ## The levels of the factor now "index" groups of rows that "match."
> ## They can be easily accessed in a variety of ways, e.g.
>
> as.numeric(id)
> ## gives all rows of each group of matching rows
> ## the same integer index.
>
> etc.
> This all requires only linear time.
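
To get from the group ids back to what was originally asked (for each
row, the other rows that match it), one possible sketch is below. The
names groups, grp and others are made up for illustration, and
set.seed(1) is added only to make the example repeatable.

```r
set.seed(1)
z <- data.frame(matrix(rnorm(60), ncol = 3))[sample(20, 50, replace = TRUE), ]
id <- factor(do.call(paste, c(z, sep = "+")))

# Integer group code for every row
grp <- as.integer(id)

# Row positions belonging to each group, one list element per factor level
groups <- split(seq_len(nrow(z)), id)

# For each row, look up its group and drop the row itself
others <- lapply(seq_along(grp), function(i) {
    g <- groups[[grp[i]]]
    g[g != i]
})

others[1:3]
```

The grouping step stays linear in the number of rows. The final lapply
does work proportional to the size of each row's group, so it only gets
expensive when many rows share the same values.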
>
> Hope this helps -- or my apologies if I have misinterpreted what was
> requested.
>
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Dimitris Rizopoulos
> Sent: Thursday, October 08, 2009 6:28 AM
> To: joris meys
> Cc: r-help at r-project.org; Rama Ramakrishnan
> Subject: Re: [R] Need a vectorized way to avoid two nested FOR loops
>
> Another approach is:
>
> n <- 20
> set.seed(2)
> x <- as.data.frame(matrix(sample(1:2, n*6, TRUE), nrow = n))
> x.col <- c(1, 3, 5)
>
> values <- do.call(paste, c(x[x.col], sep = "\r"))
> out <- lapply(seq_along(values), function (i) {
>     ind <- which(values == values[i])
>     ind[!ind %in% i]
> })
> out
>
>
> Best,
> Dimitris
>
>
> joris meys wrote:
>> Neat piece of code, Jim, but it still uses a nested loop. If you order
>> the matrix first, you only need one pass through the whole matrix
>> to find the information you need.
>>
>> Of course, I am not taking the cost of the ordering into account. If
>> the ordering algorithm doesn't run in linear time, then I guess it
>> doesn't really matter: the limiting step would become the ordering
>> algorithm.
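
A rough sketch of the order-then-single-pass idea, using the same test
data as elsewhere in this thread. The names ord, xs, changed, run and grp
are made up for illustration; this is one way to read the suggestion, not
necessarily what was intended.

```r
n <- 20
set.seed(2)
x <- as.data.frame(matrix(sample(1:2, n * 6, TRUE), nrow = n))
x.col <- c(1, 3, 5)

# Sort the row indices by the columns of interest
ord <- do.call(order, unname(as.list(x[x.col])))
xs <- as.matrix(x[ord, x.col])

# One pass over the sorted rows: a row starts a new group
# when it differs from the row just before it
changed <- rowSums(xs[-1, , drop = FALSE] != xs[-n, , drop = FALSE]) > 0
run <- cumsum(c(TRUE, changed))

# Map the group numbers back to the original row order
grp <- integer(n)
grp[ord] <- run

split(seq_len(n), grp)
```

After sorting, matching rows sit next to each other, so comparing each
row only with its predecessor is enough. The overall cost is then
dominated by order(), as noted above.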
>>
>> Kind regards
>> Joris
>>
>>
>>
>> On Thu, Oct 8, 2009 at 2:24 PM, jim holtman <jholtman at gmail.com> wrote:
>>> I answered the wrong question.  Here is the code to find all the
>>> matches for each row:
>>>
>>> n <- 20
>>> set.seed(2)
>>> # create test dataframe
>>> x <- as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n))
>>> x
>>> x.col <- c(1,3,5)
>>>
>>> # match against all the other rows
>>> x.match1 <- apply(x[, x.col], 1, function(a){
>>>   .mat <- which(apply(x[, x.col], 1, function(z){
>>>       all(a == z)
>>>   }))
>>> })
>>>
>>> # remove matches to itself
>>> x.match2 <- lapply(seq(length(x.match1)), function(z){
>>>   x.match1[[z]][!(x.match1[[z]] %in% z)]
>>> })
>>> # x.match2 contains, for each row, the indices of the other rows that match it
>>>
>>> On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan <rama at alum.mit.edu> wrote:
>>>> Hi Friends,
>>>>
>>>> I have a data frame d. Let vars be the column indices for a subset of
>>>> the columns in d (e.g., vars <- c(1,3,4,8)).
>>>>
>>>> For each row r in d, I want to collect all the other rows in d that
>>>> match the values in row r for just the columns in vars.
>>>>
>>>> The naive way to do this is to have a for loop stepping through each
>>>> row in d, and within the loop have another loop going through all the
>>>> rows again, checking for equality. This is quadratic in the number of
>>>> rows and takes way too long. Is there a better, "vectorized" way to do
>>>> this?
>>>>
>>>> Thanks in advance!
>>>>
>>>> Rama Ramakrishnan
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>
> -- 
> Dimitris Rizopoulos
> Assistant Professor
> Department of Biostatistics
> Erasmus University Medical Center
>
> Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
> Tel: +31/(0)10/7043478
> Fax: +31/(0)10/7043014
>