[R] Need a vectorized way to avoid two nested FOR loops

Thu Oct 8 20:14:23 CEST 2009

Bert Gunter wrote:
> If I understand your intent, I believe you can get what you want much faster
> (no interpreted loops and linear times) by looking at this slightly
> differently.
> 
> First of all, the choice of columns is unimportant, as indexing can be used
> to create a data frame containing only the columns of interest. So I think
> you can abstract your request to: group the rows of a data frame so that all
> rows in a group "match."  Now the problem here is exactly what you mean by
> "match." If the data are numeric, finite precision arithmetic requires one
> to ask whether you mean  **exactly equal** or just equal within a tolerance.
> I shall assume the former, but the latter is often what one wants. It is a
> little more difficult to handle, but one way to do it with the present
> approach is to first round to a few digits that represent the tolerance and
> then proceed with the rounded values.
> 
> As always (and as recommended by the posting guide !) a small reproducible
> example is helpful:
> 
> ## Create a data frame with groups of identical rows.
> 
>  z <- data.frame(matrix(rnorm(60),ncol=3))[sample(20,50,repl=TRUE),]
> 
> ## now create a factor column of "id's" in which identical columns
> ## have identical id's (a hash)
> 
> id <- factor(do.call(paste,c(z,sep="+")))
> 
> ## The levels of the factors now "index" groups of rows that "match"
> ## They can be easily accessed in a variety of way, e.g.
> 
> as.numeric(id) 

well, if this is indeed the case, then it is not even required to 
convert it to a factor, i.e.,

vals <- do.call(paste,c(z,sep="+"))
match(vals, sort(unique(vals)))
# or even
match(vals, unique(vals))

which I think is more efficient. However, I'm not sure if this is what 
was in fact requested.

Best,
Dimitris

> ## gives all rows of each group of matching rows
> ## the same integer index.
> 
> etc.
> This all requires only linear time.
> 
> Hope this helps -- or my apologies if I have misinterpreted what was
> requested.
> 
> 
> Bert Gunter
> Genentech Nonclinical Biostatistics
>  
>  
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Dimitris Rizopoulos
> Sent: Thursday, October 08, 2009 6:28 AM
> To: joris meys
> Cc: r-help at r-project.org; Rama Ramakrishnan
> Subject: Re: [R] Need a vectorized way to avoid two nested FOR loops
> 
> Another approach is:
> 
> n <- 20
> set.seed(2)
> x <- as.data.frame(matrix(sample(1:2, n*6, TRUE), nrow = n))
> x.col <- c(1, 3, 5)
> 
> values <- do.call(paste, c(x[x.col], sep = "\r"))
> out <- lapply(seq_along(ind), function (i) {
>      ind <- which(values == values[i])
>      ind[!ind %in% i]
> })
> out
> 
> 
> Best,
> Dimitris
> 
> 
> joris meys wrote:
>> Neat piece of code, Jim, but it still uses a nested loop. If you order
>> the matrix first, you only need one passage through the whole matrix
>> to find the information you need.
>>
>> Off course I don't take into account the ordering. If the ordering
>> algorithm doesn't work in linear time, then it doesn't really matter I
>> guess. The limiting step would become the ordering algorithm.
>>
>> Kind regards
>> Joris
>>
>>
>>
>> On Thu, Oct 8, 2009 at 2:24 PM, jim holtman <jholtman at gmail.com> wrote:
>>> I answered the wrong question.  Here is the code to find all the
>>> matches for each row:
>>>
>>> n <- 20
>>> set.seed(2)
>>> # create test dataframe
>>> x <- as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n))
>>> x
>>> x.col <- c(1,3,5)
>>>
>>> # match against all the other rows
>>> x.match1 <- apply(x[, x.col], 1, function(a){
>>>    .mat <- which(apply(x[, x.col], 1, function(z){
>>>        all(a == z)
>>>    }))
>>> })
>>>
>>> # remove matches to itself
>>> x.match2 <- lapply(seq(length(x.match1)), function(z){
>>>    x.match1[[z]][!(x.match1[[z]] %in% z)]
>>> })
>>> # x.match2 contains which rows indices match
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan <rama at alum.mit.edu>
> wrote:
>>>> Hi Friends,
>>>>
>>>> I have a data frame d. Let vars be the column indices for a subset of
> the
>>>> columns in d (e.g., vars <- c(1,3,4,8))
>>>>
>>>> For each row r in d, I want to collect all the other rows in d that
> match
>>>> the values in row r for just the columns in vars.
>>>>
>>>> The naive way to do this is to have a for loop stepping through each row
> in
>>>> d, and within the loop have another loop going through all the rows
> again,
>>>> checking for equality. This is quadratic in the number of rows and takes
> way
>>>> too long. Is there a better, "vectorized" way to do this?
>>>>
>>>> Thanks in advance!
>>>>
>>>> Rama Ramakrishnan
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 

-- 
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014