[R] Need a vectorized way to avoid two nested FOR loops
Dimitris Rizopoulos
d.rizopoulos at erasmusmc.nl
Thu Oct 8 20:14:23 CEST 2009
Bert Gunter wrote:
> If I understand your intent, I believe you can get what you want much faster
> (no interpreted loops and linear times) by looking at this slightly
> differently.
>
> First of all, the choice of columns is unimportant, as indexing can be used
> to create a data frame containing only the columns of interest. So I think
> you can abstract your request to: group the rows of a data frame so that all
> rows in a group "match." Now the problem here is exactly what you mean by
> "match." If the data are numeric, finite precision arithmetic requires one
> to ask whether you mean **exactly equal** or just equal within a tolerance.
> I shall assume the former, but the latter is often what one wants. It is a
> little more difficult to handle, but one way to do it with the present
> approach is to first round to a few digits that represent the tolerance and
> then proceed with the rounded values.
>
> As always (and as recommended by the posting guide !) a small reproducible
> example is helpful:
>
> ## Create a data frame with groups of identical rows.
>
> z <- data.frame(matrix(rnorm(60),ncol=3))[sample(20,50,repl=TRUE),]
>
> ## now create a factor column of "id's" in which identical columns
> ## have identical id's (a hash)
>
> id <- factor(do.call(paste,c(z,sep="+")))
>
> ## The levels of the factors now "index" groups of rows that "match"
> ## They can be easily accessed in a variety of way, e.g.
>
> as.numeric(id)
well, if this is indeed the case, then it is not even required to
convert it to a factor, i.e.,
vals <- do.call(paste,c(z,sep="+"))
match(vals, sort(unique(vals)))
# or even
match(vals, unique(vals))
which I think is more efficient. However, I'm not sure if this is what
was in fact requested.
Best,
Dimitris
> ## gives all rows of each group of matching rows
> ## the same integer index.
>
> etc.
> This all requires only linear time.
>
> Hope this helps -- or my apologies if I have misinterpreted what was
> requested.
>
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Dimitris Rizopoulos
> Sent: Thursday, October 08, 2009 6:28 AM
> To: joris meys
> Cc: r-help at r-project.org; Rama Ramakrishnan
> Subject: Re: [R] Need a vectorized way to avoid two nested FOR loops
>
> Another approach is:
>
> n <- 20
> set.seed(2)
> x <- as.data.frame(matrix(sample(1:2, n*6, TRUE), nrow = n))
> x.col <- c(1, 3, 5)
>
> values <- do.call(paste, c(x[x.col], sep = "\r"))
> out <- lapply(seq_along(ind), function (i) {
> ind <- which(values == values[i])
> ind[!ind %in% i]
> })
> out
>
>
> Best,
> Dimitris
>
>
> joris meys wrote:
>> Neat piece of code, Jim, but it still uses a nested loop. If you order
>> the matrix first, you only need one passage through the whole matrix
>> to find the information you need.
>>
>> Off course I don't take into account the ordering. If the ordering
>> algorithm doesn't work in linear time, then it doesn't really matter I
>> guess. The limiting step would become the ordering algorithm.
>>
>> Kind regards
>> Joris
>>
>>
>>
>> On Thu, Oct 8, 2009 at 2:24 PM, jim holtman <jholtman at gmail.com> wrote:
>>> I answered the wrong question. Here is the code to find all the
>>> matches for each row:
>>>
>>> n <- 20
>>> set.seed(2)
>>> # create test dataframe
>>> x <- as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n))
>>> x
>>> x.col <- c(1,3,5)
>>>
>>> # match against all the other rows
>>> x.match1 <- apply(x[, x.col], 1, function(a){
>>> .mat <- which(apply(x[, x.col], 1, function(z){
>>> all(a == z)
>>> }))
>>> })
>>>
>>> # remove matches to itself
>>> x.match2 <- lapply(seq(length(x.match1)), function(z){
>>> x.match1[[z]][!(x.match1[[z]] %in% z)]
>>> })
>>> # x.match2 contains which rows indices match
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan <rama at alum.mit.edu>
> wrote:
>>>> Hi Friends,
>>>>
>>>> I have a data frame d. Let vars be the column indices for a subset of
> the
>>>> columns in d (e.g., vars <- c(1,3,4,8))
>>>>
>>>> For each row r in d, I want to collect all the other rows in d that
> match
>>>> the values in row r for just the columns in vars.
>>>>
>>>> The naive way to do this is to have a for loop stepping through each row
> in
>>>> d, and within the loop have another loop going through all the rows
> again,
>>>> checking for equality. This is quadratic in the number of rows and takes
> way
>>>> too long. Is there a better, "vectorized" way to do this?
>>>>
>>>> Thanks in advance!
>>>>
>>>> Rama Ramakrishnan
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center
Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
More information about the R-help
mailing list