[R] how to use 'which' inside of 'apply'?

Mon Oct 17 21:24:30 CEST 2011

Your original code works far faster when the input
is a matrix that when it is a data.frame.  Selecting
a row from a data.frame is a very slow operation,
selecting a row from a matrix is quick.  Modifying
a row or a single element in a data.frame is even
worse compared to do it on a matrix.  I compared your
original code:

f0 <- function (df)
{
    for (i in seq_len(nrow(df))) {
        t = which(df[i, 2:24] > df[i, 25])
        r = min(t)
        df[i, 26] = (r - 1) * 16 + 1
    }
    df[, 26] # for now just return the computed column
}

to one that converts relevant parts of the data.frame
df to matrices or vectors before doing the loop over
rows:

f0.a <- function (df)
{
    thold <- df[, "thold"]
    tmp <- as.matrix(df[,2:24])
    ans <- df[,26]
    for (i in seq_len(nrow(df))) {
        t = which(tmp[i,]>thold[i])
        r = min(t)
        ans[i] = (r-1)*16+1
    }
    # df[,26] <- ans
    ans
}

On a 10,000 row data.frame f0 took 47.950 seconds and f0.a took
0.140 seconds.  (f1, below, took 0.012 seconds.)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: William Dunlap
> Sent: Monday, October 17, 2011 11:00 AM
> To: Nathan Piekielek
> Cc: r-help at r-project.org
> Subject: RE: [R] how to use 'which' inside of 'apply'?
> 
> Try vectorizing it a bit by looping over the columns.
> E.g.,
> 
>   f1 <- function (df)
>   {
>       # loop (backwards) over all columns in df whose
>       # names start with "D" to find the earliest one
>       # that is bigger than column "thold".  I tested with
>       # df being a data.frame but a matrix should work too.
>       i <- rep(NA_character_, nrow(df))
>       colNames <- grep(value = TRUE, "^D", colnames(df))
>       for (colName in rev(colNames)) {
>           i[df[, colName] > df[, "thold"]] <- colName
>       }
>       # convert column name "D<number>" to <number>.
>       doy <- as.numeric(sub("^D", "", i))
>       doy
>   }
>   > f1(a)
>   [1] 129 145 129 177 177 177
> 
> You could also try looping over rows with something like
> findInterval.  If there are far fewer columns than rows
> then looping over columns is generally faster.
> 
> Your sample data.frame I called 'a' and in copy-and-pastable
> form (from dput()) is
> a <- structure(list(pt = c(39177L, 39178L, 39164L, 39143L, 39144L,
> 39146L), D1 = c(0L, 0L, 0L, 0L, 0L, 0L), D17 = c(0L, 0L, 0L,
> 0L, 0L, 0L), D33 = c(0L, 0L, 0L, 0L, 0L, 0L), D49 = c(0L, 0L,
> 0L, 0L, 0L, 0L), D65 = c(0L, 0L, 0L, 0L, 0L, 0L), D81 = c(0L,
> 0L, 0L, 0L, 0L, 0L), D97 = c(0L, 0L, 0L, 0L, 0L, 0L), D113 = c(0L,
> 0L, 0L, 0L, 0L, 0L), D129 = c(0.4336, 0.342, 0.483, 0.3088, 0.339,
> 0.4232), D145 = c(0.4754, 0.4543, 0.4943, 0.3753, 0.4152, 0.4442
> ), D161 = c(0.5340667, 0.5397666, 0.5740333, 0.4466, 0.5147,
> 0.5084), D177 = c(0.5927334, 0.6252333, 0.6537667, 0.5179, 0.6142,
> 0.5726), D193 = c(0.6514, 0.7107, 0.7335, 0.5892, 0.7137, 0.6368
> ), D209 = c(0.6966, 0.7123, 0.6255, 0.6468, 0.6914, 0.5896),
>     D225 = c(0.59, 0.5591, 0.6228, 0.4794, 0.6381, 0.4703), D241 = c(0.5583,
>     0.4617, 0.5255, 0.4411, 0.5704, 0.4936), D257 = c(0.5676,
>     0.4206, 0.5436, 0.4307, 0.5619, 0.5353), D273 = c(0.4682,
>     0.3867, 0.5541, 0.3632, 0.5347, 0.4067), D289 = c(0.35115,
>     0.2578, 0.46195, 0.34355, 0.4976, 0.39685), D305 = c(0.2341,
>     0.1289, 0.3698, 0.3239, 0.4605, 0.387), D321 = c(0.11705,
>     0, 0.1849, 0, 0, 0), D337 = c(0L, 0L, 0L, 0L, 0L, 0L), D353 = c(0L,
>     0L, 0L, 0L, 0L, 0L), thold = c(0.406825, 0.4206, 0.4592,
>     0.4778, 0.52635, 0.5119), doy = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("pt",
> "D1", "D17", "D33", "D49", "D65", "D81", "D97", "D113", "D129",
> "D145", "D161", "D177", "D193", "D209", "D225", "D241", "D257",
> "D273", "D289", "D305", "D321", "D337", "D353", "thold", "doy"
> ), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
> "6"))
> 
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of R. Michael
> > Weylandt
> > Sent: Monday, October 17, 2011 10:32 AM
> > To: Nathan Piekielek
> > Cc: r-help at r-project.org
> > Subject: Re: [R] how to use 'which' inside of 'apply'?
> >
> > I think something like this should do it at a huge speed up, though
> > I'd advise you check it to make sure it does exactly what you want:
> > there's also nothing to guarantee that something beats the threshold,
> > so that might make the whole thing fall apart (though I don't think it
> > will)
> >
> > # Sample data
> > df = data.frame(x = sample(5, 15,T),
> > 			y = sample(5, 15, T),
> > 			z = sample(5, 15,T),
> > 			w = (1:5)/2 + 0.5,
> > 			th = (1:5)/2,
> > 			doy = rep(0,15))
> >
> > wd <- which(df[,1:4] > df[,5], arr.ind = TRUE)
> > # identify all elements that beat the threshold value by their indices
> >
> > wd <- wd[!duplicated(wd[,1]),]
> > # select only the first appearance of each "row" value in wd -- this
> > keeps the earliest column beating the threshold
> >
> > wd <- wd[order(wd[,"row"]),]
> > # sort them by row
> >
> > df$doy = (wd[,"col"]-1)*16 + 1
> > # The column transform you used.
> >
> > Hope this helps,
> >
> > Michael
> >
> >
> > On Mon, Oct 17, 2011 at 1:03 PM, Nathan Piekielek <npiekielek at gmail.com> wrote:
> > > Hello R-community,
> > >
> > > I am trying to populate a column (doy) in a large dataset with the first
> > > column number that exceeds the value in another column (thold) using the
> > > 'apply' function.
> > >
> > > Sample data:
> > >     pt D1 D17 D33 D49 D65 D81 D97 D113   D129   D145      D161      D177
> > > D193   D209   D225   D241   D257
> > > 1 39177  0   0   0   0   0   0   0    0 0.4336 0.4754 0.5340667 0.5927334
> > > 0.6514 0.6966 0.5900 0.5583 0.5676
> > > 2 39178  0   0   0   0   0   0   0    0 0.3420 0.4543 0.5397666 0.6252333
> > > 0.7107 0.7123 0.5591 0.4617 0.4206
> > > 3 39164  0   0   0   0   0   0   0    0 0.4830 0.4943 0.5740333 0.6537667
> > > 0.7335 0.6255 0.6228 0.5255 0.5436
> > > 4 39143  0   0   0   0   0   0   0    0 0.3088 0.3753 0.4466000 0.5179000
> > > 0.5892 0.6468 0.4794 0.4411 0.4307
> > > 5 39144  0   0   0   0   0   0   0    0 0.3390 0.4152 0.5147000 0.6142000
> > > 0.7137 0.6914 0.6381 0.5704 0.5619
> > > 6 39146  0   0   0   0   0   0   0    0 0.4232 0.4442 0.5084000 0.5726000
> > > 0.6368 0.5896 0.4703 0.4936 0.5353
> > >    D273    D289   D305    D321 D337 D353    thold doy
> > > 1 0.4682 0.35115 0.2341 0.11705    0    0 0.406825   0
> > > 2 0.3867 0.25780 0.1289 0.00000    0    0 0.420600   0
> > > 3 0.5541 0.46195 0.3698 0.18490    0    0 0.459200   0
> > > 4 0.3632 0.34355 0.3239 0.00000    0    0 0.477800   0
> > > 5 0.5347 0.49760 0.4605 0.00000    0    0 0.526350   0
> > > 6 0.4067 0.39685 0.3870 0.00000    0    0 0.511900   0
> > >
> > > For the first record in above example I would expect doy = 129.
> > >
> > > I can achieve this with the following loop, but it takes several days to run
> > > and there must be a more efficient solution:
> > >
> > > for (i in (1:152000)) {
> > > t=which(data[i,2:24]>data[i,25])
> > > r=min(t)
> > > data[i,26]=(r-1)*16+1
> > > }
> > >
> > > How do I write this using 'apply' or another function that will be more
> > > efficient?
> > >
> > > I have tried the following:
> > > data$doy=apply(which(data[,2:24]>data[,25]),1,min)
> > >
> > > Which returns the following error message:
> > > "Error in apply(which(new[, 2:24] > new[, 25]), 1, min) :
> > >  dim(X) must have a positive length"
> > >
> > > Any help would be much appreciated.
> > >
> > > Nathan
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.