[R] subset data.frame with value != in all columns

Tim Howard tghoward at gw.dec.state.ny.us
Mon Feb 7 14:47:48 CET 2005


Petr,
   Thank you!  Yes, rowSums appears to be even a little bit faster than
unique(which()), and it also maintains the original order. I do want
original order maintained, but I first apply a function to one of my
data.frames (that without any -9999s ... yes, these do represent nulls,
as someone asked earlier) and rbind these two dataframes back together,
so I need to sort (by rownames) after the rbind (there doesn't seem to
be a sortby option in rbind). 
   I apologize for not jumping on rowSums earlier, I hadn't caught on
that it was summing counts of occurrence of the search value, not
summing the search value itself.
   Thanks again, this is very instructive and *very* helpful.
humbly,
Tim

>>> "Petr Pikal" <petr.pikal at precheza.cz> 02/07/05 02:12AM >>>
Hi Tim

I can not say much about apply, but the code with unique(which()) 
gives you reordered rows in case of -9999 selection

try

set.seed(1)
in.df <- data.frame(
c1=rnorm(40000),
c2=rnorm(40000),
c3=rnorm(40000),
c4=rnorm(40000),
c5=rnorm(40000))
in.df[in.df>3] <- (-9999)

system.time(e <- in.df[unique(which(in.df == -9999, arr.ind = 
TRUE)[,1]), ])
system.time(e1 <- in.df[(rowSums(in.df == -9999)) != 0,])

all.equal(e,e1)

So if you mind you need to do reordering.

ooo<-order(as.numeric(rownames(e)))
all.equal(e[ooo,],e1)

Cheers
Petr

On 4 Feb 2005 at 11:17, Tim Howard wrote:

> Because I'll be doing this on big datasets and time is important, I
> thought I'd time all the different approaches that were suggested on
a
> small dataframe. The results were very instructive so I thought I'd
> pass them on. I also discovered that my numeric columns (e.g.
> -9999.000) weren't found by apply() but were found by which() and
the
> simple replace. Was it apply's fault or something else?
> 
> Note how much faster unique(which()) is; wow! Thanks to Marc
Schwartz
> for this blazing solution.
> 
> > nrow(in.df)
> [1] 40000
> #extract rows with no -9999
> > system.time(x <- subset(in.df, apply(in.df, 1,
> function(in.df){all(in.df != -9999)})))
> [1] 3.25 0.00 3.25   NA   NA
> > system.time(y<- in.df[-unique(which(in.df == -9999, arr.ind =
> > TRUE)[,
> 1]), ])
> [1] 0.17 0.00 0.17   NA   NA
> > system.time({is.na(in.df) <-in.df == -9999; z <- na.omit(in.df)})
> [1] 0.25 0.02 0.26   NA   NA
> 
> > nrow(x);nrow(y);nrow(z)
> [1] 39990
> [1] 39626
> [1] 39626
> 
> #extract rows with -9999
> > system.time(d<-subset(in.df, apply(in.df, 1,
> function(in.df){any(in.df == -9999)})))
> [1] 3.40 0.00 3.45   NA   NA
> > system.time(e<-in.df[unique(which(in.df == -9999, arr.ind =
TRUE)[,
> 1]), ])
> [1] 0.11 0.00 0.11   NA   NA
> 
> > nrow(d); nrow(e)
> [1] 10
> [1] 374
> 
> Tim Howard
> 
> 
> >>> Marc Schwartz <MSchwartz at MedAnalytics.com> 02/03/05 03:24PM >>>
> On Thu, 2005-02-03 at 14:57 -0500, Tim Howard wrote: 
>   ... snip...
> > My questions: 
> > Is there a cleaner way to extract all rows containing a specified
> > value? How can I extract all rows that don't have this value in
any
> > col?
> > 
> > #create dummy dataset
> > x <- data.frame(
> > c1=c(-99,-99,-99,4:10),
> > c2=1:10,
> > c3=c(1:3,-99,5:10),
> > c4=c(10:1),
> > c5=c(1:9,-99))
> > 
> ..snip...
> 
> How about this, presuming that your data frame is all numeric:
> 
> For rows containing -99:
> 
> > x[unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
>     c1 c2  c3 c4  c5
> 1  -99  1   1 10   1
> 2  -99  2   2  9   2
> 3  -99  3   3  8   3
> 4    4  4 -99  7   4
> 10  10 10  10  1 -99
> 
> 
> For rows not containing -99:
> 
> > x[-unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
>   c1 c2 c3 c4 c5
> 5  5  5  5  6  5
> 6  6  6  6  5  6
> 7  7  7  7  4  7
> 8  8  8  8  3  8
> 9  9  9  9  2  9
> 
> 
> What I have done here is to use which(), setting arr.ind = TRUE.
This
> returns the row, column indices for the matches to the boolean
> statement. The first column returned by which() in this case are the
> row numbers matching the statement, so I take the first column only.
> 
> Since it is possible that more than one element in a row can match
the
> boolean, I then use unique() to get the singular row values.
> 
> Thus, I can use the returned row indices above to subset the data
> frame.
> 
> HTH,
> 
> Marc Schwartz
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help 
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html 

Petr Pikal
petr.pikal at precheza.cz




More information about the R-help mailing list