[R] subset data.frame with value != in all columns
Tim Howard
tghoward at gw.dec.state.ny.us
Mon Feb 7 14:47:48 CET 2005
Petr,
Thank you! Yes, rowSums appears to be even a little bit faster than
unique(which()), and it also maintains the original order. I do want
original order maintained, but I first apply a function to one of my
data.frames (that without any -9999s ... yes, these do represent nulls,
as someone asked earlier) and rbind these two dataframes back together,
so I need to sort (by rownames) after the rbind (there doesn't seem to
be a sortby option in rbind).
I apologize for not jumping on rowSums earlier, I hadn't caught on
that it was summing counts of occurrence of the search value, not
summing the search value itself.
Thanks again, this is very instructive and *very* helpful.
humbly,
Tim
>>> "Petr Pikal" <petr.pikal at precheza.cz> 02/07/05 02:12AM >>>
Hi Tim
I can not say much about apply, but the code with unique(which())
gives you reordered rows in case of -9999 selection
try
set.seed(1)
in.df <- data.frame(
c1=rnorm(40000),
c2=rnorm(40000),
c3=rnorm(40000),
c4=rnorm(40000),
c5=rnorm(40000))
in.df[in.df>3] <- (-9999)
system.time(e <- in.df[unique(which(in.df == -9999, arr.ind =
TRUE)[,1]), ])
system.time(e1 <- in.df[(rowSums(in.df == -9999)) != 0,])
all.equal(e,e1)
So if you mind you need to do reordering.
ooo<-order(as.numeric(rownames(e)))
all.equal(e[ooo,],e1)
Cheers
Petr
On 4 Feb 2005 at 11:17, Tim Howard wrote:
> Because I'll be doing this on big datasets and time is important, I
> thought I'd time all the different approaches that were suggested on
a
> small dataframe. The results were very instructive so I thought I'd
> pass them on. I also discovered that my numeric columns (e.g.
> -9999.000) weren't found by apply() but were found by which() and
the
> simple replace. Was it apply's fault or something else?
>
> Note how much faster unique(which()) is; wow! Thanks to Marc
Schwartz
> for this blazing solution.
>
> > nrow(in.df)
> [1] 40000
> #extract rows with no -9999
> > system.time(x <- subset(in.df, apply(in.df, 1,
> function(in.df){all(in.df != -9999)})))
> [1] 3.25 0.00 3.25 NA NA
> > system.time(y<- in.df[-unique(which(in.df == -9999, arr.ind =
> > TRUE)[,
> 1]), ])
> [1] 0.17 0.00 0.17 NA NA
> > system.time({is.na(in.df) <-in.df == -9999; z <- na.omit(in.df)})
> [1] 0.25 0.02 0.26 NA NA
>
> > nrow(x);nrow(y);nrow(z)
> [1] 39990
> [1] 39626
> [1] 39626
>
> #extract rows with -9999
> > system.time(d<-subset(in.df, apply(in.df, 1,
> function(in.df){any(in.df == -9999)})))
> [1] 3.40 0.00 3.45 NA NA
> > system.time(e<-in.df[unique(which(in.df == -9999, arr.ind =
TRUE)[,
> 1]), ])
> [1] 0.11 0.00 0.11 NA NA
>
> > nrow(d); nrow(e)
> [1] 10
> [1] 374
>
> Tim Howard
>
>
> >>> Marc Schwartz <MSchwartz at MedAnalytics.com> 02/03/05 03:24PM >>>
> On Thu, 2005-02-03 at 14:57 -0500, Tim Howard wrote:
> ... snip...
> > My questions:
> > Is there a cleaner way to extract all rows containing a specified
> > value? How can I extract all rows that don't have this value in
any
> > col?
> >
> > #create dummy dataset
> > x <- data.frame(
> > c1=c(-99,-99,-99,4:10),
> > c2=1:10,
> > c3=c(1:3,-99,5:10),
> > c4=c(10:1),
> > c5=c(1:9,-99))
> >
> ..snip...
>
> How about this, presuming that your data frame is all numeric:
>
> For rows containing -99:
>
> > x[unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
> c1 c2 c3 c4 c5
> 1 -99 1 1 10 1
> 2 -99 2 2 9 2
> 3 -99 3 3 8 3
> 4 4 4 -99 7 4
> 10 10 10 10 1 -99
>
>
> For rows not containing -99:
>
> > x[-unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
> c1 c2 c3 c4 c5
> 5 5 5 5 6 5
> 6 6 6 6 5 6
> 7 7 7 7 4 7
> 8 8 8 8 3 8
> 9 9 9 9 2 9
>
>
> What I have done here is to use which(), setting arr.ind = TRUE.
This
> returns the row, column indices for the matches to the boolean
> statement. The first column returned by which() in this case are the
> row numbers matching the statement, so I take the first column only.
>
> Since it is possible that more than one element in a row can match
the
> boolean, I then use unique() to get the singular row values.
>
> Thus, I can use the returned row indices above to subset the data
> frame.
>
> HTH,
>
> Marc Schwartz
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
Petr Pikal
petr.pikal at precheza.cz
More information about the R-help
mailing list