[R] subset data.frame with value != in all columns
Tim Howard
tghoward at gw.dec.state.ny.us
Fri Feb 4 17:17:40 CET 2005
Because I'll be doing this on big datasets and time is important, I
thought I'd time all the different approaches that were suggested on a
small dataframe. The results were very instructive so I thought I'd pass
them on. I also discovered that my numeric columns (e.g. -9999.000)
weren't found by apply() but were found by which() and the simple
replace. Was it apply's fault or something else?
Note how much faster unique(which()) is; wow! Thanks to Marc Schwartz
for this blazing solution.
> nrow(in.df)
[1] 40000
#extract rows with no -9999
> system.time(x <- subset(in.df, apply(in.df, 1,
function(in.df){all(in.df != -9999)})))
[1] 3.25 0.00 3.25 NA NA
> system.time(y<- in.df[-unique(which(in.df == -9999, arr.ind = TRUE)[,
1]), ])
[1] 0.17 0.00 0.17 NA NA
> system.time({is.na(in.df) <-in.df == -9999; z <- na.omit(in.df)})
[1] 0.25 0.02 0.26 NA NA
> nrow(x);nrow(y);nrow(z)
[1] 39990
[1] 39626
[1] 39626
#extract rows with -9999
> system.time(d<-subset(in.df, apply(in.df, 1,
function(in.df){any(in.df == -9999)})))
[1] 3.40 0.00 3.45 NA NA
> system.time(e<-in.df[unique(which(in.df == -9999, arr.ind = TRUE)[,
1]), ])
[1] 0.11 0.00 0.11 NA NA
> nrow(d); nrow(e)
[1] 10
[1] 374
Tim Howard
>>> Marc Schwartz <MSchwartz at MedAnalytics.com> 02/03/05 03:24PM >>>
On Thu, 2005-02-03 at 14:57 -0500, Tim Howard wrote:
... snip...
> My questions:
> Is there a cleaner way to extract all rows containing a specified
> value?
> How can I extract all rows that don't have this value in any col?
>
> #create dummy dataset
> x <- data.frame(
> c1=c(-99,-99,-99,4:10),
> c2=1:10,
> c3=c(1:3,-99,5:10),
> c4=c(10:1),
> c5=c(1:9,-99))
>
..snip...
How about this, presuming that your data frame is all numeric:
For rows containing -99:
> x[unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
c1 c2 c3 c4 c5
1 -99 1 1 10 1
2 -99 2 2 9 2
3 -99 3 3 8 3
4 4 4 -99 7 4
10 10 10 10 1 -99
For rows not containing -99:
> x[-unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
c1 c2 c3 c4 c5
5 5 5 5 6 5
6 6 6 6 5 6
7 7 7 7 4 7
8 8 8 8 3 8
9 9 9 9 2 9
What I have done here is to use which(), setting arr.ind = TRUE. This
returns the row, column indices for the matches to the boolean
statement. The first column returned by which() in this case are the
row
numbers matching the statement, so I take the first column only.
Since it is possible that more than one element in a row can match the
boolean, I then use unique() to get the singular row values.
Thus, I can use the returned row indices above to subset the data
frame.
HTH,
Marc Schwartz
More information about the R-help
mailing list