[R] How to delete rows with specific values on all columns(variables)?

Tue Feb 22 01:11:43 CET 2011

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
> Sent: Monday, February 21, 2011 3:11 PM
> To: David Winsemius
> Cc: r-help at r-project.org; IgnacioQM
> Subject: Re: [R] How to delete rows with specific values on 
> all columns(variables)?
> 
> 
> On Feb 21, 2011, at 6:05 PM, David Winsemius wrote:
> 
> >
> > On Feb 21, 2011, at 4:03 PM, IgnacioQM wrote:
> >
> >> I need to filter my data:
> >> I think it's easy but I'm stuck, so I'll appreciate some help:
> >>
> >> I have a data frame with 14 variables and 6 million rows. About half of
> >> these rows have a value of "0" in 12 variables (the other two variables
> >> always have values). How can I delete the rows in which all 12 variables
> >> have the value of "0"?
> >>
> >> example (from my data, variable 14 is missing):
> >>
> >> 1783  81  85  78  89  71  97  76  66  88  95  95  98  -57.48258
> >> 1784  81  86  79  90  71  97  77  66  88  95  95  98  -57.43768
> >> 1785  81  86  79  90  71  98  77  66  89  95  94  98  -57.39278
> >> 1786   0   0   0   0   0   0   0   0   0   0   0   0  -57.34788
> >> 1787   0   0   0   0   0   0   0   0   0   0   0   0  -57.30298
> >> 1788  80  86  80  90  72  98  78  66  88  93  93  96  -57.25808
> >> 1789  77  83  78  88  70  95  76  63  86  91  90  93  -57.21318
> >> 1790  77  84  79  89  70  96  76  64  87  91  90  93  -57.16828
> >>
> >> I would need to delete rows 1786 & 1787.
> >
> > something along the lines of:
> >
> > dfrm[ -apply(dfrm, 1, function(x) all(x==0) ), ]
> 
> Looking at it a second time, I see the qualification of only the
> first 12 columns, so
> 
> dfrm[ -apply(dfrm[, 1:12], 1, function(x) all(x==0) ), ]

I think you want !apply, not -apply, as in
  f0 <- function (dfrm) {
      dfrm[!apply(dfrm[, 1:12], 1, function(x) all(x == 0)), ]
  }
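To see why the sign matters, here is a small sketch on a made-up
data frame (the column names a and b are only for illustration).
Negating a logical vector gives 0s and -1s, which `[` treats as
numeric indices, so it drops row 1 rather than the all-zero rows:

```r
# Made-up data: rows 2 and 3 are all-zero.
dfrm <- data.frame(a = c(1, 0, 0, 4), b = c(5, 0, 0, 8))
allZero <- apply(dfrm, 1, function(x) all(x == 0))

-allZero           # numeric values 0, -1, -1, 0
dfrm[-allZero, ]   # index -1 drops row 1; the all-zero rows survive
dfrm[!allZero, ]   # logical index keeps rows 1 and 4
```

If no row were all-zero, -allZero would be all 0s and the result
would have no rows at all.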

> 
> Email obscured that, compounded by the fact that you didn't post a
> reproducible data object.

A faster and safer way would be to operate on one column
at a time (faster when there are many more rows than
columns) and to avoid apply (safer, because apply turns
the data.frame into a matrix whose storage.mode
might surprise you and lead to errors in the x==0 test).
E.g.,

  f1 <- function (dfrm) {
      isZero <- function(x) !is.na(x) & x == 0
      areAllColsZero <- isZero(dfrm[, 1])
      for (col in dfrm[, 2:12]) areAllColsZero <- areAllColsZero & 
          isZero(col)
      dfrm[!areAllColsZero, , drop = FALSE]
  }
You can use Reduce() instead of the loop, but the loop code
is easy to understand.
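For reference, a sketch of what the Reduce() variant might look
like. The name f2 and the cols argument are my own additions for
illustration, not part of the code above; the logic is meant to
match the loop.

```r
# Hypothetical f2: same test as the loop version, folded with Reduce().
f2 <- function(dfrm, cols = 1:12) {
    isZero <- function(x) !is.na(x) & x == 0
    # lapply over a data.frame visits one column at a time;
    # Reduce() then combines the per-column logical vectors with &.
    areAllColsZero <- Reduce(`&`, lapply(dfrm[cols], isZero))
    dfrm[!areAllColsZero, , drop = FALSE]
}

# Tiny made-up example: rows 2 and 3 are zero in the first two columns.
d <- data.frame(x = c(1, 0, 0), y = c(2, 0, 0), id = 1:3)
f2(d, cols = 1:2)   # keeps only row 1
```

I have not timed this against the loop; I would expect them to be
similar, since both do one vectorized pass per column.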

I made some fake data with the following function, in
which all but 12 rows are all-zero:
  makeData <- function (nrow) {
      rowNum <- seq_len(nrow)
      data.frame(lapply(structure(1:12, names = paste("X", 1:12, 
          sep = "")), function(i) as.integer(rowNum == (2 * i))), 
          Data1 = 1:nrow, Data2 = sqrt(1:nrow))
  }

To test this out:
  > dfrm <- makeData(1e6) # million rows, 12 to keep
  > system.time(r0 <- f0(dfrm))
     user  system elapsed 
    21.45    0.60   22.55 
  > system.time(r1 <- f1(dfrm)) # faster
     user  system elapsed 
     0.87    0.07    0.88 
  > identical(r0, r1) # gives same results
  [1] TRUE
  > dim(r0)
  [1] 12 14
  > r0
     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 Data1    Data2
  2   1  0  0  0  0  0  0  0  0   0   0   0     2 1.414214
  4   0  1  0  0  0  0  0  0  0   0   0   0     4 2.000000
  ...
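To illustrate the earlier remark about apply and storage.mode,
here is a small sketch on a made-up data frame with one numeric
and one character column. apply() converts a data.frame with
as.matrix(), and a single non-numeric column makes the whole
matrix character, so x == 0 becomes a string comparison:

```r
# Made-up data frame mixing a numeric and a character column.
d <- data.frame(a = c(0, 10), b = c("x", "y"))

m <- as.matrix(d)   # the conversion apply() performs on a data.frame
storage.mode(m)     # "character", not "double"
m[, "a"] == 0       # compares strings with "0"; a padded " 0" would not match

# Column-at-a-time access keeps the original numeric type.
d$a == 0
```

With only numeric columns, as in dfrm[, 1:12] here, the matrix
stays numeric, so f0 is fine for this data; the column-wise
version just does not depend on that.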

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> >
> >>
> >> I tried subset with variable1>"0"&variable2>"0", but it wasn't useful
> >> because it only kept the rows that didn't have a 0 in any of the
> >> variables; I only need to drop rows that have 0 in ALL of the variables
> >> simultaneously.
> >>
> >> Thanks,
> >>
> >> Ignacio
> >
> 
> 
> David Winsemius, MD
> West Hartford, CT