[R] Deleting rows with special character

Fri Nov 16 19:40:44 CET 2012

On Nov 16, 2012, at 8:26 AM, Sarah Goslee <sarah.goslee at gmail.com> wrote:

> Hi Peter,
> 
> On Fri, Nov 16, 2012 at 9:04 AM, Peter Kupfer <peter.kupfer at me.com> wrote:
>> Dear all,
>> maybe a simple problem but I found no solution for my problem.
>> I have a matrix Y with 23 000 rows and 220 columns. The entries are "A", "B" or "C".
> 
> A reproducible example with sample data is helpful.
> 
>> I want to extract all rows (as a matrix ) of the matrix Y where all entries of a row are (for example) "A".
> 
> Really? Why not just make a new matrix with the right number of "A" values?
> 
>> Is there any solution? I tried the stringr package but it doesn't work out.
> 
> Of course there is. Here's one option. But I'm not sure you've really
> stated your actual problem. This extracts the rows where all values
> are "A", and might at least get you started toward your real problem.
> 
> testdata <- matrix(c(
> "A", "B", "C",
> "B", "B", "B",
> "C", "A", "A",
> "A", "A", "A"),
> ncol=3, byrow=TRUE)
> 
> testdata.A <- testdata[apply(testdata, 1, function(x)all(x == "A")), ,
> drop=FALSE]

Using something like rowSums() might be faster in this case, based upon brief testing. 

A logical comparison such as testdata == "A" returns TRUE/FALSE values, which are treated as 1/0 in arithmetic. rowSums() applied to that comparison therefore counts the matches in each row, and you can subset the matrix to the rows where that count equals the number of columns, which indicates that all values in the row match your desired value.
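
A minimal sketch of that counting step on Sarah's small 4 x 3 example may make it easier to follow. The names m, is_A, counts, keep and res are only illustrative:

```r
# Small example matrix, same values as Sarah's testdata above
m <- matrix(c(
  "A", "B", "C",
  "B", "B", "B",
  "C", "A", "A",
  "A", "A", "A"),
  ncol = 3, byrow = TRUE)

is_A   <- m == "A"            # logical matrix, TRUE where the entry is "A"
counts <- rowSums(is_A)       # TRUE counts as 1, FALSE as 0: 1 0 2 3
keep   <- counts == ncol(m)   # TRUE only for rows where every entry matched

# drop = FALSE keeps the result a matrix even if only one row is selected
res <- m[keep, , drop = FALSE]
res
```

Only the last row has a count equal to ncol(m), so res is a 1 x 3 matrix of "A" values.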

# Create a 23000 * 220 matrix with random values.
set.seed(1)
testdata <- matrix(sample(c("A", "B", "C"), 23000*220, replace = TRUE), ncol = 220)

# Set 100 random rows to all "A"s
set.seed(2)
testdata[sample(23000, 100), ] <- rep("A", 220)

> system.time(Sub1 <-testdata[apply(testdata, 1, function(x)all(x == "A")), ,drop = FALSE])
   user  system elapsed 
  0.454   0.047   0.503 

> system.time(Sub2 <- testdata[rowSums(testdata == "A") == ncol(testdata), , drop = FALSE])
   user  system elapsed 
  0.089   0.001   0.090 

> str(Sub1)
 chr [1:100, 1:220] "A" "A" "A" "A" "A" "A" "A" "A" ...

> str(Sub2)
 chr [1:100, 1:220] "A" "A" "A" "A" "A" "A" "A" "A" ...

> identical(Sub1, Sub2)
[1] TRUE

See ?rowSums; it is implemented via a .Internal call, that is, compiled code, so it is fast.
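
One caveat, assuming the real matrix could contain missing values (the test data above does not): a comparison against NA yields NA, so the row count becomes NA and the subset would pick up a row of NAs. A hedged sketch of one way around that is to pass na.rm = TRUE, so that a row containing an NA simply falls short of ncol() and is excluded. The names m_na, keep_na and res_na are illustrative only:

```r
# Hypothetical example with one missing value in the second row
m_na <- matrix(c(
  "A", "A", "A",
  "A", NA,  "A",
  "B", "A", "A"),
  ncol = 3, byrow = TRUE)

# na.rm = TRUE drops the NA comparison results before summing, so the
# second row counts only 2 matches and does not equal ncol(m_na)
keep_na <- rowSums(m_na == "A", na.rm = TRUE) == ncol(m_na)

res_na <- m_na[keep_na, , drop = FALSE]
res_na
```

Whether a row with missing entries should count as "all A" depends on your data, so treat this as one possible choice rather than the only one.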

Regards,

Marc Schwartz