[R] can not extract rows which match a string
hpages using fredhutch.org
Thu Oct 3 23:32:43 CEST 2019
On 10/3/19 11:58, Ana Marija wrote:
> I have a dataframe (t1) with many columns, but the one I care about is this:
>  NA "Yes"
> it has these two values.
> I would like to remove from my dataframe t1 all rows which have "Yes"
> in t1$sex_chromosome_aneuploidy_f22019_0_0
> I tried selecting those rows with "Yes" via:
It's important that you realize that instead of removing rows with "Yes"
this actually keeps them.
> but I got t11 which has the exact same number of rows as t1.
which should not be outrageously unexpected. After all it's not entirely
impossible that when you selected the rows with "Yes" you selected them all.
> If I do:
> So there is for sure 620 rows which have "Yes".
This **seems** to indicate that all the rows contain "Yes". And this
would explain why when you selected the rows with "Yes" you selected
them all.
> How to remove those
> from my t1 data frame?
Unfortunately, this is a situation where we cannot trust the appearances.
Appearances: it **looks** like all the rows contain "Yes" and this seems
to be confirmed by the fact that selecting the rows with "Yes" didn't
drop any rows.
The truth: the truth is that there are some rows that don't contain
"Yes". However by default table() doesn't report counts for NAs so you
need to explicitly ask for that:
  > table(t1$sex_chromosome_aneuploidy_f22019_0_0, useNA="always")
So now you know how many rows to expect after removing those with "Yes".
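To see the difference on a small scale, here is a minimal sketch. The vector x is a made-up stand-in for your column (its values and counts are invented for illustration, they are not taken from your data):

```r
# Toy stand-in for t1$sex_chromosome_aneuploidy_f22019_0_0 (made-up values)
x <- c("Yes", NA, "Yes", NA, NA)

# Default behavior: the NAs are silently left out of the counts
default_counts <- table(x)

# With useNA="always" the NAs get an entry of their own
full_counts <- table(x, useNA="always")

print(default_counts)
print(full_counts)
```

The first table only has an entry for "Yes", so it can give the impression that every element is "Yes". The second one also counts the NAs.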
Another complication is that the == operator propagates NAs so it tends
to return a subscript that is not safe to use for subsetting because
it's contaminated with NAs.
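Here is a small sketch of that effect, using a hypothetical 5-row data frame df with a column named flag (both names and all values are made up for the example):

```r
# Toy data frame (hypothetical names and values)
df <- data.frame(id = 1:5,
                 flag = c("Yes", NA, "Yes", NA, NA),
                 stringsAsFactors = FALSE)

# == returns NA wherever flag is NA
idx <- df$flag == "Yes"
print(idx)

# Subsetting with this index returns one row per TRUE *and* one
# all-NA row per NA in the index, so no rows seem to be dropped
kept <- df[idx, ]
print(nrow(kept))
print(kept)
```

If t11 was obtained with a == comparison, this could be why it has the same number of rows as t1: the rows that came from an NA in the index show up as rows filled with NAs.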
Other people have suggested that you use
is.na(t1$sex_chromosome_aneuploidy_f22019_0_0) or other more complicated
things (like t1$sex_chromosome_aneuploidy_f22019_0_0 != "Yes" |
is.na(t1$sex_chromosome_aneuploidy_f22019_0_0)) to work around this.
However the simplest and safest way to translate "compute the index of
the rows that match string 'babar'" into R code is with:
t1$sex_chromosome_aneuploidy_f22019_0_0 %in% "babar"
Another advantage of using %in% is that you can have more than one
string on the right. For example
t1$sex_chromosome_aneuploidy_f22019_0_0 %in% c("babar", "foo")
will produce an index that can be used to select the rows that match
"babar" or "foo". To remove these rows, use
!(t1$sex_chromosome_aneuploidy_f22019_0_0 %in% c("babar", "foo"))
instead (parentheses around the %in% operation highly recommended for
readability).
The bottom line is that %in% is almost always better than == for
computing a subscript because it doesn't propagate NAs.
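As a quick illustration, here is a sketch on a hypothetical 5-row data frame df with a column named flag (names and values are made up, they only mimic a column that contains "Yes" and NA):

```r
# Toy data frame (hypothetical names and values)
df <- data.frame(id = 1:5,
                 flag = c("Yes", NA, "Yes", NA, NA),
                 stringsAsFactors = FALSE)

# %in% never returns NA: an NA on the left simply doesn't match "Yes"
is_yes <- df$flag %in% "Yes"
print(is_yes)

# Keep the rows that contain "Yes"
with_yes <- df[is_yes, ]

# Remove the rows that contain "Yes" (i.e. keep all the others)
without_yes <- df[!(df$flag %in% "Yes"), ]

print(nrow(with_yes))
print(nrow(without_yes))
```

Applied to your case, the same pattern with t1$sex_chromosome_aneuploidy_f22019_0_0 in place of df$flag should give you a data frame without the "Yes" rows.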
Hope this helps,
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319