[R] Subsetting data where the condition is that the value of some column contains some substring
Max Bane
max.bane at gmail.com
Sat Mar 21 01:25:39 CET 2009
I have some data that looks like this:
> dataP
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746
I would like to extract the subset of this data in which the value of
the "input" column contains the substring "her". I was thinking I
could use the grep function to test for the presence of this
substring. For instance, if a string does not contain it, then grep
returns a zero length integer vector:
> grep("her", "give(my sister, it)")
integer(0)
And if the string does contain the substring, grep returns a vector of
the indices where the substring is located:
> grep("her", "give(her, it)")
[1] 1
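One thing worth checking at this point, as a minimal sketch: when grep is given a vector of several strings, it returns the positions of the matching elements within that vector, not character offsets inside a string. That would also account for the 1 above, since the single string passed in is element 1. The vector x below is a hypothetical stand-in for the "input" column; regexpr is the base R function that reports where in each string the match starts.

```r
# Hypothetical three-element vector modelled on the "input" column.
x <- c("give(my sister, it)", "give(her, it)", "donate(her, it)")

# grep() returns the indices of the elements of x that contain the pattern.
idx <- grep("her", x)
print(idx)

# regexpr() returns, per element, the character offset of the first match,
# or -1 when the element does not contain the pattern.
pos <- as.integer(regexpr("her", x, fixed = TRUE))
print(pos)
```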
I can thus test for the presence of the substring by converting the
length of the result of grep into a boolean:
> as.logical(length(grep("her", "give(my sister, it)")))
[1] FALSE
> as.logical(length(grep("her", "give(her, it)")))
[1] TRUE
> as.logical(length(grep("her", "give(her, it)"))) == TRUE
[1] TRUE
> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
[1] FALSE
I would like to use this test as a criterion for constructing a subset
of my data. Unfortunately, it does not work:
> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746
As you can see, I get back the whole data set, rather than just the
subset where the input column contains "her". And if I invert the
test, which I would expect to give the subset *not* containing "her",
I instead get the empty subset, rather mysteriously:
> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
[1] input output corpusFreq pvolOT pvolRatioOT
<0 rows> (or 0-length row.names)
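A possible explanation, sketched under the assumption that subset evaluates the condition once against the whole column rather than row by row: grep then sees every string at once, length() collapses the result to a single number, and the condition becomes one TRUE (or one FALSE) that is recycled across all rows. The data frame d below is a small hypothetical stand-in for dataP.

```r
# Hypothetical stand-in for dataP with only the columns needed here.
d <- data.frame(
  input = c("give(my sister, it)", "give(her, it)", "donate(her, it)"),
  corpusFreq = c(100, 100, 100),
  stringsAsFactors = FALSE
)

# The test yields ONE logical value for the whole column, not one per row.
cond <- as.logical(length(grep("her", d$input)))
print(length(cond))

# A single TRUE is recycled, so every row is kept ...
all_rows <- subset(d, as.logical(length(grep("her", input))) == TRUE)

# ... and a single FALSE is recycled, so no row is kept.
no_rows <- subset(d, as.logical(length(grep("her", input))) == FALSE)

print(nrow(all_rows))
print(nrow(no_rows))
```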
The type of the input column is definitely character. To be doubly sure, I also tried
> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
which does the same thing.
Could somebody with more R experience than I have please explain what
I am doing wrong here? I'll be much obliged.
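For reference, a minimal sketch of a per-row test, assuming a version of R that provides grepl (it returns one TRUE/FALSE per element instead of a vector of indices). The data frame d is again a hypothetical stand-in for dataP, and fixed = TRUE is an added choice so that "her" is treated as a literal substring rather than a regular expression.

```r
# Hypothetical stand-in for dataP with only the columns needed here.
d <- data.frame(
  input = c("give(my sister, it)", "give(her, it)", "donate(her, it)"),
  corpusFreq = c(100, 100, 100),
  stringsAsFactors = FALSE
)

# grepl() gives one logical value per row, which is what subset() expects.
with_her <- subset(d, grepl("her", input, fixed = TRUE))
without_her <- subset(d, !grepl("her", input, fixed = TRUE))

# Equivalent selection using the indices that grep() returns.
same <- d[grep("her", d$input, fixed = TRUE), ]

print(with_her)
print(without_her)
```

Negating the logical vector from grepl is the safer way to get the complement: indexing with negated grep indices, as in d[-grep(...), ], returns zero rows when nothing matches.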
--
Max Bane
PhD Student, Linguistics
University of Chicago
bane at uchicago.edu