[R] Subsetting data where the condition is that the value of some column contains some substring
jim holtman
jholtman at gmail.com
Sat Mar 21 01:57:03 CET 2009
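Try using regexpr instead. It returns one value per element of the column
(-1 where there is no match), so comparing the result against -1 gives a
logical vector you can index the rows with: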
> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
+ give(mysister,theoldbook) P 47.0 56016 0.1543651
+ donate(her,thebook) P 48.7 68928 0.1899471
+ give(mysister,thebook) P 73.4 80136 0.2208333
+ donate(mysister,theoldbook) P 79.0 57024 0.1571429
+ give(mysister,it) P 100.0 132408 0.3648810
+ give(her,it) P 100.0 157248 0.4333333
+ donate(mysister,it) P 100.0 130720 0.3602293
+ give(her,thebook) P 5.7 65232 0.1797619
+ donate(her,it) P 100.0 152064 0.4190476
+ give(mylittlesister,thebook) P 91.8 112032 0.3087302
+ donate(mylittlesister,thebook) P 98.4 114048 0.3142857
+ donate(mysister,thebook) P 94.4 82800 0.2281746"), header=TRUE)
> # use regexpr
> matched <- regexpr("her", x$input) != -1
> notMatched <- !matched
> x[matched,]
                input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook)      P       48.7  68928   0.1899471
6        give(her,it)      P      100.0 157248   0.4333333
8   give(her,thebook)      P        5.7  65232   0.1797619
9      donate(her,it)      P      100.0 152064   0.4190476
> x[notMatched,]
                            input output corpusFreq pvolOT pvolRatioOT
1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
3          give(mysister,thebook)      P       73.4  80136   0.2208333
4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
5               give(mysister,it)      P      100.0 132408   0.3648810
7             donate(mysister,it)      P      100.0 130720   0.3602293
10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
12       donate(mysister,thebook)      P       94.4  82800   0.2281746
>
>
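A note on why the subset() call quoted below returns every row, as a rough
sketch rather than a definitive diagnosis. grep() reports which elements of
a vector match, not where in a string the match occurs, so
length(grep(...)) over the whole column is a single number. The test
therefore collapses to one TRUE or FALSE, which is recycled across all rows.
The vector `input` here is a small stand-in built from a few of the values
above, and grepl() is the base R function that returns one logical per
element:

```r
# Toy stand-in for the 'input' column (assumption: a plain character vector).
input <- c("give(mysister,it)", "donate(her,thebook)", "give(her,it)")

# grep() returns the positions of the matching ELEMENTS, not character offsets.
idx <- grep("her", input)

# length(idx) is one number, so this is a single logical for the whole column.
# Used as a row filter, that one value is recycled across every row.
collapsed <- as.logical(length(idx))

# grepl() gives one logical per element, which is what row subsetting needs.
# fixed = TRUE treats "her" as a literal substring rather than a regex.
per_row <- grepl("her", input, fixed = TRUE)

# Keep only the elements that contain "her".
input[per_row]
```

With a data frame the same idea would read x[grepl("her", x$input), ] for
the matching rows and x[!grepl("her", x$input), ] for the rest.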
On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.bane at gmail.com> wrote:
> I have some data that looks like this:
>
>> dataP
> input output corpusFreq pvolOT pvolRatioOT
> 1 give(my sister, the old book) P 47.0 56016 0.1543651
> 5 donate(her, the book) P 48.7 68928 0.1899471
> 9 give(my sister, the book) P 73.4 80136 0.2208333
> 13 donate(my sister, the old book) P 79.0 57024 0.1571429
> 20 give(my sister, it) P 100.0 132408 0.3648810
> 21 give(her, it) P 100.0 157248 0.4333333
> 24 donate(my sister, it) P 100.0 130720 0.3602293
> 28 give(her, the book) P 5.7 65232 0.1797619
> 31 donate(her, it) P 100.0 152064 0.4190476
> 35 give(my little sister, the book) P 91.8 112032 0.3087302
> 39 donate(my little sister, the book) P 98.4 114048 0.3142857
> 43 donate(my sister, the book) P 94.4 82800 0.2281746
>
> I would like to extract the subset of this data in which the value of
> the "input" column contains the substring "her". I was thinking I
> could use the grep function to test for the presence of this
> substring. For instance, if a string does not contain it, then grep
> returns a zero length integer vector:
>
>> grep("her", "give(my sister, it)")
> integer(0)
>
> And if the string does contain the substring, grep returns a vector of
> the indices where the substring is located:
>
>> grep("her", "give(her, it)")
> [1] 1
>
> I can thus test for the presence of the substring by converting the
> length of the result of grep into a boolean:
>
>> as.logical(length(grep("her", "give(my sister, it)")))
> [1] FALSE
>> as.logical(length(grep("her", "give(her, it)")))
> [1] TRUE
>> as.logical(length(grep("her", "give(her, it)"))) == TRUE
> [1] TRUE
>> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
> [1] FALSE
>
> I would like to use this test as a criterion for constructing a subset
> of my data. Unfortunately, it does not work:
>
>> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
> input output corpusFreq pvolOT pvolRatioOT
> 1 give(my sister, the old book) P 47.0 56016 0.1543651
> 5 donate(her, the book) P 48.7 68928 0.1899471
> 9 give(my sister, the book) P 73.4 80136 0.2208333
> 13 donate(my sister, the old book) P 79.0 57024 0.1571429
> 20 give(my sister, it) P 100.0 132408 0.3648810
> 21 give(her, it) P 100.0 157248 0.4333333
> 24 donate(my sister, it) P 100.0 130720 0.3602293
> 28 give(her, the book) P 5.7 65232 0.1797619
> 31 donate(her, it) P 100.0 152064 0.4190476
> 35 give(my little sister, the book) P 91.8 112032 0.3087302
> 39 donate(my little sister, the book) P 98.4 114048 0.3142857
> 43 donate(my sister, the book) P 94.4 82800 0.2281746
>
> As you can see, I get back the whole data set, rather than just the
> subset where the input column contains "her". And if I invert the
> test, which I would expect to give the subset *not* containing "her",
> I instead get the empty subset, rather mysteriously:
>
>> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
> [1] input output corpusFreq pvolOT pvolRatioOT
> <0 rows> (or 0-length row.names)
>
> The type of the input column is definitely character. To be double sure:
>
>> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
>
> does the same thing.
>
> Could somebody with more R experience than I have please explain what
> I am doing wrong here? I'll be much obliged.
>
> --
> Max Bane
> PhD Student, Linguistics
> University of Chicago
> bane at uchicago.edu
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?