[R] Subsetting data where the condition is that the value of some column contains some substring

Sat Mar 21 01:57:03 CET 2009

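Try using regexpr instead. It tests each element of the column separately and returns -1 where there is no match, so comparing against -1 gives one TRUE/FALSE per row (spaces removed from the input strings below so that read.table can split the columns on whitespace):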

> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
+ give(mysister,theoldbook)      P       47.0  56016   0.1543651
+ donate(her,thebook)      P       48.7  68928   0.1899471
+ give(mysister,thebook)      P       73.4  80136   0.2208333
+ donate(mysister,theoldbook)      P       79.0  57024   0.1571429
+ give(mysister,it)      P      100.0 132408   0.3648810
+ give(her,it)      P      100.0 157248   0.4333333
+ donate(mysister,it)      P      100.0 130720   0.3602293
+ give(her,thebook)      P        5.7  65232   0.1797619
+ donate(her,it)      P      100.0 152064   0.4190476
+ give(mylittlesister,thebook)      P       91.8 112032   0.3087302
+ donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
+ donate(mysister,thebook)      P       94.4  82800   0.2281746"), header=TRUE)
> # use regexpr
> matched <- regexpr("her", x$input) != -1
> notMatched <- !matched
> x[matched,]
                input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook)      P       48.7  68928   0.1899471
6        give(her,it)      P      100.0 157248   0.4333333
8   give(her,thebook)      P        5.7  65232   0.1797619
9      donate(her,it)      P      100.0 152064   0.4190476
> x[notMatched,]
                            input output corpusFreq pvolOT pvolRatioOT
1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
3          give(mysister,thebook)      P       73.4  80136   0.2208333
4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
5               give(mysister,it)      P      100.0 132408   0.3648810
7             donate(mysister,it)      P      100.0 130720   0.3602293
10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
12       donate(mysister,thebook)      P       94.4  82800   0.2281746
>
>
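
As for why the original subset() call returns everything: grep() applied to the whole input column returns the indices of the matching rows, so length() of that result is a single number for the entire column. as.logical() turns it into one TRUE, which gets recycled across every row (and comparing it with FALSE gives one FALSE, hence zero rows). A minimal sketch of this, assuming a four-row stand-in copied from the data above rather than the full data set, and using grepl(), which in current versions of R returns one TRUE/FALSE per element:

```r
# Four-row stand-in for the data above (illustrative, not the full data set)
x <- data.frame(
  input = c("give(mysister,theoldbook)", "donate(her,thebook)",
            "give(mysister,it)", "give(her,it)"),
  corpusFreq = c(47.0, 48.7, 100.0, 100.0),
  stringsAsFactors = FALSE
)

# grep() on a whole column returns the indices of the matching elements,
# so length() of the result is one number for the entire column
n_hits <- length(grep("her", x$input))
whole_column_test <- as.logical(n_hits)   # a single logical value, not one per row

# A length-one TRUE is recycled, so every row is selected
all_rows <- subset(x, whole_column_test)

# grepl() returns one TRUE/FALSE per row; fixed = TRUE treats "her" as a
# literal substring rather than a regular expression
has_her <- grepl("her", x$input, fixed = TRUE)

# Equivalent row-wise test using regexpr(), as in the session above
has_her_regexpr <- regexpr("her", x$input) != -1

with_her    <- subset(x, has_her)
without_her <- subset(x, !has_her)
```

Either logical vector can also be used directly as a row index, e.g. x[has_her, ] and x[!has_her, ].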

On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.bane at gmail.com> wrote:
> I have some data that looks like this:
>
>> dataP
>                                input output corpusFreq pvolOT pvolRatioOT
> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
> 5               donate(her, the book)      P       48.7  68928   0.1899471
> 9           give(my sister, the book)      P       73.4  80136   0.2208333
> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
> 20                give(my sister, it)      P      100.0 132408   0.3648810
> 21                      give(her, it)      P      100.0 157248   0.4333333
> 24              donate(my sister, it)      P      100.0 130720   0.3602293
> 28                give(her, the book)      P        5.7  65232   0.1797619
> 31                    donate(her, it)      P      100.0 152064   0.4190476
> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>
> I would like to extract the subset of this data in which the value of
> the "input" column contains the substring "her". I was thinking I
> could use the grep function to test for the presence of this
> substring. For instance, if a string does not contain it, then grep
> returns a zero length integer vector:
>
>> grep("her", "give(my sister, it)")
> integer(0)
>
> And if the string does contain the substring, grep returns a vector of
> the indices where the substring is located:
>
>> grep("her", "give(her, it)")
> [1] 1
>
> I can thus test for the presence of the substring by converting the
> length of the result of grep into a boolean:
>
>> as.logical(length(grep("her", "give(my sister, it)")))
> [1] FALSE
>> as.logical(length(grep("her", "give(her, it)")))
> [1] TRUE
>> as.logical(length(grep("her", "give(her, it)"))) == TRUE
> [1] TRUE
>> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
> [1] FALSE
>
> I would like to use this test as a criterion for constructing a subset
> of my data. Unfortunately, it does not work:
>
>> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
>                                input output corpusFreq pvolOT pvolRatioOT
> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
> 5               donate(her, the book)      P       48.7  68928   0.1899471
> 9           give(my sister, the book)      P       73.4  80136   0.2208333
> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
> 20                give(my sister, it)      P      100.0 132408   0.3648810
> 21                      give(her, it)      P      100.0 157248   0.4333333
> 24              donate(my sister, it)      P      100.0 130720   0.3602293
> 28                give(her, the book)      P        5.7  65232   0.1797619
> 31                    donate(her, it)      P      100.0 152064   0.4190476
> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>
> As you can see, I get back the whole data set, rather than just the
> subset where the input column contains "her". And if I invert the
> test, which I would expect to give the subset *not* containing "her",
> I instead get the empty subset, rather mysteriously:
>
>> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
> [1] input       output      corpusFreq  pvolOT      pvolRatioOT
> <0 rows> (or 0-length row.names)
>
> The type of the input column is definitely character. To be double sure:
>
>> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
>
> does the same thing.
>
> Could somebody with more R experience than I have please explain what
> I am doing wrong here? I'll be much obliged.
>
> --
> Max Bane
> PhD Student, Linguistics
> University of Chicago
> bane at uchicago.edu
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?