[R] Subsetting data where the condition is that the value of some column contains some substring

Sat Mar 21 02:20:24 CET 2009

Thanks, Jim (and Mark, who replied off-list) -- that does the trick. I
had tried using an index expression with grep, but that failed in the
same way as the subset method. It is still rather mysterious why this
works with regexpr but not with grep :)
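
For anyone reading this in the archive, here is a likely explanation, sketched on a small data frame `d` (the name is hypothetical; its three rows are borrowed from the data quoted below). grep returns the positions of the matching elements of a vector, so length(grep(...)) is a single number for the whole column, and the one TRUE or FALSE it yields is recycled across every row. regexpr returns one value per element, which is what a row filter needs. grepl, where your version of R provides it, gives the per-element logical vector directly.

```r
# Hypothetical miniature of the data in this thread (three of the original rows).
d <- data.frame(input = c("give(my sister, it)", "give(her, it)",
                          "donate(her, the book)"),
                corpusFreq = c(100.0, 100.0, 48.7),
                stringsAsFactors = FALSE)

# grep() gives the positions of matching elements (here 2 and 3), so
# length() is 2 and as.logical(2) is one TRUE for the entire column.
# subset() recycles that single value: every row is kept, or none.
whole <- subset(d, as.logical(length(grep("her", input))) == TRUE)
none  <- subset(d, as.logical(length(grep("her", input))) == FALSE)

# Per-element tests return one logical value per row instead.
by_regexpr <- regexpr("her", d$input) != -1
by_grepl   <- grepl("her", d$input, fixed = TRUE)

# The positions from grep() can also be used directly as a row index.
by_index <- d[grep("her", d$input), ]

print(subset(d, by_grepl))
print(subset(d, !by_grepl))
```

Either logical vector can be negated with ! to get the complementary subset, which is not possible with the bare positions from grep when nothing matches.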

-Max

On Fri, Mar 20, 2009 at 7:57 PM, jim holtman <jholtman at gmail.com> wrote:
> Try using regexpr instead:
>
>> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
> + give(mysister,theoldbook)      P       47.0  56016   0.1543651
> + donate(her,thebook)      P       48.7  68928   0.1899471
> + give(mysister,thebook)      P       73.4  80136   0.2208333
> + donate(mysister,theoldbook)      P       79.0  57024   0.1571429
> + give(mysister,it)      P      100.0 132408   0.3648810
> + give(her,it)      P      100.0 157248   0.4333333
> + donate(mysister,it)      P      100.0 130720   0.3602293
> + give(her,thebook)      P        5.7  65232   0.1797619
> + donate(her,it)      P      100.0 152064   0.4190476
> + give(mylittlesister,thebook)      P       91.8 112032   0.3087302
> + donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
> + donate(mysister,thebook)      P       94.4  82800   0.2281746"), header=TRUE)
>> # use regexpr
>> matched <- regexpr("her", x$input) != -1
>> notMatched <- !matched
>> x[matched,]
>                input output corpusFreq pvolOT pvolRatioOT
> 2 donate(her,thebook)      P       48.7  68928   0.1899471
> 6        give(her,it)      P      100.0 157248   0.4333333
> 8   give(her,thebook)      P        5.7  65232   0.1797619
> 9      donate(her,it)      P      100.0 152064   0.4190476
>> x[notMatched,]
>                            input output corpusFreq pvolOT pvolRatioOT
> 1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
> 3          give(mysister,thebook)      P       73.4  80136   0.2208333
> 4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
> 5               give(mysister,it)      P      100.0 132408   0.3648810
> 7             donate(mysister,it)      P      100.0 130720   0.3602293
> 10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
> 11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
> 12       donate(mysister,thebook)      P       94.4  82800   0.2281746
>>
>>
>
>
> On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.bane at gmail.com> wrote:
>> I have some data that looks like this:
>>
>>> dataP
>>                                input output corpusFreq pvolOT pvolRatioOT
>> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
>> 5               donate(her, the book)      P       48.7  68928   0.1899471
>> 9           give(my sister, the book)      P       73.4  80136   0.2208333
>> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
>> 20                give(my sister, it)      P      100.0 132408   0.3648810
>> 21                      give(her, it)      P      100.0 157248   0.4333333
>> 24              donate(my sister, it)      P      100.0 130720   0.3602293
>> 28                give(her, the book)      P        5.7  65232   0.1797619
>> 31                    donate(her, it)      P      100.0 152064   0.4190476
>> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
>> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
>> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>>
>> I would like to extract the subset of this data in which the value of
>> the "input" column contains the substring "her". I was thinking I
>> could use the grep function to test for the presence of this
>> substring. For instance, if a string does not contain it, then grep
>> returns a zero length integer vector:
>>
>>> grep("her", "give(my sister, it)")
>> integer(0)
>>
>> And if the string does contain the substring, grep returns a vector of
>> the indices where the substring is located:
>>
>>> grep("her", "give(her, it)")
>> [1] 1
>>
>> I can thus test for the presence of the substring by converting the
>> length of the result of grep into a boolean:
>>
>>> as.logical(length(grep("her", "give(my sister, it)")))
>> [1] FALSE
>>> as.logical(length(grep("her", "give(her, it)")))
>> [1] TRUE
>>> as.logical(length(grep("her", "give(her, it)"))) == TRUE
>> [1] TRUE
>>> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
>> [1] FALSE
>>
>> I would like to use this test as a criterion for constructing a subset
>> of my data. Unfortunately, it does not work:
>>
>>> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
>>                                input output corpusFreq pvolOT pvolRatioOT
>> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
>> 5               donate(her, the book)      P       48.7  68928   0.1899471
>> 9           give(my sister, the book)      P       73.4  80136   0.2208333
>> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
>> 20                give(my sister, it)      P      100.0 132408   0.3648810
>> 21                      give(her, it)      P      100.0 157248   0.4333333
>> 24              donate(my sister, it)      P      100.0 130720   0.3602293
>> 28                give(her, the book)      P        5.7  65232   0.1797619
>> 31                    donate(her, it)      P      100.0 152064   0.4190476
>> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
>> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
>> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>>
>> As you can see, I get back the whole data set, rather than just the
>> subset where the input column contains "her". And if I invert the
>> test, which I would expect to give the subset *not* containing "her",
>> I instead get the empty subset, rather mysteriously:
>>
>>> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
>> [1] input       output      corpusFreq  pvolOT      pvolRatioOT
>> <0 rows> (or 0-length row.names)
>>
>> The type of the input column is definitely character. To be doubly sure:
>>
>>> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
>>
>> does the same thing.
>>
>> Could somebody with more R experience than I have please explain what
>> I am doing wrong here? I'll be much obliged.
>>
>> --
>> Max Bane
>> PhD Student, Linguistics
>> University of Chicago
>> bane at uchicago.edu
>>
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>

-- 
Max Bane
PhD Student, Linguistics
University of Chicago
bane at uchicago.edu