[R] Subsetting data where the condition is that the value of some column contains some substring

Sat Mar 21 02:55:22 CET 2009

Aha, I get it now. I was under the mistaken, intuitive impression that
the subset condition was evaluated separately for each row. It is
actually evaluated once, on the whole column, and must work out to a
logical vector with one element per row of the data to be subsetted.
My grep test collapsed to a single TRUE or FALSE, which was then
recycled across every row. That is, in hindsight, perfectly in
character for R.
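
For later readers, a minimal sketch of the difference. It assumes a small stand-in data frame `d` (hypothetical, built from three of the input strings in this thread, not the real dataP). The condition handed to subset is evaluated once against the whole column: a length-one result is recycled across all rows, while a result with one element per row selects row by row.

```r
# Hypothetical stand-in for dataP; only the input column matters here.
d <- data.frame(input = c("give(my sister, it)", "give(her, it)", "donate(her, it)"),
                corpusFreq = c(100.0, 100.0, 100.0),
                stringsAsFactors = FALSE)

# grep sees the whole column at once, so this collapses to a single value.
scalar_cond <- as.logical(length(grep("her", d$input)))
length(scalar_cond)                  # 1

# A length-one logical is recycled: all rows are kept, or all are dropped.
all_rows <- subset(d, scalar_cond)   # 3 rows
no_rows  <- subset(d, !scalar_cond)  # 0 rows

# regexpr gives one value per element, so the comparison is per row.
row_cond <- regexpr("her", d$input) != -1
her_rows <- subset(d, row_cond)      # the 2 rows containing "her"
```

This mirrors the two symptoms in the original question: the whole data set for the `== TRUE` test and the empty subset for the `== FALSE` test.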

-Max

On Fri, Mar 20, 2009 at 8:39 PM, jim holtman <jholtman at gmail.com> wrote:
> grep and regexpr return different values.  regexpr returns a vector of
> the same length as the input, with -1 wherever there is no match, and
> this can be used to construct a logical subscript.  grep returns a
> vector of the indices of only the matching elements, so it has length
> zero if there are no matches.  That makes it harder to create the
> subsets: you have to test for zero length and then do something special.
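
To make the difference concrete, here is a small sketch using three of the input strings from the example data. The pattern "zzz" is an assumption chosen purely to force a no-match case. It shows the shape of each return value and where the zero-length result causes trouble, namely with negative ("everything except") indexing.

```r
input <- c("give(mysister,it)", "give(her,it)", "donate(her,it)")

# grep: indices of the matching elements only, so the length varies.
grep("her", input)                  # 2 3
grep("zzz", input)                  # integer(0)

# regexpr: one value per element, -1 where there is no match.
as.vector(regexpr("her", input))    # -1 6 8

# Positive indexing with grep is fine, even when nothing matches.
hits <- input[grep("her", input)]

# Negative indexing is the trap: -integer(0) is still integer(0), so
# "everything except the matches" comes back empty when nothing matches.
none  <- grep("zzz", input)
wrong <- input[-none]               # character(0), not all three strings

# The logical vector built from regexpr has no such special case.
idx   <- regexpr("zzz", input) != -1
right <- input[!idx]                # all three strings
```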
>
> On Fri, Mar 20, 2009 at 9:20 PM, Max Bane <max.bane at gmail.com> wrote:
>> Thanks, Jim (and Mark, who replied off-list) -- that does the trick. I
>> had tried using an index expression with grep, but that failed in the
>> same way as the subset method. It is still rather mysterious why this
>> works with regexpr but not with grep :)
>>
>> -Max
>>
>> On Fri, Mar 20, 2009 at 7:57 PM, jim holtman <jholtman at gmail.com> wrote:
>>> Try using regexpr instead:
>>>
>>>> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
>>> + give(mysister,theoldbook)      P       47.0  56016   0.1543651
>>> + donate(her,thebook)      P       48.7  68928   0.1899471
>>> + give(mysister,thebook)      P       73.4  80136   0.2208333
>>> + donate(mysister,theoldbook)      P       79.0  57024   0.1571429
>>> + give(mysister,it)      P      100.0 132408   0.3648810
>>> + give(her,it)      P      100.0 157248   0.4333333
>>> + donate(mysister,it)      P      100.0 130720   0.3602293
>>> + give(her,thebook)      P        5.7  65232   0.1797619
>>> + donate(her,it)      P      100.0 152064   0.4190476
>>> + give(mylittlesister,thebook)      P       91.8 112032   0.3087302
>>> + donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
>>> + donate(mysister,thebook)      P       94.4  82800   0.2281746"), header=TRUE)
>>>> # use regexpr
>>>> matched <- regexpr("her", x$input) != -1
>>>> notMatched <- !matched
>>>> x[matched,]
>>>                input output corpusFreq pvolOT pvolRatioOT
>>> 2 donate(her,thebook)      P       48.7  68928   0.1899471
>>> 6        give(her,it)      P      100.0 157248   0.4333333
>>> 8   give(her,thebook)      P        5.7  65232   0.1797619
>>> 9      donate(her,it)      P      100.0 152064   0.4190476
>>>> x[notMatched,]
>>>                            input output corpusFreq pvolOT pvolRatioOT
>>> 1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
>>> 3          give(mysister,thebook)      P       73.4  80136   0.2208333
>>> 4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
>>> 5               give(mysister,it)      P      100.0 132408   0.3648810
>>> 7             donate(mysister,it)      P      100.0 130720   0.3602293
>>> 10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
>>> 11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
>>> 12       donate(mysister,thebook)      P       94.4  82800   0.2281746
>>>
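
A note for later readers: current versions of R also provide grepl, which returns the logical vector directly, so the comparison against -1 is not needed. A minimal sketch, assuming a cut-down three-row copy of the data above (named `x3` here, a hypothetical name, so that `x` is left alone). The `fixed = TRUE` argument is an added choice that treats "her" as a literal substring rather than a regular expression.

```r
# Cut-down copy of the example data; x3 is a hypothetical name.
x3 <- read.table(text = "input output corpusFreq
give(mysister,it) P 100.0
give(her,it) P 100.0
donate(her,it) P 100.0", header = TRUE, stringsAsFactors = FALSE)

# grepl returns one TRUE/FALSE per row, so it can go straight into subset().
with_her    <- subset(x3, grepl("her", input, fixed = TRUE))
without_her <- subset(x3, !grepl("her", input, fixed = TRUE))
```

Because the condition has one element per row, negating it gives the complementary subset without any zero-length special case.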
>>> On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.bane at gmail.com> wrote:
>>>> I have some data that looks like this:
>>>>
>>>>> dataP
>>>>                                input output corpusFreq pvolOT pvolRatioOT
>>>> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
>>>> 5               donate(her, the book)      P       48.7  68928   0.1899471
>>>> 9           give(my sister, the book)      P       73.4  80136   0.2208333
>>>> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
>>>> 20                give(my sister, it)      P      100.0 132408   0.3648810
>>>> 21                      give(her, it)      P      100.0 157248   0.4333333
>>>> 24              donate(my sister, it)      P      100.0 130720   0.3602293
>>>> 28                give(her, the book)      P        5.7  65232   0.1797619
>>>> 31                    donate(her, it)      P      100.0 152064   0.4190476
>>>> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
>>>> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
>>>> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>>>>
>>>> I would like to extract the subset of this data in which the value of
>>>> the "input" column contains the substring "her". I was thinking I
>>>> could use the grep function to test for the presence of this
>>>> substring. For instance, if a string does not contain it, then grep
>>>> returns a zero length integer vector:
>>>>
>>>>> grep("her", "give(my sister, it)")
>>>> integer(0)
>>>>
>>>> And if the string does contain the substring, grep returns a vector of
>>>> the indices of the matching elements (here 1, since the input is a
>>>> single string):
>>>>
>>>>> grep("her", "give(her, it)")
>>>> [1] 1
>>>>
>>>> I can thus test for the presence of the substring by converting the
>>>> length of the result of grep into a boolean:
>>>>
>>>>> as.logical(length(grep("her", "give(my sister, it)")))
>>>> [1] FALSE
>>>>> as.logical(length(grep("her", "give(her, it)")))
>>>> [1] TRUE
>>>>> as.logical(length(grep("her", "give(her, it)"))) == TRUE
>>>> [1] TRUE
>>>>> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
>>>> [1] FALSE
>>>>
>>>> I would like to use this test as a criterion for constructing a subset
>>>> of my data. Unfortunately, it does not work:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
>>>>                                input output corpusFreq pvolOT pvolRatioOT
>>>> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
>>>> 5               donate(her, the book)      P       48.7  68928   0.1899471
>>>> 9           give(my sister, the book)      P       73.4  80136   0.2208333
>>>> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
>>>> 20                give(my sister, it)      P      100.0 132408   0.3648810
>>>> 21                      give(her, it)      P      100.0 157248   0.4333333
>>>> 24              donate(my sister, it)      P      100.0 130720   0.3602293
>>>> 28                give(her, the book)      P        5.7  65232   0.1797619
>>>> 31                    donate(her, it)      P      100.0 152064   0.4190476
>>>> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
>>>> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
>>>> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>>>>
>>>> As you can see, I get back the whole data set, rather than just the
>>>> subset where the input column contains "her". And if I invert the
>>>> test, which I would expect to give the subset *not* containing "her",
>>>> I instead get the empty subset, rather mysteriously:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
>>>> [1] input       output      corpusFreq  pvolOT      pvolRatioOT
>>>> <0 rows> (or 0-length row.names)
>>>>
>>>> The type of the input column is definitely character. To be double sure:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
>>>>
>>>> does the same thing.
>>>>
>>>> Could somebody with more R experience than I have please explain what
>>>> I am doing wrong here? I'll be much obliged.
>>>>
>>>> --
>>>> Max Bane
>>>> PhD Student, Linguistics
>>>> University of Chicago
>>>> bane at uchicago.edu
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?

-- 
Max Bane
PhD Student, Linguistics
University of Chicago
bane at uchicago.edu