[R] Subsetting data where the condition is that the value of some column contains some substring
Max Bane
max.bane at gmail.com
Sat Mar 21 02:55:22 CET 2009
Aha, I get it now. I was under the mistaken, if intuitive, impression
that the subset condition was evaluated element by element. I guess it
must instead be evaluated once, on the whole column, and work out to a
vector of booleans, each element of which says whether the
corresponding row of the data is kept. That is, in hindsight,
perfectly in character for R.
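To make that concrete, here is a minimal sketch on a cut-down, made-up data frame (the name `d` and its three rows are illustrative, not the original dataP). It assumes a version of R that provides grepl(), which returns one TRUE/FALSE per input element; the regexpr(...) != -1 test from Jim's reply should give the same vector.

```r
# Illustrative data only; not the original dataP.
d <- data.frame(
  input = c("give(my sister, it)", "give(her, it)", "donate(her, it)"),
  corpusFreq = c(100.0, 100.0, 100.0),
  stringsAsFactors = FALSE
)

# The condition is computed once for the whole column and yields
# one logical value per row.
keep <- grepl("her", d$input, fixed = TRUE)
keep_regexpr <- regexpr("her", d$input, fixed = TRUE) != -1

# subset() keeps the rows whose logical value is TRUE.
with_her <- subset(d, grepl("her", input, fixed = TRUE))
without_her <- subset(d, !grepl("her", input, fixed = TRUE))
```

fixed = TRUE treats "her" as a literal substring rather than a regular expression, which is all that is needed here.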
-Max
On Fri, Mar 20, 2009 at 8:39 PM, jim holtman <jholtman at gmail.com> wrote:
> grep and regexpr return different values. regexpr returns a vector of
> the same length as the input (with -1 where there is no match), and
> this can be used to construct a logical subscript. grep returns a
> vector of the indices of only the matching elements, so it can have a
> length of zero if there are no matches. That makes it harder to create
> the subsets: you have to test for zero length and then do something
> special.
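A small sketch of that difference, using a made-up character vector `s` (hypothetical, not taken from the data in this thread). The variable names are only for illustration.

```r
s <- c("give(my sister, it)", "give(her, it)", "donate(her, it)")

# regexpr: one result per element, -1 where there is no match.
pos <- regexpr("her", s, fixed = TRUE)
hit <- pos != -1                       # logical vector, same length as s

# grep: only the indices of the matching elements.
idx <- grep("her", s, fixed = TRUE)
none <- grep("xyz", s, fixed = TRUE)   # zero-length when nothing matches

# Positive indexing with grep works, but negating an empty index
# selects nothing instead of everything, hence the special case.
kept_by_grep <- s[idx]
dropped_wrong <- s[-none]
dropped_right <- s[regexpr("xyz", s, fixed = TRUE) == -1]
```

With the logical subscript, the "no matches" case needs no special handling: negating it simply keeps every element.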
>
> On Fri, Mar 20, 2009 at 9:20 PM, Max Bane <max.bane at gmail.com> wrote:
>> Thanks, Jim (and Mark, who replied off-list) -- that does the trick. I
>> had tried using an index expression with grep, but that failed in the
>> same way as the subset method. It is still rather mysterious why this
>> works with regexpr but not with grep :)
>>
>> -Max
>>
>> On Fri, Mar 20, 2009 at 7:57 PM, jim holtman <jholtman at gmail.com> wrote:
>>> Try using regexpr instead:
>>>
>>>> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
>>> + give(mysister,theoldbook) P 47.0 56016 0.1543651
>>> + donate(her,thebook) P 48.7 68928 0.1899471
>>> + give(mysister,thebook) P 73.4 80136 0.2208333
>>> + donate(mysister,theoldbook) P 79.0 57024 0.1571429
>>> + give(mysister,it) P 100.0 132408 0.3648810
>>> + give(her,it) P 100.0 157248 0.4333333
>>> + donate(mysister,it) P 100.0 130720 0.3602293
>>> + give(her,thebook) P 5.7 65232 0.1797619
>>> + donate(her,it) P 100.0 152064 0.4190476
>>> + give(mylittlesister,thebook) P 91.8 112032 0.3087302
>>> + donate(mylittlesister,thebook) P 98.4 114048 0.3142857
>>> + donate(mysister,thebook) P 94.4 82800 0.2281746"), header=TRUE)
>>>> # use regexpr
>>>> matched <- regexpr("her", x$input) != -1
>>>> notMatched <- !matched
>>>> x[matched,]
>>> input output corpusFreq pvolOT pvolRatioOT
>>> 2 donate(her,thebook) P 48.7 68928 0.1899471
>>> 6 give(her,it) P 100.0 157248 0.4333333
>>> 8 give(her,thebook) P 5.7 65232 0.1797619
>>> 9 donate(her,it) P 100.0 152064 0.4190476
>>>> x[notMatched,]
>>> input output corpusFreq pvolOT pvolRatioOT
>>> 1 give(mysister,theoldbook) P 47.0 56016 0.1543651
>>> 3 give(mysister,thebook) P 73.4 80136 0.2208333
>>> 4 donate(mysister,theoldbook) P 79.0 57024 0.1571429
>>> 5 give(mysister,it) P 100.0 132408 0.3648810
>>> 7 donate(mysister,it) P 100.0 130720 0.3602293
>>> 10 give(mylittlesister,thebook) P 91.8 112032 0.3087302
>>> 11 donate(mylittlesister,thebook) P 98.4 114048 0.3142857
>>> 12 donate(mysister,thebook) P 94.4 82800 0.2281746
>>>>
>>>>
>>>
>>>
>>> On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.bane at gmail.com> wrote:
>>>> I have some data that looks like this:
>>>>
>>>>> dataP
>>>>                                 input output corpusFreq pvolOT pvolRatioOT
>>>> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
>>>> 5               donate(her, the book)      P       48.7  68928   0.1899471
>>>> 9           give(my sister, the book)      P       73.4  80136   0.2208333
>>>> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
>>>> 20                give(my sister, it)      P      100.0 132408   0.3648810
>>>> 21                      give(her, it)      P      100.0 157248   0.4333333
>>>> 24              donate(my sister, it)      P      100.0 130720   0.3602293
>>>> 28                give(her, the book)      P        5.7  65232   0.1797619
>>>> 31                    donate(her, it)      P      100.0 152064   0.4190476
>>>> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
>>>> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
>>>> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>>>>
>>>> I would like to extract the subset of this data in which the value of
>>>> the "input" column contains the substring "her". I was thinking I
>>>> could use the grep function to test for the presence of this
>>>> substring. For instance, if a string does not contain it, then grep
>>>> returns a zero length integer vector:
>>>>
>>>>> grep("her", "give(my sister, it)")
>>>> integer(0)
>>>>
>>>> And if the string does contain the substring, grep returns a vector of
>>>> the indices of the elements in which the substring was found:
>>>>
>>>>> grep("her", "give(her, it)")
>>>> [1] 1
>>>>
>>>> I can thus test for the presence of the substring by converting the
>>>> length of the result of grep into a boolean:
>>>>
>>>>> as.logical(length(grep("her", "give(my sister, it)")))
>>>> [1] FALSE
>>>>> as.logical(length(grep("her", "give(her, it)")))
>>>> [1] TRUE
>>>>> as.logical(length(grep("her", "give(her, it)"))) == TRUE
>>>> [1] TRUE
>>>>> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
>>>> [1] FALSE
>>>>
>>>> I would like to use this test as a criterion for constructing a subset
>>>> of my data. Unfortunately, it does not work:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
>>>>                                 input output corpusFreq pvolOT pvolRatioOT
>>>> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
>>>> 5               donate(her, the book)      P       48.7  68928   0.1899471
>>>> 9           give(my sister, the book)      P       73.4  80136   0.2208333
>>>> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
>>>> 20                give(my sister, it)      P      100.0 132408   0.3648810
>>>> 21                      give(her, it)      P      100.0 157248   0.4333333
>>>> 24              donate(my sister, it)      P      100.0 130720   0.3602293
>>>> 28                give(her, the book)      P        5.7  65232   0.1797619
>>>> 31                    donate(her, it)      P      100.0 152064   0.4190476
>>>> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
>>>> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
>>>> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>>>>
>>>> As you can see, I get back the whole data set, rather than just the
>>>> subset where the input column contains "her". And if I invert the
>>>> test, which I would expect to give the subset *not* containing "her",
>>>> I instead get the empty subset, rather mysteriously:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
>>>> [1] input output corpusFreq pvolOT pvolRatioOT
>>>> <0 rows> (or 0-length row.names)
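What the two subset calls above evaluate can be sketched as follows, with a made-up three-row stand-in `d` for dataP (the names `d`, `n_matches` and `cond` are illustrative). Inside subset(), `input` is the entire column, so grep() is called once and length() reduces its result to a single number.

```r
# Illustrative stand-in for dataP.
d <- data.frame(
  input = c("give(my sister, it)", "give(her, it)", "donate(her, it)"),
  stringsAsFactors = FALSE
)

# grep sees the whole column and returns the matching row positions;
# length() then collapses them to one count.
n_matches <- length(grep("her", d$input))

# A single logical value, not one per row.
cond <- as.logical(n_matches)

# A length-one TRUE is recycled to every row, so all rows are kept;
# a length-one FALSE keeps none.
all_rows <- subset(d, as.logical(length(grep("her", input))) == TRUE)
no_rows <- subset(d, as.logical(length(grep("her", input))) == FALSE)
```

So as long as at least one row contains "her", the first call keeps everything and the inverted call keeps nothing, which matches the output shown.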
>>>>
>>>> The type of the input column is definitely character. To be double sure:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
>>>>
>>>> does the same thing.
>>>>
>>>> Could somebody with more R experience than I have please explain what
>>>> I am doing wrong here? I'll be much obliged.
>>>>
>>>> --
>>>> Max Bane
>>>> PhD Student, Linguistics
>>>> University of Chicago
>>>> bane at uchicago.edu
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?