[R] Subsetting data where the condition is that the value of some column contains some substring
David Winsemius
dwinsemius at comcast.net
Sat Mar 21 02:49:03 CET 2009
If you use Jim's example and use grep() with ordinary and then
negative indexing, you get these results:
> x[grep("her", x$input),]
input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook) P 48.7 68928 0.1899471
6 give(her,it) P 100.0 157248 0.4333333
8 give(her,thebook) P 5.7 65232 0.1797619
9 donate(her,it) P 100.0 152064 0.4190476
> x[-grep("her", x$input),]
input output corpusFreq pvolOT pvolRatioOT
1 give(mysister,theoldbook) P 47.0 56016 0.1543651
3 give(mysister,thebook) P 73.4 80136 0.2208333
4 donate(mysister,theoldbook) P 79.0 57024 0.1571429
5 give(mysister,it) P 100.0 132408 0.3648810
7 donate(mysister,it) P 100.0 130720 0.3602293
10 give(mylittlesister,thebook) P 91.8 112032 0.3087302
11 donate(mylittlesister,thebook) P 98.4 114048 0.3142857
12 donate(mysister,thebook) P 94.4 82800 0.2281746
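
One caveat with the negative-indexing form, which is the special case Jim mentions below: when the pattern matches nothing, grep() returns integer(0), and a negated empty index selects no rows rather than all of them. The following is a minimal sketch using a small made-up data frame (only the column name `input` comes from the thread) and assuming a version of R that provides grepl():

```r
# Hypothetical toy data; only the column name `input` mirrors the thread.
x <- data.frame(input = c("give(mysister,it)", "give(her,it)", "donate(her,it)"),
                n = 1:3,
                stringsAsFactors = FALSE)

# Pattern with matches: negative indexing drops the matching rows.
no_her <- x[-grep("her", x$input), ]

# Pattern with no matches: grep() returns integer(0), and x[-integer(0), ]
# selects zero rows instead of keeping every row.
idx <- grep("zzz", x$input)
wrong <- x[-idx, ]

# A logical index has one entry per row, so the empty case needs no special handling.
keep <- !grepl("zzz", x$input)
right <- x[keep, ]
```

Here `wrong` has no rows while `right` keeps all three, so the logical form is the safer one when the pattern may fail to match.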
--
David.
On Mar 20, 2009, at 9:39 PM, jim holtman wrote:
> grep and regexpr return different values. regexpr returns a vector of
> the same length as the input, which can be used to construct a
> logical subscript. grep returns a vector of the indices of only the
> matching elements, so it can have length zero if there are no matches.
> That makes it harder to create the subsets: you have to test for zero
> length and then do something special.
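
The difference in return shape can be seen on a short character vector. This is only an illustrative sketch; the vector `s` is made up for the purpose:

```r
# Hypothetical three-element input.
s <- c("give(mysister,it)", "give(her,it)", "donate(her,thebook)")

g <- grep("her", s)      # indices of the matching elements only
r <- regexpr("her", s)   # one entry per element; -1 where there is no match

length(g)                # number of matches, here 2
length(r)                # always length(s), here 3

matched <- r != -1       # logical vector, usable directly as a row subscript
```

Because `matched` always has one entry per input element, both `matched` and `!matched` index correctly even when nothing matches.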
>
> On Fri, Mar 20, 2009 at 9:20 PM, Max Bane <max.bane at gmail.com> wrote:
>> Thanks, Jim (and Mark, who replied off-list) -- that does the trick. I
>> had tried using an index expression with grep, but that failed in the
>> same way as the subset method. It is still rather mysterious why this
>> works with regexpr but not with grep :)
>>
>> -Max
>>
>> On Fri, Mar 20, 2009 at 7:57 PM, jim holtman <jholtman at gmail.com> wrote:
>>> Try using regexpr instead:
>>>
>>>> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
>>> + give(mysister,theoldbook) P 47.0 56016 0.1543651
>>> + donate(her,thebook) P 48.7 68928 0.1899471
>>> + give(mysister,thebook) P 73.4 80136 0.2208333
>>> + donate(mysister,theoldbook) P 79.0 57024 0.1571429
>>> + give(mysister,it) P 100.0 132408 0.3648810
>>> + give(her,it) P 100.0 157248 0.4333333
>>> + donate(mysister,it) P 100.0 130720 0.3602293
>>> + give(her,thebook) P 5.7 65232 0.1797619
>>> + donate(her,it) P 100.0 152064 0.4190476
>>> + give(mylittlesister,thebook) P 91.8 112032 0.3087302
>>> + donate(mylittlesister,thebook) P 98.4 114048 0.3142857
>>> + donate(mysister,thebook) P 94.4 82800 0.2281746"), header=TRUE)
>>>> # use regexpr
>>>> matched <- regexpr("her", x$input) != -1
>>>> notMatched <- !matched
>>>> x[matched,]
>>> input output corpusFreq pvolOT pvolRatioOT
>>> 2 donate(her,thebook) P 48.7 68928 0.1899471
>>> 6 give(her,it) P 100.0 157248 0.4333333
>>> 8 give(her,thebook) P 5.7 65232 0.1797619
>>> 9 donate(her,it) P 100.0 152064 0.4190476
>>>> x[notMatched,]
>>> input output corpusFreq pvolOT pvolRatioOT
>>> 1 give(mysister,theoldbook) P 47.0 56016 0.1543651
>>> 3 give(mysister,thebook) P 73.4 80136 0.2208333
>>> 4 donate(mysister,theoldbook) P 79.0 57024 0.1571429
>>> 5 give(mysister,it) P 100.0 132408 0.3648810
>>> 7 donate(mysister,it) P 100.0 130720 0.3602293
>>> 10 give(mylittlesister,thebook) P 91.8 112032 0.3087302
>>> 11 donate(mylittlesister,thebook) P 98.4 114048 0.3142857
>>> 12 donate(mysister,thebook) P 94.4 82800 0.2281746
>>>>
>>>>
>>>
>>>
>>> On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.bane at gmail.com> wrote:
>>>> I have some data that looks like this:
>>>>
>>>>> dataP
>>>> input output corpusFreq pvolOT pvolRatioOT
>>>> 1 give(my sister, the old book) P 47.0 56016 0.1543651
>>>> 5 donate(her, the book) P 48.7 68928 0.1899471
>>>> 9 give(my sister, the book) P 73.4 80136 0.2208333
>>>> 13 donate(my sister, the old book) P 79.0 57024 0.1571429
>>>> 20 give(my sister, it) P 100.0 132408 0.3648810
>>>> 21 give(her, it) P 100.0 157248 0.4333333
>>>> 24 donate(my sister, it) P 100.0 130720 0.3602293
>>>> 28 give(her, the book) P 5.7 65232 0.1797619
>>>> 31 donate(her, it) P 100.0 152064 0.4190476
>>>> 35 give(my little sister, the book) P 91.8 112032 0.3087302
>>>> 39 donate(my little sister, the book) P 98.4 114048 0.3142857
>>>> 43 donate(my sister, the book) P 94.4 82800 0.2281746
>>>>
>>>> I would like to extract the subset of this data in which the value of
>>>> the "input" column contains the substring "her". I was thinking I
>>>> could use the grep function to test for the presence of this
>>>> substring. For instance, if a string does not contain it, then grep
>>>> returns a zero length integer vector:
>>>>
>>>>> grep("her", "give(my sister, it)")
>>>> integer(0)
>>>>
>>>> And if the string does contain the substring, grep returns a vector of
>>>> the indices where the substring is located:
>>>>
>>>>> grep("her", "give(her, it)")
>>>> [1] 1
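
A note on reading that result: the 1 is the index of the matching element in the input vector (here a vector of length one), not a character position inside the string. A small sketch with a made-up two-element vector shows the distinction; regexpr() is what reports the position within a string:

```r
# Hypothetical two-element input; the second element contains "her".
v <- c("give(my sister, it)", "give(her, it)")

elem <- grep("her", v)        # which elements match: element 2
pos  <- regexpr("her", v[2])  # where in that string the match starts: character 6
```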
>>>>
>>>> I can thus test for the presence of the substring by converting the
>>>> length of the result of grep into a boolean:
>>>>
>>>>> as.logical(length(grep("her", "give(my sister, it)")))
>>>> [1] FALSE
>>>>> as.logical(length(grep("her", "give(her, it)")))
>>>> [1] TRUE
>>>>> as.logical(length(grep("her", "give(her, it)"))) == TRUE
>>>> [1] TRUE
>>>>> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
>>>> [1] FALSE
>>>>
>>>> I would like to use this test as a criterion for constructing a subset
>>>> of my data. Unfortunately, it does not work:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
>>>> input output corpusFreq pvolOT pvolRatioOT
>>>> 1 give(my sister, the old book) P 47.0 56016 0.1543651
>>>> 5 donate(her, the book) P 48.7 68928 0.1899471
>>>> 9 give(my sister, the book) P 73.4 80136 0.2208333
>>>> 13 donate(my sister, the old book) P 79.0 57024 0.1571429
>>>> 20 give(my sister, it) P 100.0 132408 0.3648810
>>>> 21 give(her, it) P 100.0 157248 0.4333333
>>>> 24 donate(my sister, it) P 100.0 130720 0.3602293
>>>> 28 give(her, the book) P 5.7 65232 0.1797619
>>>> 31 donate(her, it) P 100.0 152064 0.4190476
>>>> 35 give(my little sister, the book) P 91.8 112032 0.3087302
>>>> 39 donate(my little sister, the book) P 98.4 114048 0.3142857
>>>> 43 donate(my sister, the book) P 94.4 82800 0.2281746
>>>>
>>>> As you can see, I get back the whole data set, rather than just the
>>>> subset where the input column contains "her". And if I invert the
>>>> test, which I would expect to give the subset *not* containing "her",
>>>> I instead get the empty subset, rather mysteriously:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
>>>> [1] input output corpusFreq pvolOT pvolRatioOT
>>>> <0 rows> (or 0-length row.names)
>>>>
>>>> The type of the input column is definitely character. To be double sure:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
>>>>
>>>> does the same thing.
>>>>
>>>> Could somebody with more R experience than I have please explain what
>>>> I am doing wrong here? I'll be much obliged.
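
A likely explanation for the all-or-nothing behaviour: inside subset(), `input` is the whole column, so length(grep("her", input)) is a single number (the count of matching rows) and the condition collapses to one TRUE or FALSE, which is then recycled across every row. The sketch below reproduces this on a small made-up data frame standing in for dataP, and contrasts it with a per-row condition built from regexpr():

```r
# Hypothetical stand-in for dataP; only the column name `input` mirrors the thread.
dataP <- data.frame(input = c("give(my sister, it)", "give(her, it)", "donate(her, it)"),
                    n = 1:3,
                    stringsAsFactors = FALSE)

# The original test yields ONE logical value for the whole column.
cond <- as.logical(length(grep("her", dataP$input))) == TRUE
length(cond)   # 1, not nrow(dataP)

# A single TRUE selects every row; a single FALSE selects none.
all_rows <- subset(dataP, as.logical(length(grep("her", input))) == TRUE)
no_rows  <- subset(dataP, as.logical(length(grep("her", input))) == FALSE)

# A condition with one value per row gives the intended subset.
her_rows <- subset(dataP, regexpr("her", input) != -1)
```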
>>>>
>>>
>>
David Winsemius, MD
Heritage Laboratories
West Hartford, CT