[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Jeff Newmiller
jdnewmil at dcn.davis.CA.us
Thu Jul 9 18:34:28 CEST 2015
I think grep is better suited to this:
zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
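
A minimal sketch of that one-liner run end to end, using the v2 and v3 columns from the original question. v1 and v4 are left out here, so the columns are selected by name rather than by position. The names pattern, row.text, whole.pattern and v5.whole are hypothetical, and the \\b variant at the end is an added assumption, not part of the suggestion above. It is only meant to show that a plain alternation matches substrings as well as whole words.

```r
# Text columns copied from the original question (v1 and v4 omitted)
zz <- data.frame(
  v2 = c("white bird", "blue bird", "green turtle", "quick brown fox",
         "big black dog", "waffle the hamster", "benny likes food a lot",
         "hello world", "yellow giraffe with a long neck", "black bear"),
  v3 = c("harry potter", "hermione grainger", "ronald weasley",
         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
         "white dress robes", "gandalf the white", "gandalf the grey"),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green", "turtle", "gandalf")

# One alternation pattern: "red|green|turtle|gandalf"
pattern <- paste0(alarm.words, collapse = "|")

# One string per row, built from the two text columns
row.text <- do.call(paste, zz[, c("v2", "v3")])

# TRUE where any alarm word occurs anywhere in the row's text
zz$v5 <- grepl(pattern, row.text)

# Assumed variant: \\b anchors restrict matches to whole words,
# so "red" no longer matches inside a longer word such as "hundred"
whole.pattern <- paste0("\\b(", pattern, ")\\b")
zz$v5.whole <- grepl(whole.pattern, row.text)
```

On this data both columns agree. They would differ only for text in which an alarm word is embedded in a longer word.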
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>Here's a way to do it that uses %in% (i.e. match() ) and uses only a
>single, not a double, loop. It should be more efficient.
>
>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>+ function(x)any(x %in% alarm.words))
>
> [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
>
>The idea is to paste the strings in each row (do.call allows an
>arbitrary number of columns) into a single string and then use
>strsplit to break the string into individual "words" on whitespace.
>Then the matching is vectorized with the any( %in% ... ) call.
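
The three steps described above can be sketched separately, which makes the intermediate objects easier to inspect. This uses three rows taken from the example data. The names row.text, words and hits are hypothetical, and vapply() is swapped in for sapply() only so that the result type is guaranteed to be logical.

```r
# Three rows taken from the example data in the original question
zz <- data.frame(v2 = c("white bird", "green turtle", "black bear"),
                 v3 = c("harry potter", "ronald weasley", "gandalf the grey"),
                 stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# Step 1: paste the text columns into one string per row
row.text <- do.call(paste, zz[, c("v2", "v3")])

# Step 2: split each string into words on runs of whitespace (a list results)
words <- strsplit(row.text, "[[:space:]]+")

# Step 3: a row is flagged if any of its words is an alarm word
hits <- vapply(words, function(x) any(x %in% alarm.words), logical(1))
```

Because %in% compares whole words, a word with attached punctuation (for example "red,") would not be flagged by this approach.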
>
>Cheers,
>Bert
>Bert Gunter
>
>"Data is not information. Information is not knowledge. And knowledge
>is certainly not wisdom."
> -- Clifford Stoll
>
>
>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>> Dear Chris,
>>
>> If I understand correctly what you want, how about the following?
>>
>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>> zz[rows, ]
>>
>>           v1                              v2                v3 v4
>> 3  -1.022329                    green turtle    ronald weasley  2
>> 6   0.336599              waffle the hamster        red sparks  1
>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>> 10  1.130622                      black bear  gandalf the grey  2
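
To see what the nested call above is doing, here is a small sketch on three rows from the example data. The names x and per.word are hypothetical. For a single row, sapply() applies grepl() once per alarm word and returns one column per word. any() then collapses that matrix to a single TRUE or FALSE, and apply() repeats this for every row.

```r
# Three rows taken from the example data in the original question
zz <- data.frame(v2 = c("white bird", "green turtle", "black bear"),
                 v3 = c("harry potter", "ronald weasley", "gandalf the grey"),
                 stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# Inner step for one row: a 2 x 4 logical matrix
# (one row per text column, one column per alarm word)
x <- unlist(zz[2, c("v2", "v3")])
per.word <- sapply(alarm.words, grepl, x = x)

# Outer step: repeat for every row and reduce each matrix with any()
rows <- apply(zz[, c("v2", "v3")], 1,
              function(x) any(sapply(alarm.words, grepl, x = x)))

# Keep only the flagged records
zz[rows, ]
```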
>>
>> I hope this helps,
>> John
>>
>> ------------------------------------------------
>> John Fox, Professor
>> McMaster University
>> Hamilton, Ontario, Canada
>> http://socserv.mcmaster.ca/jfox/
>>
>>
>> On Wed, 08 Jul 2015 22:23:37 -0400
>> "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>>> Running R 3.1.1 on windows 7
>>>
>>> I want to identify as a case any record in a dataframe that contains any
>>> of several keywords in any of several variables.
>>>
>>> Example:
>>>
>>> # create a dataframe with 4 variables and 10 records
>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>> "big black dog", "waffle the hamster", "benny likes food a lot",
>>> "hello world", "yellow giraffe with a long neck", "black bear")
>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>> "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>> "white dress robes", "gandalf the white", "gandalf the grey")
>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>> stringsAsFactors=FALSE)
>>> str(zz)
>>> zz
>>>
>>> # here are the keywords
>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>
>>> # For each row/record, I want to test whether the string in v2 or the
>>> # string in v3 contains any of the strings in alarm.words. And then if so,
>>> # set zz$v5=TRUE for that record.
>>>
>>> # I'm thinking the str_detect function in the stringr package ought to
>>> # be able to help, perhaps with some use of apply over the rows, but I
>>> # obviously misunderstand something about how str_detect works
>>>
>>> library(stringr)
>>>
>>> str_detect(zz[,2:3], alarm.words) # error: the target of the search
>>> # must be a vector, not multiple
>>> # columns
>>>
>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>>
>>> str_detect(zz[,2], alarm.words) # error, length of alarm.words
>>> # is less than the number of
>>> # rows I am using for the
>>> # comparison
>>>
>>> str_detect(zz[1:4,2], alarm.words) # works as hoped when confining nrows
>>> # to the length of alarm.words
>>> length(alarm.words)
>>>
>>> str_detect(zz, alarm.words) # obviously not right
>>>
>>> # maybe I need apply() ?
>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>
>>> apply(zz[,2], 1, my.f) # again, a mismatch in lengths
>>> # between alarm.words and that
>>> # in which I am searching for
>>> # matching strings
>>>
>>> apply(zz, 2, my.f) # now I'm getting somewhere
>>> apply(zz[1:4,], 2, my.f) # but still only works with 4
>>> # rows of the dataframe
>>>
>>>
>>> # perhaps %in% could do the job?
>>>
>>> Appreciate any advice.
>>>
>>> --Chris Ryan
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.