[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
John Fox
jfox at mcmaster.ca
Thu Jul 9 21:24:00 CEST 2015
Dear Christopher,
My usual orientation to this kind of one-off problem is that I'm looking for a simple correct solution. Computing time is usually much smaller than programming time.
That said, Bert Gunter's solution was about 5 times faster than mine in a simple check that I ran with microbenchmark, and Jeff Newmiller's solution was about 10 times faster. Both Bert's and Jeff's (eventual) solutions protect against partial (rather than full-word) matches, while mine doesn't (though it could easily be modified to do that).
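A minimal sketch of that modification, assuming the zz and alarm.words objects from Chris's original post (quoted below, with v1 and v4 left out since they play no part in the matching); the name patterns is introduced here purely for illustration:

```r
# Data from Chris's example; only the two character columns are needed here
zz <- data.frame(
  v2 = c("white bird", "blue bird", "green turtle", "quick brown fox",
         "big black dog", "waffle the hamster", "benny likes food a lot",
         "hello world", "yellow giraffe with a long neck", "black bear"),
  v3 = c("harry potter", "hermione grainger", "ronald weasley",
         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
         "white dress robes", "gandalf the white", "gandalf the grey"),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green", "turtle", "gandalf")

# Wrap each keyword in \b (word boundary) so that "red" no longer
# matches inside a longer word such as "barred"
patterns <- paste0("\\b", alarm.words, "\\b")

# Same row-wise structure as before, but with the bounded patterns
rows <- apply(zz[, c("v2", "v3")], 1,
              function(x) any(sapply(patterns, grepl, x = x)))

unname(which(rows))                # rows 3, 6, 9 and 10, as before

# The boundary markers are what rule out partial matches
grepl("red", "barred")             # TRUE:  substring match
grepl("\\bred\\b", "barred")       # FALSE: whole-word match only
```

This keeps the double loop, so it should not be expected to be any faster; it only closes the partial-match gap.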
Best,
John
> -----Original Message-----
> From: Christopher W Ryan [mailto:cryan at binghamton.edu]
> Sent: July-09-15 2:49 PM
> To: Bert Gunter
> Cc: Jeff Newmiller; R Help; John Fox
> Subject: Re: [R] detecting any element in a vector of strings, appearing
> anywhere in any of several character variables in a dataframe
>
> Thanks everyone. John's original solution worked great. And with
> 27,000 records, 65 alarm.words, and 6 columns to search, it takes only
> about 15 seconds. That is certainly adequate for my needs. But I
> will try out the other strategies too.
>
> And thanks also for lots of new R things to learn--grep, grepl,
> do.call . . . that's always a bonus!
>
> --Chris Ryan
>
> On Thu, Jul 9, 2015 at 1:52 PM, Bert Gunter <bgunter.4567 at gmail.com>
> wrote:
> > Yup, that does it. Let grep figure out what's a word rather than doing
> > it manually. Forgot about "\b"
> >
> > Cheers,
> > Bert
> >
> >
> > Bert Gunter
> >
> > "Data is not information. Information is not knowledge. And knowledge
> > is certainly not wisdom."
> > -- Clifford Stoll
> >
> >
> > On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
> > <jdnewmil at dcn.davis.ca.us> wrote:
> >> Just add a word break marker before and after:
> >>
> >> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
> >> ---------------------------------------------------------------------------
> >> Jeff Newmiller                        The     .....       .....  Go Live...
> >> DCN:<jdnewmil at dcn.davis.ca.us>     Basics: ##.#.       ##.#.  Live Go...
> >>                                       Live:   OO#.. Dead: OO#..  Playing
> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> >> ---------------------------------------------------------------------------
> >> Sent from my phone. Please excuse my brevity.
> >>
> >> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com>
> wrote:
> >>>Jeff:
> >>>
> >>>Well, it would be much better (no loops!) except, I think, for one
> >>>issue: "red" would match "barred" and I don't think that this is what
> >>>is wanted: the matches should be on whole "words" not just string
> >>>patterns.
> >>>
> >>>So you would need to fix up the matching pattern to make this work,
> >>>but it may be a little tricky, as arbitrary whitespace characters,
> >>>e.g. " " or "\n" etc. could be in the strings to be matched separating
> >>>the words or ending the "sentence." I'm sure it can be done, but I'll
> >>>leave it to you or others to figure it out.
> >>>
> >>>Of course, if my diagnosis is wrong or silly, please point this out.
> >>>
> >>>Cheers,
> >>>Bert
> >>>
> >>>
> >>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
> >>><jdnewmil at dcn.davis.ca.us> wrote:
> >>>> I think grep is better suited to this:
> >>>>
> >>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
> >>>>
> >>>> Sent from my phone. Please excuse my brevity.
> >>>>
> >>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> >>>>>Here's a way to do it that uses %in% (i.e. match() ) and uses only a
> >>>>>single, not a double, loop. It should be more efficient.
> >>>>>
> >>>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
> >>>>>+ function(x)any(x %in% alarm.words))
> >>>>>
> >>>>> [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
> >>>>>
> >>>>>The idea is to paste the strings in each row (do.call allows an
> >>>>>arbitrary number of columns) into a single string and then use
> >>>>>strsplit to break the string into individual "words" on whitespace.
> >>>>>Then the matching is vectorized with the any( %in% ... ) call.
> >>>>>
> >>>>>Cheers,
> >>>>>Bert
> >>>>>
> >>>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
> >>>>>> Dear Chris,
> >>>>>>
> >>>>>> If I understand correctly what you want, how about the following?
> >>>>>>
> >>>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
> >>>>>>> zz[rows, ]
> >>>>>>
> >>>>>>           v1                              v2                v3 v4
> >>>>>> 3  -1.022329                    green turtle    ronald weasley  2
> >>>>>> 6   0.336599              waffle the hamster        red sparks  1
> >>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
> >>>>>> 10  1.130622                      black bear  gandalf the grey  2
> >>>>>>
> >>>>>> I hope this helps,
> >>>>>> John
> >>>>>>
> >>>>>> ------------------------------------------------
> >>>>>> John Fox, Professor
> >>>>>> McMaster University
> >>>>>> Hamilton, Ontario, Canada
> >>>>>> http://socserv.mcmaster.ca/jfox/
> >>>>>>
> >>>>>>
> >>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
> >>>>>> "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
> >>>>>>> Running R 3.1.1 on windows 7
> >>>>>>>
> >>>>>>> I want to identify as a case any record in a dataframe that contains any
> >>>>>>> of several keywords in any of several variables.
> >>>>>>>
> >>>>>>> Example:
> >>>>>>>
> >>>>>>> # create a dataframe with 4 variables and 10 records
> >>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
> >>>>>>>         "big black dog", "waffle the hamster", "benny likes food a lot",
> >>>>>>>         "hello world", "yellow giraffe with a long neck", "black bear")
> >>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
> >>>>>>>         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
> >>>>>>>         "white dress robes", "gandalf the white", "gandalf the grey")
> >>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
> >>>>>>>                  stringsAsFactors=FALSE)
> >>>>>>> str(zz)
> >>>>>>> zz
> >>>>>>>
> >>>>>>> # here are the keywords
> >>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
> >>>>>>>
> >>>>>>> # For each row/record, I want to test whether the string in v2 or the
> >>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
> >>>>>>> # set zz$v5=TRUE for that record.
> >>>>>>>
> >>>>>>> # I'm thinking the str_detect function in the stringr package ought to
> >>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
> >>>>>>> # obviously misunderstand something about how str_detect works
> >>>>>>>
> >>>>>>> library(stringr)
> >>>>>>>
> >>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
> >>>>>>>                                      # must be a vector, not multiple
> >>>>>>>                                      # columns
> >>>>>>>
> >>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
> >>>>>>>
> >>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
> >>>>>>>                                      # is less than the number of
> >>>>>>>                                      # rows I am using for the
> >>>>>>>                                      # comparison
> >>>>>>>
> >>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
> >>>>>>> length(alarm.words)                  # confining nrows
> >>>>>>>                                      # to the length of alarm.words
> >>>>>>>
> >>>>>>> str_detect(zz, alarm.words) # obviously not right
> >>>>>>>
> >>>>>>> # maybe I need apply() ?
> >>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
> >>>>>>>
> >>>>>>> apply(zz[,2], 1, my.f) # again, a mismatch in lengths
> >>>>>>> # between alarm.words and that
> >>>>>>> # in which I am searching for
> >>>>>>> # matching strings
> >>>>>>>
> >>>>>>> apply(zz, 2, my.f) # now I'm getting somewhere
> >>>>>>> apply(zz[1:4,], 2, my.f) # but still only works with 4
> >>>>>>> # rows of the dataframe
> >>>>>>>
> >>>>>>>
> >>>>>>> # perhaps %in% could do the job?
> >>>>>>>
> >>>>>>> Appreciate any advice.
> >>>>>>>
> >>>>>>> --Chris Ryan
> >>>>>>>
> >>>>>>> ______________________________________________
> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>>>> and provide commented, minimal, self-contained, reproducible code.