[Rd] extending strsplit(): supply pattern to keep, not to split by

Gabor Grothendieck ggrothendieck at gmail.com
Tue Apr 4 19:39:33 CEST 2006


On 4/4/06, Bill Dunlap <bill at insightful.com> wrote:
> On Tue, 4 Apr 2006, Gabor Grothendieck wrote:
>
> > gsubfn in package gsubfn can do this.  See the examples
> > in ?gsubfn
>
> Thanks.  gsubfn looks useful, but may be overkill
> for this, and it isn't vectorized.  To do what

gsubfn is vectorized.  It's just that you are not using the output of
gsubfn in this case.

> strsplit(keep=T) would do, I think you need to do something like:
>
>   > findMatches<-function(strings, pattern){
>        lapply(strings, function(string){
>               v <- character()
>               gsubfn(pattern, function(x,...)v<<-c(v,x), string)
>               v})
>     }
>   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
>   > findMatches(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern)
>   [[1]]
>   [1] "12" "34" "56" "89" "12"
>
>   [[2]]
>   [1] "1.2" ".4"  "1."  "1e3"
>
> Is this worth encapsulating in a standard R function?

I will likely add a wrapper to the gsubfn package for this.

> If so, is doing via an extra argument to strsplit()
> a reasonable way to do it?

My current thought was to create a strapply function to do that.
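
For illustration only, here is a minimal sketch of what such a wrapper
might look like using just base R.  It assumes a version of R that
provides regmatches(); the name keep_matches is hypothetical and is not
the gsubfn API.

```r
# Hypothetical helper (name and signature are illustrative only):
# for each input string, return the substrings that match `pattern`.
keep_matches <- function(strings, pattern) {
  # gregexpr() locates every match in each string;
  # regmatches() extracts the matched substrings as a list.
  regmatches(strings, gregexpr(pattern, strings))
}

number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
res <- keep_matches(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern)
print(res)
```

Because gregexpr() is itself vectorized over its input, no explicit
loop over the strings is needed, and a string with no matches yields a
zero-length character vector.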

>
>   > strsplit(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern, keep=T)
>   [[1]]:
>   [1] "12" "34" "56" "89" "12"
>
>   [[2]]:
>   [1] "1.2" ".4"  "1."  "1e3"
>
>
> > On 4/4/06, Bill Dunlap <bill at insightful.com> wrote:
> > > strsplit() is a convenient way to get a
> > > list of items from a string when you
> > > have a regular expression for what is not
> > > an item.  E.g.,
> > >
> > >   > strsplit("1.2, 34, 1.7e-2", split="[ ,] *")
> > >   [[1]]:
> > >   [1] "1.2"    "34"     "1.7e-2"
> > >
> > > However, sometimes it is more convenient to
> > > give a pattern for the items you do want.
> > > E.g., suppose you want to pull all the numbers
> > > out of a string which contains a mix of numbers
> > > and words.  Making a pattern for what a
> > > number is is simpler than making a pattern
> > > for what may come between the numbers.
> > >   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
> > >
> > > I propose adding a keep=FALSE argument to
> > > strsplit() to do this.  If keep is FALSE,
> > > then the split argument matches the stuff to
> > > omit from the output; if keep is TRUE then
> > > split matches the stuff to put into the
> > > output.  Then we could do the following to
> > > get a list of all the numbers in a string
> > > (done in a version of strsplit() I'm working on
> > > for S-PLUS):
> > >
> > >   > strsplit("1.2, 34, 1.7e-2", split=number.pattern,keep=TRUE)
> > >   [[1]]:
> > >   [1] "1.2"    "34"     "1.7e-2"
> > >
> > >   > strsplit("Ibuprofin 200mg", split=number.pattern,keep=TRUE)
> > >   [[1]]:
> > >   [1] "200"
> > >
> > > Is this a reasonable thing to want strsplit to do?
> > > Is this a reasonable parameterization of it?
>
> ----------------------------------------------------------------------------
> Bill Dunlap
> Insightful Corporation
> bill at insightful dot com
> 360-428-8146
>
>  "All statements in this message represent the opinions of the author and do
>  not necessarily reflect Insightful Corporation policy or position."
>


