[Rd] extending strsplit(): supply pattern to keep, not to split by

Bill Dunlap bill at insightful.com
Tue Apr 4 19:10:16 CEST 2006


On Tue, 4 Apr 2006, Gabor Grothendieck wrote:

> gsubfn in package gsubfn can do this.  See the examples
> in ?gsubfn

Thanks.  gsubfn looks useful, but it may be overkill
for this, and it isn't vectorized.  To do what
strsplit(keep=T) would do, I think you need something like:

   > library(gsubfn)
   > findMatches<-function(strings, pattern){
        lapply(strings, function(string){
               v <- character()
               gsubfn(pattern, function(x,...)v<<-c(v,x), string)
               v})
     }
   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
   > findMatches(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern)
   [[1]]
   [1] "12" "34" "56" "89" "12"

   [[2]]
   [1] "1.2" ".4"  "1."  "1e3"
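
For comparison, a vectorized version can be sketched with base R alone,
assuming a version of R that provides regmatches() alongside gregexpr().
The helper name keepMatches is made up for this sketch; it is not an
existing function.

```r
# Hypothetical helper (the name is an assumption, not part of base R):
# return the substrings of each element of `strings` that match `pattern`.
keepMatches <- function(strings, pattern) {
    # gregexpr() finds every match position in every string;
    # regmatches() extracts the matched text, one character vector per string.
    regmatches(strings, gregexpr(pattern, strings))
}

number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
keepMatches(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern)
```

Both base functions are vectorized over the input strings, so no
explicit lapply() or <<- accumulation is needed.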

Is this worth encapsulating in a standard R function?
If so, is doing it via an extra argument to strsplit()
a reasonable way to do it?

   > strsplit(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern, keep=T)
   [[1]]:
   [1] "12" "34" "56" "89" "12"

   [[2]]:
   [1] "1.2" ".4"  "1."  "1e3"
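
Until strsplit() has such an argument, the proposed interface can be
approximated with a small wrapper.  This is only a sketch: the name
strsplitKeep is hypothetical, and it assumes an R version with
regmatches(); it is not the S-PLUS implementation mentioned in the
quoted message.

```r
# Hypothetical wrapper illustrating the proposed keep= argument.
# keep = FALSE: `split` matches what to drop (ordinary strsplit()).
# keep = TRUE:  `split` matches what to return.
strsplitKeep <- function(x, split, keep = FALSE) {
    if (!keep) {
        return(strsplit(x, split))
    }
    regmatches(x, gregexpr(split, x))
}

number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
strsplitKeep("1.2, 34, 1.7e-2", "[ ,] *")                    # split by separators
strsplitKeep("Ibuprofin 200mg", number.pattern, keep = TRUE)  # keep the numbers
```

Keeping the default at keep = FALSE leaves existing calls unchanged,
which is the main attraction of adding an argument rather than a new
function.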


> On 4/4/06, Bill Dunlap <bill at insightful.com> wrote:
> > strsplit() is a convenient way to get a
> > list of items from a string when you
> > have a regular expression for what is not
> > an item.  E.g.,
> >
> >   > strsplit("1.2, 34, 1.7e-2", split="[ ,] *")
> >   [[1]]:
> >   [1] "1.2"    "34"     "1.7e-2"
> >
> > However, sometimes it is more convenient to
> > give a pattern for the items you do want.
> > E.g., suppose you want to pull all the numbers
> > out of a string which contains a mix of numbers
> > and words.  Making a pattern for what a
> > number is is simpler than making a pattern
> > for what may come between the numbers.
> >   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
> >
> > I propose adding a keep=FALSE argument to
> > strsplit() to do this.  If keep is FALSE,
> > then the split argument matches the stuff to
> > omit from the output; if keep is TRUE then
> > split matches the stuff to put into the
> > output.  Then we could do the following to
> > get a list of all the numbers in a string
> > (done in a version of strsplit() I'm working on
> > for S-PLUS):
> >
> >   > strsplit("1.2, 34, 1.7e-2", split=number.pattern,keep=TRUE)
> >   [[1]]:
> >   [1] "1.2"    "34"     "1.7e-2"
> >
> >   > strsplit("Ibuprofin 200mg", split=number.pattern,keep=TRUE)
> >   [[1]]:
> >   [1] "200"
> >
> > Is this a reasonable thing to want strsplit to do?
> > Is this a reasonable parameterization of it?

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."


