[R] Numbers in a string

William Dunlap wdunlap at tibco.com
Thu Dec 16 18:07:53 CET 2010


In S+ strsplit() has a keep=TRUE/FALSE argument to
specify whether to return the substrings that match
the pattern or to return the substrings between
matches to the pattern (the default).  E.g.,

> strings <- c("abcde. 11 abc 5.31e+34, (1.45)",
"AB15E9SDF654VKBN?dvb.65")
> number.pattern <- "[0-9]+\\.[0-9]+e[+-][0-9]+|[0-9]+\\.[0-9]+|[0-9]+"
> strsplit(strings, number.pattern, keep=TRUE)
[[1]]:
[1] "11"       "5.31e+34" "1.45"    

[[2]]:
[1] "15"  "9"   "654" "65" 

> strsplit(strings, number.pattern, keep=FALSE)
[[1]]:
[1] "abcde. " " abc "   ", ("     ")"      

[[2]]:
[1] "AB"        "E"         "SDF"       "VKBN?dvb."
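
For comparison, R's strsplit() has no keep= argument and, as far
as I can tell, always behaves like the keep=FALSE case above. A
small check, reusing the same strings and pattern (the name
'pieces' is only for illustration):

```r
strings <- c("abcde. 11 abc 5.31e+34, (1.45)",
             "AB15E9SDF654VKBN?dvb.65")
number.pattern <- "[0-9]+\\.[0-9]+e[+-][0-9]+|[0-9]+\\.[0-9]+|[0-9]+"

# R's strsplit() returns the substrings between matches of the
# pattern, i.e. what keep=FALSE gives in S+.
pieces <- strsplit(strings, number.pattern)
print(pieces)
```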

In R and S+ gregexpr can tell you the start points
and lengths of each match, but it is a pain to
pass this information to substring() to get the
matches themselves.  Should [g]regexpr() have a
value= argument like grep has?
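
The gregexpr()/substring() bookkeeping can at least be hidden in
a small helper. The following is only a sketch: extractMatches is
a made-up name, not a function in base R or S+, and it assumes
that gregexpr() reports "no match" as -1, as R does.

```r
strings <- c("abcde. 11 abc 5.31e+34, (1.45)",
             "AB15E9SDF654VKBN?dvb.65")
number.pattern <- "[0-9]+\\.[0-9]+e[+-][0-9]+|[0-9]+\\.[0-9]+|[0-9]+"

# Hypothetical helper: return the matched substrings themselves,
# one character vector per element of x.
extractMatches <- function(pattern, x) {
  m <- gregexpr(pattern, x)   # start positions plus match.length attribute
  mapply(function(s, starts) {
    if (starts[1] == -1) return(character(0))   # no match in this string
    ends <- starts + attr(starts, "match.length") - 1
    substring(s, starts, ends)
  }, x, m, SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

matches <- extractMatches(number.pattern, strings)
print(matches)
```

This should reproduce the keep=TRUE result above; strings with no
match give character(0).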

In R the gsubfn package can do this sort of thing.
I don't know if it is worth adding more to base R's
strsplit().

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Petr Savicky
> Sent: Thursday, December 16, 2010 8:42 AM
> To: r-help at r-project.org
> Subject: Re: [R] Numbers in a string
> 
> On Thu, Dec 16, 2010 at 06:17:45AM -0800, Dieter Menne wrote:
> > Petr Savicky wrote:
> > > 
> > > One of the suggestions in this thread was to use an external program.
> > > A possible solution without negation in Perl is
> > > 
> > >   @a = ("AB15E9SDF654VKBN?dvb.65" =~ m/[0-9]/g);
> > >   print @a, "\n";
> > >   15965465
> > > 
> > > 
> > 
> > Which is
> > 
> >  gsub("[^0-9]", "", "AB15E9SDF654VKBN?dvb.65")
> > 
> > as Henrique suggested.
> 
> I agree. The Perl code was a reply to a question, whether the same can be
> done by describing the required elements and not by describing the ones to
> be removed. This could be useful, if we want to extract elements described
> by a more complex regular expression. A more accurate, although not
> complete and definitely not the best, extraction of nonnegative numbers
> in Perl may be done as follows
> 
>   @a = ("abcde. 11 abc 5.31e+34, (1.45)" =~ m/[0-9]+\.[0-9]+e[+-][0-9]+|[0-9]+\.[0-9]+|[0-9]+/g);
>   print join(" ", @a), "\n";
>   11 5.31e+34 1.45
> 
> Can something similar be done in R either specifically for numbers or
> for a general regular expression?
> 
> Going back to the original question, the answer depends on the complexity of
> extracting numbers in a concrete situation. If possible, using functions
> within R is suggested (gsub(), strsplit(), ...). On the other hand, there
> are cases, where an external tool can be helpful. See also R-intro
> Chapter 7 Reading data from files, which says
> 
>   There is a clear presumption by the designers of R that you will be
>   able to modify your input files using other tools, such as file editors
>   or Perl to fit in with the requirements of R.
> 
> Petr Savicky.
> 


