[R] Numbers in a string
William Dunlap
wdunlap at tibco.com
Thu Dec 16 18:07:53 CET 2010
In S+ strsplit() has a keep=TRUE/FALSE argument to
specify whether to return the substrings that match
the pattern or to return the substrings between
matches to the pattern (the default). E.g.,
> strings <- c("abcde. 11 abc 5.31e+34, (1.45)",
               "AB15E9SDF654VKBN?dvb.65")
> number.pattern <- "[0-9]+\\.[0-9]+e[+-][0-9]+|[0-9]+\\.[0-9]+|[0-9]+"
> strsplit(strings, number.pattern, keep=TRUE)
[[1]]:
[1] "11" "5.31e+34" "1.45"
[[2]]:
[1] "15" "9" "654" "65"
> strsplit(strings, number.pattern, keep=FALSE)
[[1]]:
[1] "abcde. " " abc " ", (" ")"
[[2]]:
[1] "AB" "E" "SDF" "VKBN?dvb."
In R and S+ gregexpr can tell you the start points
and lengths of each match, but it is a pain to
pass this information to substring() to get the
matches themselves. Should [g]regexpr() have a
value= argument like grep has?
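
A minimal sketch of that gregexpr()/substring() bookkeeping in R, using the same strings and pattern as above. The helper name extract.matches is hypothetical (it is not a base R function), and the sketch assumes the input contains no NA values.

```r
strings <- c("abcde. 11 abc 5.31e+34, (1.45)",
             "AB15E9SDF654VKBN?dvb.65")
number.pattern <- "[0-9]+\\.[0-9]+e[+-][0-9]+|[0-9]+\\.[0-9]+|[0-9]+"

# extract.matches() is a hypothetical helper, not part of base R.
# It returns, for each element of x, the substrings that match pattern.
extract.matches <- function(x, pattern) {
  m <- gregexpr(pattern, x)
  lapply(seq_along(x), function(i) {
    start <- as.integer(m[[i]])            # start position of each match
    len <- attr(m[[i]], "match.length")    # length of each match
    if (start[1] == -1L) {
      return(character(0))                 # gregexpr() reports -1 for "no match"
    }
    # substring() recycles x[i] over the vectors of start and end positions
    substring(x[i], start, start + len - 1L)
  })
}

matches <- extract.matches(strings, number.pattern)
print(matches)
```

In R 2.14.0 and later, regmatches(strings, gregexpr(number.pattern, strings)) returns the same list without the manual bookkeeping.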
In R the gsubfn package can do this sort of thing.
I don't know if it is worth adding more to base R's
strsplit().
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Petr Savicky
> Sent: Thursday, December 16, 2010 8:42 AM
> To: r-help at r-project.org
> Subject: Re: [R] Numbers in a string
>
> On Thu, Dec 16, 2010 at 06:17:45AM -0800, Dieter Menne wrote:
> > Petr Savicky wrote:
> > >
> > > One of the suggestions in this thread was to use an external program.
> > > A possible solution without negation in Perl is
> > >
> > > @a = ("AB15E9SDF654VKBN?dvb.65" =~ m/[0-9]/g);
> > > print @a, "\n";
> > > 15965465
> > >
> > >
> >
> > Which is
> >
> > gsub("[^0-9]", "", "AB15E9SDF654VKBN?dvb.65")
> >
> > as Henrique suggested.
>
> I agree. The Perl code was a reply to a question, whether the same can be
> done by describing the required elements and not by describing the ones to
> be removed. This could be useful, if we want to extract elements described
> by a more complex regular expression. A more accurate, although not
> complete and definitely not the best, extraction of nonnegative numbers
> in Perl may be done as follows
>
> @a = ("abcde. 11 abc 5.31e+34, (1.45)" =~
> m/[0-9]+\.[0-9]+e[+-][0-9]+|[0-9]+\.[0-9]+|[0-9]+/g);
> print join(" ", @a), "\n";
> 11 5.31e+34 1.45
>
> Can something similar be done in R either specifically for numbers or
> for a general regular expression?
>
> Going back to the original question, the answer depends on the complexity of
> extracting numbers in a concrete situation. If possible, using functions
> within R is suggested (gsub(), strsplit(), ...). On the other hand, there
> are cases, where an external tool can be helpful. See also R-intro
> Chapter 7 Reading data from files, which says
>
> There is a clear presumption by the designers of R that you will be
> able to modify your input files using other tools, such as file editors
> or Perl to fit in with the requirements of R.
>
> Petr Savicky.