[R] Extracting a numeric prefix from a string
McGehee, Robert
Robert.McGehee at geodecapital.com
Tue Feb 1 00:49:38 CET 2005
Perhaps an easier way would be to throw away the offending text at the
end of the strings, rather than matching all possible numeric
formulations at the beginning of the string, that is:
sub("\\.*[[:alpha:]]+$", "", x)
Easier to read, if nothing else, and it allows for 2e-7 as a valid
number. However, this assumes (I think correctly) that no digits
appear in the middle of the string, e.g. 2a3b.
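A minimal sketch of how the suffix-stripping approach behaves. It uses the test vector from the quoted message below, plus two extra inputs ("2e-7abc" and "2a3b") added here purely for illustration:

```r
# Thread's test vector plus two illustrative inputs (assumptions, not from the post)
x <- c("2.2abc", "2.def", ".2ghi", ".jkl", "2e-7abc", "2a3b")

# Drop any dots and letters that run up to the end of the string
stripped <- sub("\\.*[[:alpha:]]+$", "", x)
stripped
# "2.2" "2" ".2" "" "2e-7" "2a3"

# "" and "2a3" are not numbers, so they come back as NA
# (the latter would normally raise a coercion warning)
nums <- suppressWarnings(as.numeric(stripped))
nums
```

The last element shows the stated limitation: with digits in the middle of the string, only the final run of letters is removed and the result is not numeric.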
-----Original Message-----
From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk]
Sent: Monday, January 31, 2005 6:05 PM
To: ted.harding at nessie.mcc.ac.uk
Cc: R user; R-help at stat.math.ethz.ch; Mike White
Subject: Re: [R] Extracting a numeric prefix from a string
(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:
> On 31-Jan-05 R user wrote:
> > You could use something like
> >
> > y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> > as.numeric(y)
> >
> > But maybe there's a much nicer way.
> >
> > Jonne.
> I doubt it -- full marks for neat regexp footwork!
Hmm, I'd have to deduct a few points for forgetting to escape the dot...
> x <- "2a4"
> y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> y
[1] "2a4"
> as.numeric(y)
[1] NA
Warning message:
NAs introduced by coercion
and maybe a few more for using gsub() where sub() suffices.
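For comparison, here is a small sketch of the same expression with the dot escaped and sub() in place of gsub(). The inputs other than "2a4" are made up for illustration:

```r
x <- c("2a4", "2.5kg", "12", "abc")

# Unescaped dot: "." matches any character, so "a4" is taken as a "fraction"
unescaped <- sub("([0-9]+(.[0-9]+)?)?.*", "\\1", x)
unescaped
# "2a4" "2.5" "12" ""

# Escaped dot: only a literal "." may introduce the fractional part
escaped <- sub("([0-9]+(\\.[0-9]+)?)?.*", "\\1", x)
escaped
# "2" "2.5" "12" ""
```

Because the trailing .* makes the pattern consume the whole string in one match, a single substitution with sub() is enough.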
There are a few more nits to pick, since "2.", ".2", "2e-7" are also
numbers, but ".", ".e-2" are not. In fact it seems quite hard even to
handle all cases in, e.g.,
x <- c("2.2abc","2.def",".2ghi",".jkl")
with a single regular expression. The first one that worked for me was
> r <- regexpr('^(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))',x)
> substr(x,r,r+attr(r,"match.length")-1)
[1] "2.2" "2." ".2" ""
but several "obvious" attempts had failed.
The problem is that a regular expression tries to find the longest
overall match, but not necessarily the longest match of each
subexpression, so
> sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))?.*','\\1',x)
[1] "2." "2." ".2" ""
even though
> sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))','XXX',x)
[1] "XXXabc" "XXXdef" "XXXghi" ".jkl"
Actually, this one comes pretty close:
> sub('([0-9]*(\\.[0-9]+)?)?.*','\\1',x)
[1] "2.2" "2" ".2" ""
It only loses a trailing dot which is immaterial in the present
context. However, next try extending the RE to handle an exponent
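One possible way to extend the pattern to an optional exponent is sketched below. This is not from the original message: the pattern, the variable names (num_re, prefix, hit) and the three extra inputs are assumptions made for illustration. The two alternatives start with different characters (a digit or a dot), so the result should not depend on how the engine resolves alternation.

```r
# Thread's test vector plus three illustrative inputs with exponents
x <- c("2.2abc", "2.def", ".2ghi", ".jkl", "2e-7abc", "1.5E+3kg", ".e-2")

# Mantissa: digits with an optional dot and fraction, or a dot followed by digits.
# Exponent (optional): e or E, an optional sign, then digits.
num_re <- "^([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][-+]?[0-9]+)?"

m <- regexpr(num_re, x)
hit <- as.vector(m) > 0            # regexpr() returns -1 where nothing matched

prefix <- rep("", length(x))
prefix[hit] <- regmatches(x, m)    # regmatches() returns matched elements only
prefix
# "2.2" "2." ".2" "" "2e-7" "1.5E+3" ""

# Unmatched entries ("") become NA; "2." parses as 2
nums <- as.numeric(prefix)
nums
```

Here regexpr() with regmatches() plays the same role as the regexpr()/substr() idiom shown earlier, and as.numeric() accepts the trailing dot in "2.", so keeping or losing it makes no difference to the result.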
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907