[R] Extracting a numeric prefix from a string
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Tue Feb 1 00:05:28 CET 2005
(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:
> On 31-Jan-05 R user wrote:
> > You could use something like
> >
> > y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> > as.numeric(y)
> >
> > But maybe there's a much nicer way.
> >
> > Jonne.
>
> I doubt it -- full marks for neat regexp footwork!
Hmm, I'd have to deduct a few points for forgetting to escape the dot...
> x <- "2a4"
> y <- gsub('([0-9]+(.[0-9]+)?)?.*','\\1',x)
> y
[1] "2a4"
> as.numeric(y)
[1] NA
Warning message:
NAs introduced by coercion
and maybe a few more for using gsub() where sub() suffices.
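For what it is worth, here is a minimal sketch of the two fixes just mentioned (escaped dot, sub() instead of gsub()). The anchors and perl = TRUE are assumptions added for this illustration and were not part of the original suggestion; the test strings are made up.

```r
# Sketch only: the suggested pattern with the dot escaped, anchored at
# both ends, and applied with sub() since one substitution is enough.
# perl = TRUE is an assumption of this example (greedy, backtracking matching).
x <- c("2a4", "3.14abc", "42xyz")   # hypothetical inputs
y <- sub('^([0-9]+(\\.[0-9]+)?)?.*$', '\\1', x, perl = TRUE)
num <- as.numeric(y)
print(y)
print(num)
```

With the dot escaped, "2a4" should reduce to "2" rather than passing through unchanged.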
There are a few more nits to pick, since "2.", ".2", "2e-7" are also
numbers, but ".", ".e-2" are not. In fact it seems quite hard even to
handle all cases in, e.g.,
x <- c("2.2abc","2.def",".2ghi",".jkl")
with a single regular expression. The first one that worked for me was
> r <- regexpr('^(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))',x)
> substr(x,r,r+attr(r,"match.length")-1)
[1] "2.2" "2." ".2" ""
but several "obvious" attempts had failed.
The problem is that regular expressions try to find the longest
overall match, but not necessarily the longest match of each
subexpression, so
> sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))?.*','\\1',x)
[1] "2." "2." ".2" ""
even though
> sub('(([0-9]+\\.?)|(\\.[0-9]+)|([0-9]+\\.[0-9]+))','XXX',x)
[1] "XXXabc" "XXXdef" "XXXghi" ".jkl"
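One possible way around the subexpression issue, sketched under the assumption that Perl-compatible matching (perl = TRUE) is acceptable: there, alternatives are tried strictly left to right, so listing the longest form first lets the capture group pick it up. The reordered pattern and the anchors are choices made for this example, not something tested in the original exchange.

```r
# Sketch only: with perl = TRUE, alternation is ordered, so the
# "digits.digits" form is listed first, then "digits." / "digits",
# then ".digits". The outer group is optional so that strings with
# no numeric prefix reduce to "".
x <- c("2.2abc", "2.def", ".2ghi", ".jkl")
pat <- '^(([0-9]+\\.[0-9]+)|([0-9]+\\.?)|(\\.[0-9]+))?.*$'
prefix <- sub(pat, '\\1', x, perl = TRUE)
print(prefix)
print(as.numeric(prefix))
```

The order of the alternatives matters here: with '[0-9]+\\.?' listed first, the group would stop at "2." for "2.2abc".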
Actually, this one comes pretty close:
> sub('([0-9]*(\\.[0-9]+)?)?.*','\\1',x)
[1] "2.2" "2" ".2" ""
It only loses a trailing dot, which is immaterial in the present
context. However, next try extending the RE to handle an exponent
part...
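As a hedged starting point for the exponent case, the sketch below again assumes perl = TRUE. The helper name num_prefix and the test strings are hypothetical, and the pattern only covers plain decimal notation with an optional e/E exponent (no sign, no hex, no Inf/NaN).

```r
# Sketch only: mantissa is "digits[.digits]" or ".digits"; the exponent
# is optional and is taken only when at least one digit follows the e/E,
# so ".2e" yields ".2" and ".e-2" yields no number at all.
num_prefix <- function(x) {
  pat <- '^(([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][-+]?[0-9]+)?)?.*$'
  as.numeric(sub(pat, '\\1', x, perl = TRUE))
}

res <- num_prefix(c("2e-7abc", "2.5E3x", ".2e", "2.def", ".e-2", "."))
print(res)
```

Strings without a valid numeric prefix reduce to "" and therefore come back as NA.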
--
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907