[R] help with regexp

Thu Oct 6 20:01:50 CEST 2011

Thanks to all who replied! With all these possible solutions it will be hard to find the best one :-).

--- Gabor Grothendieck <ggrothendieck at gmail.com> wrote on Wed, 5 Oct 2011:

> From: Gabor Grothendieck <ggrothendieck at gmail.com>
> Subject: Re: [R] help with regexp
> To: "Jannis" <bt_jannis at yahoo.de>
> CC: r-help at stat.math.ethz.ch
> Date: Wednesday, 5 October 2011, 15:13
> On Wed, Oct 5, 2011 at 7:56 AM, Jannis <bt_jannis at yahoo.de> wrote:
> > Dear list members,
> >
> >
> > I am stuck with using regular expressions.
> >
> >
> > Imagine I have a vector of character strings like:
> >
> > test <- c('filename_1_def.pdf', 'filename_2_abc.pdf')
> >
> > How could I use regular expressions to extract only the
> > 'def'/'abc' parts of these strings?
> >
> >
> > My own attempt yielded no results:
> >
> > testresults <- grep('(?<=filename_[[:digit:]]_).{1,3}(?=.pdf)', perl = TRUE, value = TRUE)
> >
> > I seem to be missing some important concept here.
> > Until now I have always used nested sub() calls like:
> >
> > testresults <- sub('.pdf$', '', sub('^filename_[[:digit:]]_', '', test))
> >
> >
> > but this tends to become cumbersome and I was
> > wondering whether there is a more elegant way to do this?
> >
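
A likely reason the lookaround attempt quoted above returns nothing useful: grep() is never given the vector to search (the x argument is missing), and even with it, grep(value = TRUE) returns the whole matching elements rather than the matched substring. The following is only a minimal sketch, assuming base R's regexpr() and regmatches() are acceptable here; it keeps the lookaround idea and escapes the dot before "pdf". The names pattern and m are illustrative.

```r
test <- c('filename_1_def.pdf', 'filename_2_abc.pdf')

# Lookbehind/lookahead need perl = TRUE; \\. matches a literal dot.
pattern <- '(?<=filename_[[:digit:]]_).{1,3}(?=\\.pdf)'

# regexpr() locates the first match in each string;
# regmatches() extracts the matched text itself.
m <- regexpr(pattern, test, perl = TRUE)
testresults <- regmatches(test, m)
print(testresults)
```

Note that regmatches() silently drops elements without a match, so the result can be shorter than the input vector.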
> 
> Here are a couple of solutions:
> 
> # remove everything up to the last _ as well as everything from . onwards
> gsub(".*_|[.].*", "", test)
> 
> # extract everything that is not a _ provided it is immediately followed by .
> library(gsubfn)
> strapply(test, "([^_]+)[.]", simplify = TRUE)
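
For comparison, the nested sub() calls from the original question can probably be collapsed into a single call with a capture group. This is a sketch in base R only, assuming every file name follows the filename_<digit>_<part>.pdf layout shown in the question; the variable name parts is illustrative.

```r
test <- c('filename_1_def.pdf', 'filename_2_abc.pdf')

# Match the whole string and keep only the captured group (\\1).
parts <- sub('^filename_[[:digit:]]_(.*)[.]pdf$', '\\1', test)

# The gsub() one-liner quoted above should give the same result.
stopifnot(identical(parts, gsub(".*_|[.].*", "", test)))
print(parts)
```

Strings that do not match the pattern are returned unchanged by sub(), so unexpected file names would pass through untouched rather than raise an error.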
> 
> -- 
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>