[R] glob2rx() {was: no bug in R2.1.0's list.files()}

Martin Maechler maechler at stat.math.ethz.ch
Thu May 12 20:19:13 CEST 2005


>>>>> "BaRow" == Barry Rowlingson <B.Rowlingson at lancaster.ac.uk>
>>>>>     on Thu, 12 May 2005 11:05:43 +0100 writes:

    BaRow> Uwe Ligges wrote:
    >> Please read about regular expressions (!!!) and try to
    >> understand that ".txt" also finds "Not_a_txt_file.xls"
    >> ....


    BaRow>   The confusion here is between regular expressions
    BaRow> and wildcard expansion known as 'globbing'. The two
    BaRow> things are very different, and use characters such as
    BaRow> '*' '.' and '?' in different ways.

Exactly.  I devised a "glob" to "regexp" function many
years ago in order to help newbies make the transition.

That function, nowadays called 'glob2rx', is part of our
CRAN package "sfsmisc" and hence available to all via
 
       install.packages("sfsmisc")
       library("sfsmisc")

But it's quite simple (though not trivial to read for the
inexperienced because of the many escapes ("\") needed)
and it may be helpful to see its code on R-help, below.
Then, this topic has led me to add 2 (obvious in hindsight)
logical optional arguments to the function so that it now looks like

glob2rx <- function(pattern, trim.head = FALSE, trim.tail = TRUE)
{
    ## Purpose: Change "ls" aka "wildcard" aka "globbing" _pattern_ to
    ##	      Regular Expression (as in grep, perl, emacs, ...)
    ## -------------------------------------------------------------------------
    ## Author: Martin Maechler ETH Zurich, ~ 1991
    ##	       New version using [g]sub() : 2004
    p <- gsub('\\.','\\\\.', paste('^', pattern, '$', sep=''))
    p <- gsub('\\?',	 '.',  gsub('\\*',  '.*', p))
    ## these are trimming '.*$' and '^.*' - in most cases only for esthetics
    if(trim.tail) p <- sub("\\.\\*\\$$", '', p)
    if(trim.head) p <- sub("\\^\\.\\*",  '', p)
    p
}


So those confused newbies (and DOS long timers!)
could use

      list.files(myloc, glob2rx("*.zip"), full.names=TRUE)

            ## (yes, make a habit of using 'TRUE', not 'T' ..)
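
To see what this buys you without touching the file system, here is a
small sketch that applies the pattern to a character vector instead.
The file names are made up for illustration, and glob2rx() is repeated
from above only so that the snippet runs on its own:

```r
## glob2rx() as defined above, repeated so this runs stand-alone
glob2rx <- function(pattern, trim.head = FALSE, trim.tail = TRUE)
{
    p <- gsub('\\.','\\\\.', paste('^', pattern, '$', sep=''))
    p <- gsub('\\?', '.', gsub('\\*', '.*', p))
    if(trim.tail) p <- sub("\\.\\*\\$$", '', p)
    if(trim.head) p <- sub("\\^\\.\\*",  '', p)
    p
}

## hypothetical file names, standing in for a directory listing
files <- c("data.zip", "old.zip.bak", "unzip_me.txt", "a.ZIP")

## the glob, translated to a regexp: only names ending in ".zip"
rx   <- glob2rx("*.zip")
zips <- grep(rx, files, value = TRUE)
print(zips)

## the naive regexp ".zip" means "any character followed by zip"
## and therefore picks up more than intended
naive <- grep(".zip", files, value = TRUE)
print(naive)
```

Note that the match is case sensitive, so "a.ZIP" is found by neither
pattern.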

The current example code, BTW, has

    stopifnot(glob2rx("abc.*") == "^abc\\.",
               glob2rx("a?b.*") == "^a.b\\.",
               glob2rx("a?b.*", trim.tail=FALSE) == "^a.b\\..*$",
               glob2rx("*.doc") == "^.*\\.doc$",
               glob2rx("*.doc", trim.head=TRUE) == "\\.doc$",
               glob2rx("*.t*")  == "^.*\\.t",
               glob2rx("*.t??") == "^.*\\.t..$"
     )
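
That the trimming is "only for esthetics" can be checked by comparing
trimmed and untrimmed regexps on the same input.  The sketch below
uses the regexp strings from the example code directly; the file names
are hypothetical:

```r
## hypothetical file names for illustration
files <- c("a1b.txt", "a1b.", "xa1b.txt", "a12b.txt")

trimmed   <- "^a.b\\."       # glob2rx("a?b.*")
untrimmed <- "^a.b\\..*$"    # glob2rx("a?b.*", trim.tail=FALSE)

m.trimmed   <- grepl(trimmed,   files)
m.untrimmed <- grepl(untrimmed, files)
print(m.trimmed)
print(identical(m.trimmed, m.untrimmed))

## same comparison for trim.head, with "*.doc"
docs   <- c("report.doc", "report.docx", "doc.txt")
m.full <- grepl("^.*\\.doc$", docs)   # glob2rx("*.doc")
m.head <- grepl("\\.doc$",    docs)   # glob2rx("*.doc", trim.head=TRUE)
print(identical(m.full, m.head))
```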


Martin Maechler,
ETH Zurich


    BaRow>   There's added confusion when people come from a DOS
    BaRow> background, where commands did their own thing when
    BaRow> given '*' as parameter. The DOS command:

    BaRow>   RENAME *.FOO *.BAR

    BaRow>   did what seems obvious, renaming all the .FOO files
    BaRow> to .BAR, but on a unix machine doing this with 'mv'
    BaRow> can be destructive!

    BaRow>   In short (and slightly simplified), a '*' when
    BaRow> expanded as a wildcard in a glob matches any string,
    BaRow> whereas a '*' in a regular expression (regexp),
    BaRow> matches the previous character 0 or more times. This
    BaRow> is why "*.zip" is flagged as invalid now - there's no
    BaRow> character before the "*".

    BaRow>   That should be enough clues to send you on your
    BaRow> way.

    BaRow>   Baz
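
The difference described in the quoted text can be checked directly
with grepl().  This is only a small illustration with made-up strings,
not part of the original exchange:

```r
## '*' in a regexp repeats the previous item 0 or more times
star <- grepl("^ab*c$", c("ac", "abc", "abbbc", "adc"))
print(star)

## an unescaped '.' matches any character, so ".txt" does not
## mean "ends in .txt"
loose  <- grepl(".txt",    "Not_a_txt_file.xls")
strict <- grepl("\\.txt$", "Not_a_txt_file.xls")
print(c(loose, strict))

## the regexp counterpart of the glob "*.zip"
zip <- grepl("^.*\\.zip$", c("a.zip", "zip", "a.zip.old"))
print(zip)
```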



