[R] gsub syntax

Sun Nov 27 17:41:40 CET 2005

R is blameless here: it works as documented and in the same way as 
POSIX tools.  It agrees with 'sed' using the same syntax (modulo the 
shell-specific quoting rules), e.g. in csh:

    % echo 1973 | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
    973
    % echo 1973 | sed 's/\([19|20]\)\([0-9][0-9]\)/-\1-\2-/g'
    -1-97-3
    % echo "73 74 02 1973 1974 2002" | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
    73 74 02 973 974 002

so what happened when you were 'comparing with sed'?
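
To see what each group captures, the second sed command can be mirrored 
in R.  This is only a sketch: it assumes R's default (extended) regexps, 
where groups are written ( ) rather than \( \), and the input vector is 
made up for illustration.

```r
# Same pattern as the second sed command, with sed's \( \) written as ( ).
x <- c("73", "1973", "2002")
res <- gsub("([19|20])([0-9][0-9])", "-\\1-\\2-", x)
print(res)
# "73" is untouched: the pattern needs three characters to match.
# In "1973" the class consumes only the "1", leaving "97" for the group,
# and the trailing "3" is too short to match again.
```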

"[19|20]" is a character class (containing five characters) matching one 
character, not two as you seem to imagine.  It does not mean the same as 
the alternation "19|20", which is what you seem to have intended.  You 
also seem to want only one substitution per string, so why use gsub?  
With sub:

> sub("(19|20)([0-9][0-9])", "\\2", dates)
[1] "73" "74" "02" "73" "74" "02"
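
The difference between a class and an alternation can be checked 
directly with grepl.  A small sketch (the test vectors are invented for 
illustration); note that | binds more loosely than concatenation, so an 
alternation generally needs its own parentheses.

```r
# The class matches any ONE of its five members, including the literal "|".
in_class <- grepl("^[19|20]$", c("1", "9", "|", "2", "0", "19", "20"))
print(in_class)

# A parenthesised alternation matches the two-character strings instead.
in_alt <- grepl("^(19|20)$", c("19", "20", "1", "|"))
print(in_alt)

# Without parentheses, "^19|20$" means "^19" OR "20$".
loose <- grepl("^19|20$", c("1999", "1820", "1850"))
print(loose)
```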

A more direct way which would work e.g. for 1837 would be

sub(".*([0-9]{2}$)", "\\1", dates)

or even better (locale-independent)

sub(".*([[:digit:]]{2}$)", "\\1", dates)
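
As a check, here is a sketch comparing an alternation restricted to the 
prefixes 19 and 20 with the anchored "last two digits" approach.  The 
vector 'years' is invented for illustration, and the $ is written outside 
the group, which should be equivalent.

```r
years <- c("73", "1973", "2002", "1837")

# Alternation limited to the prefixes 19 and 20: "1837" is left alone.
by_prefix <- sub("^(19|20)([0-9][0-9])$", "\\2", years)
print(by_prefix)

# Keep whatever the last two digits are, regardless of the century.
by_suffix <- sub(".*([[:digit:]]{2})$", "\\1", years)
print(by_suffix)
```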

Current versions of R have a help page ?regexp explaining what regexps 
are.  Even 2.0.1 did, although you were asked to update *before* posting 
(see the posting guide).  It was unambiguous:

    A _character class_ is a list of characters enclosed by '[' and
    ']' matches any single character in that list ...
                    ^^^^^^
    ...  Note that alternation does not work inside character classes,
    where '|' has its literal meaning.

On Sun, 27 Nov 2005, John Logsdon wrote:

> Hello
>
> I know that R's string functions are not as extensive as those of Unix but
> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.
> Can anyone explain the following gsub phenomenon to me:
>
>> dates<-c("73","74","02","1973","1974","2002")
>
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year.  I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:

Why 'should' it work in a different way to that documented?
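
Counting from the end of a string is easy enough with nchar, which is 
vectorized.  A minimal sketch ('last_two' is a name made up here, not a 
base R function):

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Hypothetical helper: the last two characters of each string.
last_two <- function(x) {
  n <- nchar(x)
  substr(x, n - 1, n)
}
print(last_two(dates))
```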

>> substr(dates,3,4)
> [1] ""   ""   ""   "73" "74" "02"
>> substr(dates,-2,4)
> [1] "73"   "74"   "02"   "1973" "1974" "2002"
>> substr(dates,4,-2)
> [1] "" "" "" "" "" ""
>
> So I tried gsub:
>
>> gsub("[19|20]([0-9][0-9])","\\1",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> As I understand it (and comparing with sed), the \\1 should take the first
> bracketed string but clearly this doesn't work.
> If I try what should also work:
>
>> gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> On the other hand the following does work:
>
>> gsub("[19|20]([0-9])([0-9])","\\2",dates)
> [1] "73" "74" "02" "73" "74" "02"
>
> So it appears that the substitution takes one character extra to the left
> but the following indicates that the lower limit of the selected range is
> also at fault:
>> s<-c("1","12","123","1234","12345","123456")
>> gsub("[12]([4-6]*)","",s)
> [1] ""     ""     "3"    "34"   "345"  "3456"
>
> Probably more elegant examples could be constructed that could home in on
> the issue.
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?

Yes.
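
Nothing is 'at fault' in your example with 's' either: [12] matches a 
single 1 or 2, and ([4-6]*) matches zero or more of 4, 5, 6 immediately 
after it.  In those strings a 1 or 2 is never directly followed by 4-6, so 
the group always matches the empty string and the call merely deletes each 
1 and 2.  A sketch to confirm (the string "1456" is an extra case added 
for illustration):

```r
s <- c("1", "12", "123", "1234", "12345", "123456")
with_group <- gsub("[12]([4-6]*)", "", s)
plain      <- gsub("[12]", "", s)
print(with_group)
print(identical(with_group, plain))

# The group only consumes something when 4-6 directly follows a 1 or 2.
extra <- gsub("[12]([4-6]*)", "", "1456")
print(extra)
```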

> 2) Was it a bug that has since been corrected?

Unfortunately the bug reported two years ago in

> library(fortunes); fortune("WTFM")

still seems extant.  See the posting guide for advice on how to correct 
it.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595