[R] gsub syntax
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sun Nov 27 17:41:40 CET 2005
R is blameless here: it works as documented and in the same way as
POSIX tools. It agrees with 'sed' given the same syntax (modulo the
shell-specific quoting rules), e.g. in csh:
% echo 1973 | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
973
% echo 1973 | sed 's/\([19|20]\)\([0-9][0-9]\)/-\1-\2-/g'
-1-97-3
% echo "73 74 02 1973 1974 2002" | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
73 74 02 973 974 002
so what happened when you were 'comparing with sed'?
"[19|20]" is a character class (containing five characters) matching one
character, not a match for two characters as you seem to imagine. It does
not mean the same as "19|20", which is what you seem to have intended (and
you seem only to want to do the substitution once on each string, so why
use gsub?):
> sub("19|20([0-9][0-9])", "\\1", dates)
[1] "73" "74" "02" "73" "74" "02"
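Note that alternation binds more loosely than concatenation, so "19|20([0-9][0-9])" reads as "19" or "20([0-9][0-9])". It gives the wanted answer here because removing a leading "19" has the same effect. A minimal sketch of a grouped and anchored variant follows; the name 'two_digit' is only illustrative, and the anchors assume that only whole 4-digit years starting with 19 or 20 should be shortened:

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Parentheses limit '|' to the century prefix; '^' and '$' make the
# pattern apply only to strings that are exactly 19xx or 20xx.
two_digit <- sub("^(19|20)([0-9][0-9])$", "\\2", dates)
print(two_digit)
# [1] "73" "74" "02" "73" "74" "02"
```

Because the century is now group 1, the year digits are referred to as "\\2".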
A more direct way which would work e.g. for 1837 would be
sub(".*([0-9]{2}$)", "\\1", dates)
or even better (locale-independent)
sub(".*([[:digit:]]{2}$)", "\\1", dates)
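As a quick check of the end-anchored pattern, the sketch below applies it to the dates from the question plus "1837" (the names 'x' and 'last2' are only illustrative):

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")
x <- c(dates, "1837")

# '.*' absorbs everything before the final two digits, which '$' pins
# to the end of the string; only those two digits are kept.
last2 <- sub(".*([[:digit:]]{2}$)", "\\1", x)
print(last2)
# [1] "73" "74" "02" "73" "74" "02" "37"
```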
Current versions of R have a help page ?regexp explaining what regexps
are. Even 2.0.1 did, although you were asked to update *before* posting
(see the posting guide). It was unambiguous:
     A _character class_ is a list of characters enclosed by '[' and
     ']' matches any single character in that list ...
                     ^^^^^^
     ... Note that alternation does not work inside character classes,
     where '|' has its literal meaning.
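The difference can be seen directly with grepl(). This is a small sketch; the test strings and the names 'in_class' and 'alt' are made up for illustration:

```r
# Inside '[...]' each of the five characters 1, 9, |, 2, 0 matches on
# its own, and '|' is an ordinary character.
in_class <- grepl("[19|20]", c("1", "9", "|", "2", "0", "7"))
print(in_class)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

# Outside a class, '|' is alternation between the strings "19" and "20".
alt <- grepl("19|20", c("1", "19", "20", "|"))
print(alt)
# [1] FALSE  TRUE  TRUE FALSE
```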
On Sun, 27 Nov 2005, John Logsdon wrote:
> Hello
>
> I know that R's string functions are not as extensive as those of Unix but
> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.
> Can anyone explain the following gsub phenomenon to me:
>
>> dates<-c("73","74","02","1973","1974","2002")
>
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year. I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:
Why 'should' it work in a different way to that documented?
>> substr(dates,3,4)
> [1] "" "" "" "73" "74" "02"
>> substr(dates,-2,4)
> [1] "73" "74" "02" "1973" "1974" "2002"
>> substr(dates,4,-2)
> [1] "" "" "" "" "" ""
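Positions counted from the end of the string can be computed with nchar(), since substr() accepts vectors for its start and stop arguments. A minimal sketch (the names 'n' and 'tail2' are only illustrative):

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Length of each string, so the last two characters are n-1 .. n.
n <- nchar(dates)
tail2 <- substr(dates, n - 1, n)
print(tail2)
# [1] "73" "74" "02" "73" "74" "02"
```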
>
> So I tried gsub:
>
>> gsub("[19|20]([0-9][0-9])","\\1",dates)
> [1] "73" "74" "02" "973" "974" "002"
>
> As I understand it (and comparing with sed), the \\1 should take the first
> bracketed string but clearly this doesn't work.
> If I try what should also work:
>
>> gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73" "74" "02" "973" "974" "002"
> On the other hand the following does work:
>
>> gsub("[19|20]([0-9])([0-9])","\\2",dates)
> [1] "73" "74" "02" "73" "74" "02"
>
> So it appears that the substitution takes one character extra to the left
> but the following indicates that the lower limit of the selected range is
> also at fault:
>> s<-c("1","12","123","1234","12345","123456")
>> gsub("[12]([4-6]*)","",s)
> [1] "" "" "3" "34" "345" "3456"
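This result is also as documented: "[12]" matches a single "1" or "2", and "([4-6]*)" then matches whatever run of 4, 5 or 6 follows it, possibly none. A small sketch that lists the matches instead of deleting them (the name 'hits' and the extra string "2456" are only illustrative):

```r
s <- c("1", "12", "123", "1234", "12345", "123456")

# regmatches() with gregexpr() returns every match found in each string.
hits <- regmatches(s, gregexpr("[12]([4-6]*)", s))
print(hits[[6]])
# [1] "1" "2"

# When a 4-6 run directly follows the 1 or 2, it is consumed as well.
print(gsub("[12]([4-6]*)", "", "2456"))
# [1] ""
```

In "123456" the "1" and the "2" are each followed by a character outside 4-6, so each match is one character long and "3456" is left behind.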
>
> Probably more elegant examples could be constructed that could home in on
> the issue.
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
Yes.
> 2) Was it a bug that has since been corrected?
Unfortunately the bug reported two years ago in
> library(fortunes); fortune("WTFM")
still seems extant. See the posting guide for advice on how to correct
it.
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595