[R] gsub syntax

Gabor Grothendieck ggrothendieck at gmail.com
Sun Nov 27 15:50:30 CET 2005


On 11/27/05, John Logsdon <j.logsdon at quantex-research.com> wrote:
> Hello
>
> I know that R's string functions are not as extensive as those of Unix but

I don't think this statement is true although I have seen it repeated.

> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.

Free versions of these utilities are available for Windows although they
don't come with Windows.  e.g. Google for gawk.

>
> Can anyone explain the following gsub phenomenon to me:
>
> > dates<-c("73","74","02","1973","1974","2002")
>
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year.  I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:
>
> > substr(dates,3,4)
> [1] ""   ""   ""   "73" "74" "02"
> > substr(dates,-2,4)
> [1] "73"   "74"   "02"   "1973" "1974" "2002"
> > substr(dates,4,-2)
> [1] "" "" "" "" "" ""
>
> So I tried gsub:
>
> > gsub("[19|20]([0-9][0-9])","\\1",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> As I understand it (and comparing with sed), the \\1 should take the first
> bracketed string but clearly this doesn't work.  If I try what should also
> work:
>
> > gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> On the other hand the following does work:
>
> > gsub("[19|20]([0-9])([0-9])","\\2",dates)
> [1] "73" "74" "02" "73" "74" "02"
>
> So it appears that the substitution takes one character extra to the left
> but the following indicates that the lower limit of the selected range is
> also at fault:
>
> > s<-c("1","12","123","1234","12345","123456")
> > gsub("[12]([4-6]*)","",s)
> [1] ""     ""     "3"    "34"   "345"  "3456"
>
> Probably more elegant examples could be constructed that could home in on
> the issue.
>
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
>
> 2) Was it a bug that has since been corrected?
>
> 3) Is it still a bug in the latest version?
>

It works the same on my system, which is R 2.2.0 patched (2005-10-24)
on Windows. At first I too thought it was a bug, but I noticed it
works the same in perl, so now I am not sure. The following perl
program, run with perl 5.8.6 on Windows, gives 002 as the answer too:

   $_ = "2002";
   s/[19|20]([0-9])([0-9])/\1\2/g;
   print;
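
One reading that is consistent with both the R and the perl output: inside
square brackets, 19|20 is not an alternation but a character class, so
[19|20] matches a single character out of 1, 9, |, 2 and 0. The pattern then
consumes one such character plus two digits, which turns "1973" into "97"
followed by the untouched "3". The "\\2" variant only looks right by
coincidence: "197" is replaced by "7" and the trailing "3" is left alone.
A minimal sketch follows; the names class_result and group_result are mine,
and the grouped pattern assumes that 4-digit years always start with 19 or 20:

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Inside [...], "19|20" is just a set of single characters: 1, 9, |, 2, 0.
# The pattern therefore consumes ONE such character plus two digits.
class_result <- gsub("[19|20]([0-9][0-9])", "\\1", dates)
print(class_result)  # "73" "74" "02" "973" "974" "002"

# Parentheses give real alternation; ^ and $ anchor to the whole string,
# so 2-digit inputs do not match and are returned unchanged.
group_result <- sub("^(19|20)([0-9][0-9])$", "\\2", dates)
print(group_result)  # "73" "74" "02" "73" "74" "02"
```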

In any case, it could be done like this:

   sub(".*(..)$", "\\1", dates)

or

   substring(dates, nchar(dates)-1)

or the following, which appends -01-01 to the year, converts it to Date
class, implicitly converts it back to character and then extracts
the 3rd and 4th characters of the result:

   substring(as.Date(sprintf("%s-01-01", dates)), 3, 4)
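
As a quick check, the first two alternatives above can be run side by side
on the original data. This is only a sketch: right_chars() is a hypothetical
helper name, not a base R function, and the Date-based variant is left out
because how years below 1000 are printed can depend on the platform.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Regex approach: capture the final two characters of each string.
by_regex <- sub(".*(..)$", "\\1", dates)

# Position approach: start one character before the end; substring()
# reads through to the end of the string by default.
by_position <- substring(dates, nchar(dates) - 1)

# right_chars() is a hypothetical helper that takes the last n characters,
# standing in for the negative index that substr() does not support.
right_chars <- function(x, n) substring(x, nchar(x) - n + 1)

stopifnot(identical(by_regex, by_position),
          identical(right_chars(dates, 2), by_position))
print(by_regex)  # "73" "74" "02" "73" "74" "02"
```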

or



