[R] puzzle using gsub (and encodings maybe)

Duncan Murdoch murdoch at stats.uwo.ca
Wed Oct 14 20:31:37 CEST 2009


On 10/14/2009 2:29 PM, Adrian Dragulescu wrote:
> Thank you.
> 
> If I use
>>gsub(" \xad", "-", x)
> [1] "NEW YORK-NEW ENGLAND"
> 
> I get what I want.

Right, that's simpler than what I suggested.
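
For anyone following along without the original file, here is a minimal sketch that rebuilds x from the charToRaw() output quoted further down and applies the same replacement. The fixed = TRUE and useBytes = TRUE arguments are my additions, not part of Adrian's call; I am assuming they make the match byte-wise so it does not depend on the session locale.

```r
# Rebuild x and y from the charToRaw() dumps quoted in this thread.
x <- rawToChar(as.raw(c(0x4e, 0x45, 0x57, 0x20, 0x59, 0x4f, 0x52, 0x4b, 0x20, 0xad,
                        0x4e, 0x45, 0x57, 0x20, 0x45, 0x4e, 0x47, 0x4c, 0x41, 0x4e, 0x44)))
y <- "NEW YORK -NEW ENGLAND"

# Find the byte positions where the two strings differ (both are 21 bytes long).
diff_pos <- which(charToRaw(x) != charToRaw(y))

# Replace "space + byte 0xad" with a plain hyphen, matching on raw bytes.
# Assumption: fixed = TRUE and useBytes = TRUE are added here for locale
# independence; the call in the thread was gsub(" \xad", "-", x).
x_fixed <- gsub(" \xad", "-", x, fixed = TRUE, useBytes = TRUE)

print(diff_pos)
print(x_fixed)
```

The only differing byte is the one after the space: ad in x versus 2d in y, which is why the pattern " -" never matched.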

Duncan Murdoch

> 
> Adrian
> 
>> sessionInfo()
> R version 2.9.2 (2009-08-24)
> i386-pc-mingw32
> 
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> 
> On Wed, 14 Oct 2009, Prof Brian Ripley wrote:
> 
>> On Wed, 14 Oct 2009, Adrian Dragulescu wrote:
>>
>>>> charToRaw(x)
>>> [1] 4e 45 57 20 59 4f 52 4b 20 ad 4e 45 57 20 45 4e 47 4c 41 4e 44
>>>> charToRaw(y)
>>> [1] 4e 45 57 20 59 4f 52 4b 20 2d 4e 45 57 20 45 4e 47 4c 41 4e 44
>>>> 
>>> 
>>> So they are different.
>>
>> We really do need the 'at a minimum' information we asked you for in the 
>> posting guide.  But in cp1252 (a guess as to what you might be using) \xad is 
>> a 'soft hyphen', and that is not the same thing as a hyphen -- you will get 
>> the same issues with 'non-breaking space'.
>>
>> BDR
>>
>>> 
>>> Adrian
>>> 
>>> I use R 2.8.1 on WinXP
>>> 
>>> 
>>> On Wed, 14 Oct 2009, Duncan Murdoch wrote:
>>> 
>>>> On 10/14/2009 1:30 PM, Adrian Dragulescu wrote:
>>>>> Hello,
>>>>> 
>>>>> Below is some output that shows my issue.
>>>>> 
>>>>> I have a variable x that I read from a file (more on this below)
>>>>> 
>>>>>> x
>>>>> [1] "NEW YORK NEW ENGLAND"
>>>>>> gsub(" -", "-", x)            # this does not work!
>>>>> [1] "NEW YORK NEW ENGLAND"
>>
>> Well, I see no hyphen at all here, but then I am not on Windows.
>>
>>>> It looks as though it worked, presumably because something got lost in 
>>>> your email.
>>>> 
>>>> Could you post charToRaw(x) so we can see what's in x?
>>>> 
>>>> Duncan Murdoch
>>>> 
>>>>>> Encoding(x)                   # is x in a special encoding? no
>>>>> [1] "unknown"
>>>>>> y = "NEW YORK -NEW ENGLAND"   # I type in variable y
>>>>>> gsub(" -", "-", y)            # and gsub works as expected
>>>>> [1] "NEW YORK-NEW ENGLAND"
>>>>>> 
>>>>> 
>>>>> I'm sure the problem has to do with the way I read the variable x.  But 
>>>>> even if I change the encoding for x to ASCII, I still cannot do the sub.
>>>>> I get x by reading a pdf file with pdftotext so you will not be able to 
>>>>> replicate my issue.
>>>>> 
>>>>> Thanks for any suggestions,
>>>>> Adrian
>>
>> -- 
>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>
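
The point quoted above about the soft hyphen and the non-breaking space in cp1252 can be sketched as follows. This is only an illustration under assumptions: it assumes the bytes really are CP1252 (the locale shown by sessionInfo() suggests so), that the local iconv knows the name "CP1252", and the variable names are made up for the example.

```r
# Same bytes as charToRaw(x) in the thread.
x <- rawToChar(as.raw(c(0x4e, 0x45, 0x57, 0x20, 0x59, 0x4f, 0x52, 0x4b, 0x20, 0xad,
                        0x4e, 0x45, 0x57, 0x20, 0x45, 0x4e, 0x47, 0x4c, 0x41, 0x4e, 0x44)))

# Assumption: the input is CP1252. Convert it to UTF-8 so the special
# characters can be addressed by their Unicode code points.
x_utf8 <- iconv(x, from = "CP1252", to = "UTF-8")

# Inspect the code points: 173 (U+00AD) is the soft hyphen,
# 160 (U+00A0) would be a non-breaking space.
code_points <- utf8ToInt(x_utf8)

# Turn "space + soft hyphen" into a plain hyphen, then map any
# non-breaking space to an ordinary space.
x_clean <- gsub(" \u00ad", "-", x_utf8, fixed = TRUE)
x_clean <- gsub("\u00a0", " ", x_clean, fixed = TRUE)

print(code_points[10])
print(x_clean)
```

Converting first is a design choice, not a requirement: it makes the pattern readable as a named Unicode character rather than a raw byte, at the cost of having to know the source encoding.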



