[Rd] R 2.9.2 crashes when sorting latin1-encoded strings

Mon Oct 5 16:16:09 CEST 2009

This was a missing PROTECT() in do_order.

But I'll echo what Simon Urbanek said: don't do that but rather use 
the documented ways to re-encode the file as you read it.  (Latin-1 
used to be needed for collation on Mac OS X as C-level collation in 
UTF-8 was completely broken -- but we have worked around that.)
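
A minimal sketch of re-encoding strings that are already in memory,
and of checking what R holds internally.  The sample vector and the
variable names are illustrative only; Encoding(), iconv() and
charToRaw() are the base R functions assumed here.

```r
# Latin-1 bytes entered with \x escapes: "a" followed by u-umlaut / a-umlaut.
words <- c("aa", "a\xfc", "a\xe4")

# Declare what the bytes are.  This only sets a mark on the strings;
# the bytes themselves are not changed, and pure-ASCII strings are
# never marked.
Encoding(words) <- "latin1"
Encoding(words)

# Convert the bytes themselves to UTF-8.
utf8 <- iconv(words, from = "latin1", to = "UTF-8")
Encoding(utf8)

# Inspect the exact byte representation of a single string.
charToRaw(words[2])   # Latin-1: 61 fc
charToRaw(utf8[2])    # UTF-8:   61 c3 bc

# Sorting then works on consistently encoded strings; the resulting
# order depends on the collation rules of the current locale.
order(utf8)
```

Encoding() reports only the declared mark, so charToRaw() is the more
direct check when the mark and the actual bytes might disagree.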

We provided fileEncoding= in read.table for those who failed to RTFM 
and thought encoding= was to set the file encoding, but it seems that 
encodings are simply too hard a concept for some R users.
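
To illustrate the difference between the two arguments, here is a
hedged sketch.  The temporary file and its contents are made up for
the example, and the byte values in the comments assume a session
running in a UTF-8 locale.

```r
# Create a small Latin-1 encoded file: a header line plus two words
# containing u-umlaut (0xfc) and a-umlaut (0xe4).
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeBin(c(charToRaw("word\n"),
           as.raw(c(0x61, 0xfc, 0x0a)),
           as.raw(c(0x62, 0xe4, 0x0a))), con)
close(con)

# fileEncoding= names the encoding of the file; the contents are
# re-encoded to the session's native encoding as they are read.
d <- read.table(path, header = TRUE, fileEncoding = "latin1",
                stringsAsFactors = FALSE)
charToRaw(d$word[1])    # in a UTF-8 locale: 61 c3 bc

# encoding= performs no conversion; it only declares how the
# strings that were read should be marked.
d2 <- read.table(path, header = TRUE, encoding = "latin1",
                 stringsAsFactors = FALSE)
Encoding(d2$word)

unlink(path)
```

With fileEncoding= the strings arrive in the native encoding, so later
calls such as order() or factor() do not have to deal with a mixture
of encodings.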

On Wed, 30 Sep 2009, Stefan Evert wrote:

> Hi everyone!
>
> I think I stumbled over a bug in the latest R 2.9.2 patched for OS X:
>
>> R version 2.9.2 Patched (2009-09-24 r49861)
>> i386-apple-darwin9.8.0
>
>
> When I try to sort latin1-encoded character vectors, R sometimes crashes with 
> a segmentation fault.  I'm running OS X 10.5.8 and have observed this
> behaviour both with the i386 and x86_64 builds, in the R.app GUI as well as 
> on the command line.
>
> Here's a minimal example that reliably triggers the crash on my machine:
>
> =====
> print(sessionInfo())
>
> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
> str(words)
>
> print(table(Encoding(words)))
> Encoding(words) <- "latin1"  # this is the correct encoding!
> print(table(Encoding(words)))
>
> N <- 1000
> words <- rep(words, length.out=N)
>
> print(N)
> for (i in 1:N) {
> x <- words[1:i]
> # the following line will crash for some i, depending on the particular
> # strings in <words> and the subset selected for <x> above
> order(x)
> }
> =====
>
> The output I get from this code is appended at the end of the mail. Note that 
> R incorrectly declares the latin1 strings in <words> to have UTF-8 encoding 
> (this seems wrong to me because the \x escapes insert raw bytes into the 
> string). The crash only occurs if the correct "latin1" encoding (or 
> "unknown") is explicitly specified. Otherwise the string handling code 
> appears to ignore everything after the first invalid multibyte character.
>
> I haven't been able to trigger the bug without some kind of loop.  The crash 
> always occurs at the same iteration, but this changes depending on the 
> contents of <words> and the specific subset selected in each loop iteration. 
> Also note that the 64-bit version of R gives a different error message.  If I 
> omit the unrelated statement "print(N)", the 64-bit version segfaults and the 
> 32-bit version just hangs with high CPU load. All this suggests to me that 
> there must be some insidious memory corruption or stack/range overflow in the 
> internal ordering code.
>
> Can other people reproduce this problem on different platforms and possibly 
> with different versions of R?
>
>
> BTW, I ran into the crash when trying to read.delim() a file in latin1 
> encoding, using either encoding="latin1" or fileEncoding="latin1", and then 
> converting it back and forth between a character vector and a factor.  I 
> still don't understand what's going on there.  The behaviour of read.delim() 
> seems to depend very much on my locale settings when running R, which is 
> rather unpleasant.  Is there a way to find out how strings are stored 
> internally (i.e. getting the exact byte representation) and whether R 
> believes them to be in UTF-8 or latin1 encoding?
>
>
> Best regards,
> Stefan Evert
>
> [ stefan.evert at uos.de | http://purl.org/stefan.evert ]
>
> Output of sample code on my machine:
>
>>> print(sessionInfo())
>> R version 2.9.2 Patched (2009-09-24 r49861)
>> i386-apple-darwin9.8.0
>> 
>> locale:
>> en_GB/en_GB/C/C/en_GB/en_GB
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> 
>>> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
>>> str(words)
>> chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
>>> print(table(Encoding(words)))
>> 
>> unknown   UTF-8
>>    2       5
>>> 
>>> Encoding(words) <- "latin1"  # this is the correct encoding!
>>> print(table(Encoding(words)))
>> 
>> latin1 unknown
>>    5       2
>>> 
>>> N <- 1000
>>> words <- rep(words, length.out=N)
>>> 
>>> print(N)
>> [1] 1000
>>> for (i in 1:N) {
>> +   x <- words[1:i]
>> +   # the following line will crash for some i, depending on the particular
>> +   # strings in <words> and the subset selected for <x> above
>> +   order(x)
>> + }
>> 
>> *** caught bus error ***
>> address 0x86, cause 'non-existent physical address'
>> 
>> Traceback:
>> 1: order(x)
>> aborting ...
>> Bus error
>
> 64-bit version:
>
>>> print(sessionInfo())
>> R version 2.9.2 Patched (2009-09-24 r49861)
>> x86_64-apple-darwin9.8.0
>> 
>> locale:
>> en_GB/en_GB/C/C/en_GB/en_GB
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> 
>>> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
>>> str(words)
>> chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
>>> print(table(Encoding(words)))
>> 
>> unknown   UTF-8
>>    2       5
>>> 
>>> Encoding(words) <- "latin1"  # this is the correct encoding!
>>> print(table(Encoding(words)))
>> 
>> latin1 unknown
>>    5       2
>>> 
>>> N <- 1000
>>> words <- rep(words, length.out=N)
>>> 
>>> print(N)
>> [1] 1000
>>> for (i in 1:N) {
>> +   x <- words[1:i]
>> +   # the following line will crash for some i, depending on the particular
>> +   # strings in <words> and the subset selected for <x> above
>> +   order(x)
>> + }
>> Error in order(x) : 'translateCharUTF8' must be called on a CHARSXP
>> Execution halted
>> 
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595