[Rd] R 2.9.2 crashes when sorting latin1-encoded strings
Prof Brian Ripley
ripley at stats.ox.ac.uk
Mon Oct 5 16:16:09 CEST 2009
This was a missing PROTECT() in do_order.
But I'll echo what Simon Urbanek said: don't do that but rather use
the documented ways to re-encode the file as you read it. (Latin-1
used to be needed for collation on Mac OS X as C-level collation in
UTF-8 was completely broken -- but we have worked around that.)
We provided fileEncoding= in read.table for those who failed to RTFM
and thought encoding= was to set the file encoding, but it seems that
encodings are simply too hard a concept for some R users.
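
To make the distinction concrete, here is a minimal sketch (not part of the original exchange). The temporary file, its contents and the column name `word` are made up for illustration. With fileEncoding= the input is re-encoded to the native encoding as it is read, whereas encoding= performs no conversion and only marks the strings that were read as Latin-1.

```r
# Hypothetical scratch file holding one header line and two Latin-1 words.
tmp <- tempfile(fileext = ".txt")
latin1_bytes <- as.raw(c(0x77, 0x6f, 0x72, 0x64, 0x0a,  # "word\n" (header)
                         0x61, 0xfc, 0x0a,              # "a" + u-umlaut in Latin-1
                         0x62, 0xe4, 0x0a))             # "b" + a-umlaut in Latin-1
writeBin(latin1_bytes, tmp)

# fileEncoding=: the file is declared to be Latin-1 and is converted
# to the native encoding of the session while it is read.
d1 <- read.table(tmp, header = TRUE, fileEncoding = "latin1",
                 stringsAsFactors = FALSE)

# encoding=: the bytes are read unchanged; non-ASCII strings are only
# marked as Latin-1.
d2 <- read.table(tmp, header = TRUE, encoding = "latin1",
                 stringsAsFactors = FALSE)

Encoding(d1$word)          # declared encoding after re-encoding
Encoding(d2$word)          # declared encoding after marking only
charToRaw(d2$word[1])      # bytes are still the Latin-1 bytes from the file

unlink(tmp)
```

What d1 contains depends on the session's native encoding, which is exactly why re-encoding on input is usually the less surprising choice in a UTF-8 locale.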
On Wed, 30 Sep 2009, Stefan Evert wrote:
> Hi everyone!
>
> I think I stumbled over a bug in the latest R 2.9.2 patched for OS X:
>
>> R version 2.9.2 Patched (2009-09-24 r49861)
>> i386-apple-darwin9.8.0
>
>
> When I try to sort latin1-encoded character vectors, R sometimes crashes with
> a segmentation fault. I'm running OS X 10.5.8 and have observed this
> behaviour both with the i386 and x86_64 builds, in the R.app GUI as well as
> on the command line.
>
> Here's a minimal example that reliably triggers the crash on my machine:
>
> =====
> print(sessionInfo())
>
> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
> str(words)
>
> print(table(Encoding(words)))
> Encoding(words) <- "latin1" # this is the correct encoding!
> print(table(Encoding(words)))
>
> N <- 1000
> words <- rep(words, length.out=N)
>
> print(N)
> for (i in 1:N) {
>   x <- words[1:i]
>   # the following line will crash for some i, depending on the particular
>   # strings in <words> and the subset selected for <x> above
>   order(x)
> }
> =====
>
> The output I get from this code is appended at the end of the mail. Note that
> R incorrectly declares the latin1 strings in <words> to have UTF-8 encoding
> (this seems wrong to me because the \x escapes insert raw bytes into the
> string). The crash only occurs if the correct "latin1" encoding (or
> "unknown") is explicitly specified. Otherwise the string handling code
> appears to ignore everything after the first invalid multibyte character.
>
> I haven't been able to trigger the bug without some kind of loop. The crash
> always occurs at the same iteration, but this changes depending on the
> contents of <words> and the specific subset selected in each loop iteration.
> Also note that the 64-bit version of R gives a different error message. If I
> omit the unrelated statement "print(N)", the 64-bit version segfaults and the
> 32-bit version just hangs with high CPU load. All this suggests to me that
> there must be some insidious memory corruption or stack/range overflow in the
> internal ordering code.
>
> Can other people reproduce this problem on different platforms and possibly
> with different versions of R?
>
>
> BTW, I ran into the crash when trying to read.delim() a file in latin1
> encoding, using either encoding="latin1" or fileEncoding="latin1", and then
> converting it back and forth between a character vector and a factor. I
> still don't understand what's going on there. The behaviour of read.delim()
> seems to depend very much on my locale settings when running R, which is
> rather unpleasant. Is there a way to find out how strings are stored
> internally (i.e. getting the exact byte representation) and whether R
> believes them to be in UTF-8 or latin1 encoding?
>
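
One way to inspect both things, sketched here with base R only and not taken from the original thread: Encoding() reports the declared encoding of each element, and charToRaw() shows the bytes exactly as stored. The vector below is a shortened, hypothetical subset of the example above.

```r
# Strings built from raw Latin-1 bytes via \x escapes, as in the example.
words <- c("aa", "a\xfc", "\xe4\xfc")
Encoding(words) <- "latin1"      # declare the non-ASCII elements to be Latin-1

Encoding(words)                  # declared encoding, one value per element
lapply(words, charToRaw)         # exact byte representation of each element

# Converting to UTF-8 changes the stored bytes and the declared encoding.
utf8 <- enc2utf8(words)
Encoding(utf8)
lapply(utf8, charToRaw)
```

Pure ASCII elements such as "aa" are always reported as "unknown", since they are valid in every encoding concerned; only the non-ASCII elements carry a latin1 or UTF-8 mark.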
>
> Best regards,
> Stefan Evert
>
> [ stefan.evert at uos.de | http://purl.org/stefan.evert ]
>
>
>
>
>
> Output of sample code on my machine:
>
>>> print(sessionInfo())
>> R version 2.9.2 Patched (2009-09-24 r49861)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> en_GB/en_GB/C/C/en_GB/en_GB
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
>>> str(words)
>> chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
>>> print(table(Encoding(words)))
>>
>> unknown UTF-8
>> 2 5
>>>
>>> Encoding(words) <- "latin1" # this is the correct encoding!
>>> print(table(Encoding(words)))
>>
>> latin1 unknown
>> 5 2
>>>
>>> N <- 1000
>>> words <- rep(words, length.out=N)
>>>
>>> print(N)
>> [1] 1000
>>> for (i in 1:N) {
>> + x <- words[1:i]
>> + # the following line will crash for some i, depending on the particular
>> + # strings in <words> and the subset selected for <x> above
>> + order(x)
>> + }
>>
>> *** caught bus error ***
>> address 0x86, cause 'non-existent physical address'
>>
>> Traceback:
>> 1: order(x)
>> aborting ...
>> Bus error
>
> 64-bit version:
>
>>> print(sessionInfo())
>> R version 2.9.2 Patched (2009-09-24 r49861)
>> x86_64-apple-darwin9.8.0
>>
>> locale:
>> en_GB/en_GB/C/C/en_GB/en_GB
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
>>> str(words)
>> chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
>>> print(table(Encoding(words)))
>>
>> unknown UTF-8
>> 2 5
>>>
>>> Encoding(words) <- "latin1" # this is the correct encoding!
>>> print(table(Encoding(words)))
>>
>> latin1 unknown
>> 5 2
>>>
>>> N <- 1000
>>> words <- rep(words, length.out=N)
>>>
>>> print(N)
>> [1] 1000
>>> for (i in 1:N) {
>> + x <- words[1:i]
>> + # the following line will crash for some i, depending on the particular
>> + # strings in <words> and the subset selected for <x> above
>> + order(x)
>> + }
>> Error in order(x) : 'translateCharUTF8' must be called on a CHARSXP
>> Execution halted
>>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595