[Rd] R 2.9.2 crashes when sorting latin1-encoded strings
Stefan Evert
stefan.evert at uos.de
Wed Sep 30 11:11:48 CEST 2009
Hi everyone!
I think I stumbled over a bug in the latest R 2.9.2 patched for OS X:
> R version 2.9.2 Patched (2009-09-24 r49861)
> i386-apple-darwin9.8.0
When I try to sort latin1-encoded character vectors, R sometimes
crashes with a segmentation fault. I'm running OS X 10.5.8 and have
observed this behaviour both with the i386 and x86_64 builds, in the
R.app GUI as well as on the command line.
Here's a minimal example that reliably triggers the crash on my machine:
=====
print(sessionInfo())
words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
str(words)
print(table(Encoding(words)))
Encoding(words) <- "latin1" # this is the correct encoding!
print(table(Encoding(words)))
N <- 1000
words <- rep(words, length.out=N)
print(N)
for (i in 1:N) {
  x <- words[1:i]
  # the following line will crash for some i, depending on the particular
  # strings in <words> and the subset selected for <x> above
  order(x)
}
=====
The output I get from this code is appended at the end of the mail.
Note that R incorrectly declares the latin1 strings in <words> to have
UTF-8 encoding (this seems wrong to me because the \x escapes insert
raw bytes into the string). The crash only occurs if the correct
"latin1" encoding (or "unknown") is explicitly specified. Otherwise
the string handling code appears to ignore everything after the first
invalid multibyte character.
I haven't been able to trigger the bug without some kind of loop. The
crash always occurs at the same iteration, but this changes depending
on the contents of <words> and the specific subset selected in each
loop iteration. Also note that the 64-bit version of R gives a
different error message. If I omit the unrelated statement
"print(N)", the 64-bit version segfaults and the 32-bit version just
hangs with high CPU load. All this suggests to me that there must be
some insidious memory corruption or stack/range overflow in the
internal ordering code.
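To pin down the failing iteration, the loop could be instrumented roughly as follows. This is only a sketch: the variable failed_at is a name made up for illustration, and tryCatch() can only intercept R-level errors (such as the 'translateCharUTF8' message), not a segfault or bus error, which is why the index is also printed and flushed before each call.

```r
words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
Encoding(words) <- "latin1"
N <- 1000
words <- rep(words, length.out = N)

failed_at <- NA_integer_  # hypothetical name: first i at which order() raised an R error
for (i in seq_len(N)) {
  x <- words[1:i]
  # print and flush the index first, so that the last value shown
  # identifies the iteration even if the process dies in order()
  cat("i =", i, "\n")
  flush(stdout())
  ok <- tryCatch({
    order(x)
    TRUE
  }, error = function(e) {
    message("order() failed at i = ", i, ": ", conditionMessage(e))
    FALSE
  })
  if (!ok) {
    failed_at <- i
    break
  }
}
print(failed_at)  # NA if no R-level error occurred
```

On a build that is not affected, the loop should simply run to completion and failed_at stays NA.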
Can other people reproduce this problem on different platforms and
possibly with different versions of R?
BTW, I ran into the crash when trying to read.delim() a file in latin1
encoding, using either encoding="latin1" or fileEncoding="latin1", and
then converting it back and forth between a character vector and a
factor. I still don't understand what's going on there. The
behaviour of read.delim() seems to depend very much on my locale
settings when running R, which is rather unpleasant. Is there a way
to find out how strings are stored internally (i.e. getting the exact
byte representation) and whether R believes them to be in UTF-8 or
latin1 encoding?
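One possible way to look at this, sketched under the assumption that base R's charToRaw() and Encoding() report what is needed here: charToRaw() shows the bytes of a single string exactly as stored, and Encoding() shows the declared encoding of each element. The three-element vector is just an illustrative subset of <words>, and the iconv() step is added only to show how the bytes change after a conversion to UTF-8.

```r
words <- c("aa", "a\xfc", "\xe4\xfc")
Encoding(words) <- "latin1"

# raw bytes of each element, exactly as stored
# (charToRaw() only handles one string, hence lapply)
bytes <- lapply(words, charToRaw)
print(bytes)

# declared encoding of each element; pure ASCII strings stay "unknown"
print(Encoding(words))

# convert the bytes from latin1 to UTF-8 and inspect them again
utf8 <- iconv(words, from = "latin1", to = "UTF-8")
print(lapply(utf8, charToRaw))
print(Encoding(utf8))
```

In the latin1 version "a\xfc" is stored as the two bytes 61 fc, whereas after the conversion the same string takes three bytes (61 c3 bc).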
Best regards,
Stefan Evert
[ stefan.evert at uos.de | http://purl.org/stefan.evert ]
Output of sample code on my machine (32-bit version):
> > print(sessionInfo())
> R version 2.9.2 Patched (2009-09-24 r49861)
> i386-apple-darwin9.8.0
>
> locale:
> en_GB/en_GB/C/C/en_GB/en_GB
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> >
> > words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
> > str(words)
> chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
> > print(table(Encoding(words)))
>
> unknown UTF-8
> 2 5
> >
> > Encoding(words) <- "latin1" # this is the correct encoding!
> > print(table(Encoding(words)))
>
> latin1 unknown
> 5 2
> >
> > N <- 1000
> > words <- rep(words, length.out=N)
> >
> > print(N)
> [1] 1000
> > for (i in 1:N) {
> + x <- words[1:i]
> + # the following line will crash for some i, depending on the particular
> + # strings in <words> and the subset selected for <x> above
> + order(x)
> + }
>
> *** caught bus error ***
> address 0x86, cause 'non-existent physical address'
>
> Traceback:
> 1: order(x)
> aborting ...
> Bus error
64-bit version:
> > print(sessionInfo())
> R version 2.9.2 Patched (2009-09-24 r49861)
> x86_64-apple-darwin9.8.0
>
> locale:
> en_GB/en_GB/C/C/en_GB/en_GB
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> >
> > words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
> > str(words)
> chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
> > print(table(Encoding(words)))
>
> unknown UTF-8
> 2 5
> >
> > Encoding(words) <- "latin1" # this is the correct encoding!
> > print(table(Encoding(words)))
>
> latin1 unknown
> 5 2
> >
> > N <- 1000
> > words <- rep(words, length.out=N)
> >
> > print(N)
> [1] 1000
> > for (i in 1:N) {
> + x <- words[1:i]
> + # the following line will crash for some i, depending on the particular
> + # strings in <words> and the subset selected for <x> above
> + order(x)
> + }
> Error in order(x) : 'translateCharUTF8' must be called on a CHARSXP
> Execution halted
>