[Rd] Embedded nuls in strings
Duncan Murdoch
murdoch at stats.uwo.ca
Wed Aug 8 02:10:27 CEST 2007
On 07/08/2007 6:29 PM, Herve Pages wrote:
> Duncan Murdoch wrote:
>> On 07/08/2007 5:06 PM, Herve Pages wrote:
>>> Hi,
>>>
>>> ?rawToChar
>>> 'rawToChar' converts raw bytes either to a single character string
>>> or a character vector of single bytes. (Note that a single
>>> character string could contain embedded nuls.)
>>>
>>> Allowing embedded nuls in a string might be an interesting experiment,
>>> but it seems to cause trouble for most of the string manipulation
>>> functions.
>>>
>>> A string with an embedded 0:
>>>
>>> raw0 <- as.raw(c(65:68, 0 , 70))
>>> string0 <- rawToChar(raw0)
>>>
>>>> string0
>>> [1] "ABCD\0F"
>>>
>>> nchar() should return 6:
>>>> nchar(string0)
>>> [1] 4
>> You don't state your R version. The default type of counting in nchar()
>> has recently changed from "bytes" (where 6 is correct) to "chars" (where
>> 4 is correct).
>
>
> Oops, sorry:
>
>> sessionInfo()
> R version 2.6.0 Under development (unstable) (2007-07-02 r42107)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] rcompgen_0.1-15
>
>
> And indeed:
> raw0 <- as.raw(c(65:68, 0 , 70))
> string0 <- rawToChar(raw0)
>
>> nchar(string0, type="chars")
> [1] 4
>> nchar(string0, type="bytes")
> [1] 6
>
>
> In addition to the string functions already mentioned before, it's worth noting that
> 'paste' doesn't seem to be "embedded nul aware" either:
>
>> paste(string0, "G", sep="")
> [1] "ABCDG"
>
> Same for serialization:
>
>> save(string0, file="string0.rda")
>> load("string0.rda")
>> string0
> [1] "ABCD"
Of these, I'd say serialization is the only case where it would be
reasonable to fix the behaviour. R depends on C run-time functions for
most string operations, and they stop at the first nul. So if this
isn't documented behaviour, it should be, but it's not reasonable to
rewrite the C run-time string functions just to handle such weird
objects. Functions like "grep" rely on thousands of lines of code not
written by us, and in my opinion maintaining changes to that code is not
something the R project should take on.
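
To illustrate the point about the C run-time, here is a minimal sketch in C. It is not R's actual source; the names STRING0, visible_length and append_suffix are made up for this example. It only shows how standard nul-terminated functions treat the same six bytes as string0:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* The same six bytes as string0 in the thread: "ABCD", a nul, then "F".
   (Hypothetical name, used only for this illustration.) */
const char STRING0[6] = {'A', 'B', 'C', 'D', '\0', 'F'};

/* Length as the C run-time sees it: strlen() stops at the first nul. */
size_t visible_length(const char *s) {
    return strlen(s);
}

/* Append `suffix` to `s` through a nul-terminated API ("%s" formatting).
   Bytes after the first nul in `s` never reach `out`. */
void append_suffix(char *out, size_t cap, const char *s, const char *suffix) {
    assert(cap > 0);
    snprintf(out, cap, "%s%s", s, suffix);
}
```

Here sizeof STRING0 is 6 but visible_length(STRING0) is 4, and appending "G" yields "ABCDG", which mirrors the nchar() and paste() results quoted above.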

As to serialization: there's a comment in the source saying that embedded
nuls are handled by it, and that's true up to R-patched, but not in
R-devel. Looks like someone has introduced a bug.
Duncan Murdoch
>
> One comment about the nchar man page:
> 'chars' The number of human-readable characters.
>
> "human-readable" seems to be used for "everything but a nul" here which can be confusing.
> For example one would generally think of ascii codes 1 to 31 as non "human-readable" but
> nchar() seems to disagree:
>
>> string1 <- rawToChar(as.raw(1:31))
>> string1
> [1] "\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037"
>> nchar(string1, type="chars")
> [1] 31
No, "human-readable" also has other meanings in multi-byte encodings.
If an e-acute is encoded in two bytes in your locale, it still only
counts as one human-readable character.
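
A small sketch in C of the bytes-versus-characters distinction, assuming the string is valid UTF-8. The helper names utf8_chars and byte_count are hypothetical, and this is not necessarily how nchar() is implemented; it only shows why a two-byte e-acute counts as one character:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of bytes before the terminating nul. */
size_t byte_count(const char *s) {
    assert(s != NULL);
    return strlen(s);
}

/* Number of characters in a UTF-8 string, assuming the input is valid.
   Continuation bytes have the bit pattern 10xxxxxx and are skipped,
   so each multi-byte sequence is counted once, at its lead byte. */
size_t utf8_chars(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For an e-acute encoded in UTF-8 as the bytes 0xC3 0xA9, byte_count() gives 2 while utf8_chars() gives 1. Control characters such as codes 1 to 31 are single bytes that are not continuation bytes, so each still counts as one character.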