[Rd] Embedded nuls in strings
Duncan Murdoch
murdoch at stats.uwo.ca
Wed Aug 8 03:30:11 CEST 2007
On 07/08/2007 9:13 PM, Herve Pages wrote:
> Duncan Murdoch wrote:
>> On 07/08/2007 6:29 PM, Herve Pages wrote:
> [...]
>>> Same for serialization:
>>>
>>>> save(string0, file="string0.rda")
>>>> load("string0.rda")
>>>> string0
>>> [1] "ABCD"
>> Of these, I'd say the serialization is the only case where it would be
>> reasonable to fix the behaviour. R depends on C run-time functions for
>> most of the string operations, and they'll stop at a null. So if this
>> isn't documented behaviour, it should be, but it's not reasonable to
>> rewrite the C run-time string functions just to handle such weird
>> objects. Functions like "grep" require thousands of lines of code, not
>> written by us, and in my opinion maintaining changes to it is not
>> something the R project should take on.
>
> I was not (of course) suggesting to fix all the string manipulation functions.
> I'm just wondering why R would try to support embedded nuls in the first
> place given that they can only be a source of trouble.
I think this support predates raw vectors, so character strings would
have been the only way to handle data with embedded nuls. C has problems
with those, but not all other languages do.
>
> What about this:
>
> > string0
> [1] "ABCD\0F"
> > string0 == "ABCD"
> [1] TRUE
>
> string0 is obviously different from "ABCD"!
This is documented behaviour, from ?Comparison:
"When comparisons are made between character strings, parts of the
strings after embedded 'nul' characters are ignored. (This is
necessary as the position of 'nul' in the collation sequence is
undefined, and we want one of '<', '==' and '>' to be true for any
comparison.)"
But notice
> identical(string0, "ABCD")
[1] FALSE
This is documented as
"Comparison of character strings allows for embedded 'nul'
characters."
Duncan Murdoch
>
> Maybe it's easier to change the semantics of rawToChar() so it doesn't return
> a string with embedded nuls. More generally speaking, base functions should
> always return "clean" strings.
>
>> As to serialization: there's a comment in the source that embedded
>> nulls are handled by it, and that's true up to R-patched, but not in
>> R-devel. Looks like someone has introduced a bug.
>>
>> Duncan Murdoch
>>> One comment about the nchar man page:
>>> 'chars' The number of human-readable characters.
>>>
>>> "human-readable" seems to be used for "everything but a nul" here
>>> which can be confusing.
>>> For example one would generally think of ascii codes 1 to 31 as non
>>> "human-readable" but
>>> nchar() seems to disagree:
>>>
>>>> string1 <- rawToChar(as.raw(1:31))
>>>> string1
>>> [1] "\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037"
>>>
>>>> nchar(string1, type="chars")
>>> [1] 31
>> No, "human-readable" also has other meanings in multi-byte encodings. If
>> an e-acute is encoded in two bytes in your locale, it still only counts
>> as one human-readable character.
>>