[Rd] Embedded nuls in strings
Herve Pages
hpages at fhcrc.org
Wed Aug 8 00:29:16 CEST 2007
Duncan Murdoch wrote:
> On 07/08/2007 5:06 PM, Herve Pages wrote:
>> Hi,
>>
>> ?rawToChar
>> 'rawToChar' converts raw bytes either to a single character string
>> or a character vector of single bytes. (Note that a single
>> character string could contain embedded nuls.)
>>
>> Allowing embedded nuls in a string might be an interesting experiment
>> but it
>> seems to cause some troubles to most of the string manipulation
>> functions.
>>
>> A string with an embedded 0:
>>
>> raw0 <- as.raw(c(65:68, 0 , 70))
>> string0 <- rawToChar(raw0)
>>
>>> string0
>> [1] "ABCD\0F"
>>
>> nchar() should return 6:
>>> nchar(string0)
>> [1] 4
>
> You don't state your R version. The default type of counting in nchar()
> has recently changed from "bytes" (where 6 is correct) to "chars" (where
> 4 is correct).
Oops, sorry:
> sessionInfo()
R version 2.6.0 Under development (unstable) (2007-07-02 r42107)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] rcompgen_0.1-15
And indeed:
> raw0 <- as.raw(c(65:68, 0, 70))
> string0 <- rawToChar(raw0)
> nchar(string0, type="chars")
[1] 4
> nchar(string0, type="bytes")
[1] 6
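To make the "chars" versus "bytes" distinction concrete without involving a nul, here is a small sketch using a multibyte character. The variable name e_acute is only illustrative, and the sketch assumes the "\u00e9" escape yields a UTF-8 encoded string, which is how R stores \u escapes as far as I know:

```r
# Illustrative only: one accented letter, written with a \u escape so that
# the string is stored in UTF-8 (e-acute takes two bytes in UTF-8).
e_acute <- "\u00e9"

n_chars <- nchar(e_acute, type = "chars")  # counts characters
n_bytes <- nchar(e_acute, type = "bytes")  # counts bytes of the stored encoding

# The two counts differ for this string even though no nul is involved.
print(c(chars = n_chars, bytes = n_bytes))
```

So a gap between the two counts is not specific to embedded nuls; it shows up whenever one character occupies more than one byte.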
In addition to the string functions already mentioned, it's worth noting that
'paste' doesn't seem to be "embedded nul aware" either:
> paste(string0, "G", sep="")
[1] "ABCDG"
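For comparison, a possible workaround (a sketch only, not a claim about how paste() should behave) is to stay with raw vectors and never go through a character string. The names rawG and joined are made up for this example:

```r
# Sketch of a workaround: keep the data as a raw vector instead of a string.
raw0 <- as.raw(c(65:68, 0, 70))   # bytes of "ABCD", a nul, then "F"
rawG <- charToRaw("G")            # the single byte for "G"

# c() concatenates raw vectors byte by byte, so nothing after the nul is lost.
joined <- c(raw0, rawG)

print(joined)                     # all bytes, nul included
print(length(joined))             # byte count of the concatenated data
print(which(joined == as.raw(0))) # position of the embedded nul
```

Concatenating this way keeps the nul and the trailing "F", which is what one would have expected from paste() on string0.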
Same for serialization:
> save(string0, file="string0.rda")
> load("string0.rda")
> string0
[1] "ABCD"
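The same comparison for serialization, again only as a sketch: saving the raw vector rather than the string and loading it back into a separate environment. The names rda_file, restored and round_trip_ok are hypothetical, and a temporary file is used so nothing is left in the working directory:

```r
# Sketch: round-trip the raw vector (not the string) through save()/load().
raw0 <- as.raw(c(65:68, 0, 70))

rda_file <- tempfile(fileext = ".rda")
save(raw0, file = rda_file)

# Load into a fresh environment so the original raw0 is not overwritten.
restored <- new.env()
load(rda_file, envir = restored)

# TRUE only if every byte, including the nul and the "F", came back.
round_trip_ok <- identical(restored$raw0, raw0)
print(round_trip_ok)

unlink(rda_file)                  # remove the temporary file
```

If round_trip_ok is TRUE, the truncation seen above is specific to character strings and not to the save/load format itself.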
One comment about the nchar man page:
'chars' The number of human-readable characters.
Here "human-readable" seems to mean "everything but a nul", which can be confusing.
For example, one would generally not think of ASCII codes 1 to 31 as "human-readable", but
nchar() seems to disagree:
> string1 <- rawToChar(as.raw(1:31))
> string1
[1]
"\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037"
> nchar(string1, type="chars")
[1] 31
Cheers,
H.
>
> Duncan Murdoch
>