[R] embedded nuls in 2.10 versus 2.11

Brandon Whitcher bwhitcher at gmail.com
Tue Mar 2 09:33:23 CET 2010


I have been reading binary files, and parsing the output, for some
time now.  I have tried to develop a technique that is as robust as
possible to all the strange things that appear in text fields, not to
mention different global/regional encodings.  I have no control over
the data generated by users, so I would like to be as flexible and
accommodating as possible.  The following code is straightforward, but
will fail with embedded nuls in R <= 2.10:

fid = file(filename, "rb")
readChar(fid, nchars=10)
close(fid)

Previous suggestions from the R-help list led me to consider

fid = file(filename, "rb")
rawToChar(readBin(fid, "raw", 10))
close(fid)

or even

fid = file(filename, "rb")
iconv(rawToChar(readBin(fid, "raw", 10)), to="UTF-8")
close(fid)

to ensure that my output is "well behaved".  With the new error
handling in rawToChar() in R >= 2.11, embedded nuls are no longer
allowed except at the end of the string.  I run across these all the
time in my user data.  How can I recover as much of the text as
possible when reading in from a binary file with embedded nuls in R >=
2.11 and keep the code backwards compatible with R < 2.11?
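
To make the question concrete, here is one possible direction, offered only
as a minimal sketch: drop the nul bytes from the raw vector before it ever
reaches rawToChar(), so the same code path is taken in every version of R.
The helper name read_text_field and the temporary test file are hypothetical
and exist only for illustration. The sketch also assumes that discarding
nuls, rather than splitting the field on them, is acceptable for the data.

```r
# Hypothetical helper (illustration only): read n bytes from an open binary
# connection and return them as text with all nul bytes removed.
read_text_field <- function(con, n) {
  bytes <- readBin(con, what = "raw", n = n)
  # Drop every nul byte, embedded or trailing, before converting to character
  bytes <- bytes[bytes != as.raw(0)]
  # Same conversion as in the post: from the current locale's encoding to UTF-8
  iconv(rawToChar(bytes), to = "UTF-8")
}

# Usage on a small temporary file containing embedded and trailing nuls
tmp <- tempfile()
writeBin(as.raw(c(0x48, 0x69, 0x00, 0x21, 0x00, 0x00)), tmp)  # "Hi", nul, "!", nul, nul
fid <- file(tmp, "rb")
result <- read_text_field(fid, 10)
close(fid)
unlink(tmp)
print(result)  # "Hi!"
```

If the position of the characters within a fixed-width field matters, the
nul bytes could instead be overwritten with spaces, for example
bytes[bytes == as.raw(0)] <- as.raw(0x20), at the cost of extra whitespace
in the result.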

thanks...

Brandon


