[R] more on paste and bug
Saikat DebRoy
saikat at stat.wisc.edu
Wed Oct 10 21:14:33 CEST 2001
>>>>> "Peter" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
Peter> Ott Toomet <siim at obs.ee> writes:
>> Hi,
>>
>> dput( ce0) gives a correct answer:
>> > dput( ce0)
>> c("1985", "9", "2", "2", "1", "A", "1", "", "NA", "5", "1999" )
>>
>> So does plain print( ce0):
>> > print( ce0)
>> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
>> [11] "1999"
>>
>> However, if I make a new similar vector ce0a:
>> > ce0a <- c( 1985,9,2,2,1,"A",1,"",NA,5,1999)
>>
>> Then the paste works correctly:
>> > paste( ce0a, m, sep="", collapse="")
>> [1] "1985<1>9<2>2<3>2<4>1<5>A<6>1<7><8>NA<9>5<0>1999END"
>>
>> I had m as
>> > m
>> [1] "<1>" "<2>" "<3>" "<4>" "<5>" "<6>" "<7>" "<8>" "<9>" "<0>"
>> [11] "END"
>>
>> So I have two apparently similar vectors which behave differently
>> with paste:
>> > paste( ce0a, m, sep="", collapse="")
>> [1] "1985<1>9<2>2<3>2<4>1<5>A<6>1<7><8>NA<9>5<0>1999END"
>> > paste( ce0, m, sep="", collapse="")
>> [1] "1985<1>9<2>2<3>2<4>1<5>A1<7>NA<9>5<0>1999END"
>> > ce0a
>> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
>> [11] "1999"
>> > ce0
>> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
>> [11] "1999"
>>
>> I suspect there are some hidden attributes somewhere in ce0
>> which I have not noticed (there do not seem to be factors). The
>> problem seems to arise with the non-numerical columns (ce0 is
>> just part of one row of the big dataframe). Is it possible to
>> figure out what is going on, and possibly change it? At least
>> attributes() shows nothing:
>> > attributes(ce0)
>> NULL
>> > attributes(ce0a)
>> NULL
Peter> Hmmm. The plot would seem to thicken around the entries in
Peter> ce0 corresponding to <6> and <8>. If these accidentally
Peter> contain \0 characters, much would be explained. Maybe also
Peter> other weird characters.
As it happens, I think the problem is in the read.dta code. The relevant
piece of code is in foreign/src/stataread.c (lines 317-324):
        default:
            charlen=INTEGER(types)[j]-STATA_STRINGOFFSET;
            PROTECT(tmp=allocString(charlen+1));
            InStringBinary(fp,charlen,CHAR(tmp));
            CHAR(tmp)[charlen]=0;
            SET_STRING_ELT(VECTOR_ELT(df,j),i,tmp);
            UNPROTECT(1);
            break;
In this case the string "A" is written in the file as two bytes
(I do not know why), the second byte being '\0'. So the above code
creates a CHARSXP of length 3 whose last two bytes are '\0'.
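
The effect of sizing the string by the field width can be sketched in
plain C, without the R headers. The helper names field_len and
copy_field are hypothetical and are not part of foreign or R. The
sketch assumes that a fixed-width field is padded with '\0' bytes, as
the "A" field appears to be here, and measures the payload up to the
first '\0' so that the stored length and strlen agree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: number of bytes in a fixed-width field before
 * the first '\0', or width if the field contains no '\0' at all. */
size_t field_len(const char *field, size_t width)
{
    const char *nul = memchr(field, '\0', width);
    return nul ? (size_t)(nul - field) : width;
}

/* Hypothetical helper: copy a fixed-width field into a freshly
 * allocated, null-terminated buffer sized by its payload rather than
 * by the field width. The payload length is stored in *len.
 * Returns NULL if allocation fails; the caller frees the result. */
char *copy_field(const char *field, size_t width, size_t *len)
{
    assert(field != NULL && len != NULL);
    *len = field_len(field, width);
    char *out = malloc(*len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, field, *len);
    out[*len] = '\0';
    return out;
}
```

For the two-byte field {'A', '\0'}, field_len returns 1, whereas
sizing by charlen+1 as in the excerpt gives a length of 3.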
Peter> What happens if you do nchar(ce0) ? What if you omit the
Peter> collapse= argument?
nchar uses strlen, so it would report a length of 1 for that element.
By the way, by looking at the code for mkChar and paste, it seems that
R is _not_ storing null terminated strings - mkChar only allocates
storage for strlen(name) and not strlen(name)+1 and paste uses LENGTH
to get the string length. At the same time strlen is used in
do_nchar. Could there be a potential problem here? Maybe you should
use strnlen in do_nchar?
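
The mismatch between a stored length and strlen can be made concrete
with a small sketch in plain C. The type lstring is a hypothetical
stand-in for a string that carries its own length, not R's CHARSXP,
and ls_concat is an assumed length-based concatenation, not the code
in paste. ls_visible_len plays the role of strnlen; it uses memchr
because strnlen is POSIX rather than ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string that carries its own length. */
typedef struct {
    size_t len;       /* stored length, may count embedded '\0' bytes */
    char   data[32];  /* fixed buffer, large enough for this sketch */
} lstring;

/* Build an lstring from len raw bytes (embedded '\0' allowed). */
lstring ls_make(const char *bytes, size_t len)
{
    lstring s;
    assert(len < sizeof s.data);
    memset(s.data, 0, sizeof s.data);
    memcpy(s.data, bytes, len);
    s.len = len;
    return s;
}

/* Length-based concatenation: copies a.len + b.len bytes and does not
 * stop at '\0'. */
lstring ls_concat(lstring a, lstring b)
{
    lstring r;
    assert(a.len + b.len < sizeof r.data);
    memset(r.data, 0, sizeof r.data);
    memcpy(r.data, a.data, a.len);
    memcpy(r.data + a.len, b.data, b.len);
    r.len = a.len + b.len;
    return r;
}

/* Length as a C-string function would see it, bounded by the stored
 * length (the same idea as strnlen). */
size_t ls_visible_len(lstring s)
{
    const char *nul = memchr(s.data, '\0', s.len);
    return nul ? (size_t)(nul - s.data) : s.len;
}
```

Concatenating ls_make("A\0\0", 3) with ls_make("<6>", 3) gives a
stored length of 6 but a visible length of 1: the "<6>" is in the
buffer, yet anything that treats the result as a C string sees only
"A". That would be consistent with the missing <6> in the output
quoted above, though it does not show exactly what paste does.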
Saikat