[R] more on paste and bug

Wed Oct 10 21:28:39 CEST 2001

On Wed, Oct 10, 2001 at 02:14:33PM -0500, Saikat DebRoy wrote:
> >>>>> "Peter" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
> 
>   Peter> Ott Toomet <siim at obs.ee> writes:
>   >> Hi,
>   >> 
>   >> dput( ce0) gives a correct answer:
>   >> > dput( ce0)
>   >> c("1985", "9", "2", "2", "1", "A", "1", "", "NA", "5", "1999" )
>   >> 
>   >> A plain print( ce0) shows the same:
>   >> > print( ce0)
>   >> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
>   >> [11] "1999"
>   >> 
>   >> However, if I make a new similar vector ce0a:
>   >> > ce0a <- c( 1985,9,2,2,1,"A",1,"",NA,5,1999)
>   >> 
>   >> Then the paste works correctly:
>   >> > paste( ce0a, m, sep="", collapse="")
>   >> [1] "1985<1>9<2>2<3>2<4>1<5>A<6>1<7><8>NA<9>5<0>1999END"
>   >> 
>   >> I had m as:
>   >> > m
>   >> [1] "<1>" "<2>" "<3>" "<4>" "<5>" "<6>" "<7>" "<8>" "<9>" "<0>" "END"
>   >> 
>   >> So I have two apparently similar vectors which behave differently
>   >> with paste:
>   >> > paste( ce0a, m, sep="", collapse="")
>   >> [1] "1985<1>9<2>2<3>2<4>1<5>A<6>1<7><8>NA<9>5<0>1999END"
>   >> > paste( ce0, m, sep="", collapse="")
>   >> [1] "1985<1>9<2>2<3>2<4>1<5>A1<7>NA<9>5<0>1999END"
>   >> > ce0a
>   >> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
>   >> [11] "1999"
>   >> > ce0
>   >> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
>   >> [11] "1999"
>   >> 
>   >> I suspect there are some hidden attributes somewhere in ce0
>   >> which I have not noticed (there seem not to be factors). The
>   >> problem seems to arise with the non-numerical columns (ce0 is
>   >> just part of one row of the big data frame).  Is it possible to
>   >> figure it out, and possibly change it?  At least attributes()
>   >> shows nothing:
>   >> > attributes(ce0)
>   >> NULL
>   >> > attributes(ce0a)
>   >> NULL
> 
>   Peter> Hmmm. The plot would seem to thicken around the entries in
>   Peter> ce0 corresponding to <6> and <8>. If these accidentally
>   Peter> contain \0 characters, much would be explained. Maybe also
>   Peter> other weird characters.
> 
> As it happens, I think the problem is in the read.dta code. The relevant
> piece of code is in foreign/src/stataread.c (lines 317-324):
> 
>         default:
>             charlen = INTEGER(types)[j] - STATA_STRINGOFFSET;
>             PROTECT(tmp = allocString(charlen + 1));
>             InStringBinary(fp, charlen, CHAR(tmp));
>             CHAR(tmp)[charlen] = 0;
>             SET_STRING_ELT(VECTOR_ELT(df, j), i, tmp);
>             UNPROTECT(1);
>             break;
> 
> In this case the string "A" is written in the file as two bytes
> (I do not know why), with the second byte being '\0'. So the above
> code creates a CHARSXP of length 3 whose last two bytes are '\0'.
> 
>   Peter> What happens if you do nchar(ce0) ?  What if you omit the
>   Peter> collapse= argument?
> 
> nchar uses strlen - so it would return the length as 1.
> 
> By the way, by looking at the code for mkChar and paste, it seems that
> R is _not_ storing null terminated strings - mkChar only allocates
> storage for strlen(name) and not strlen(name)+1 and paste uses LENGTH
> to get the string length. At the same time strlen is used in
> do_nchar. Could there be a potential problem here? Maybe you should
> use strnlen in do_nchar?
> 

That part is OK: mkChar calls allocString, which calls
allocVector(CHARSXP, n), which computes size = BYTE2VEC(length + 1).
That is where the space for the terminating null gets tacked on.
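
For illustration only, here is a small stand-alone C sketch of the two
points above: an allocator that reserves length + 1 bytes so a terminating
null always fits, and a reader that follows the quoted stataread.c pattern
and so records a length larger than what strlen sees. The names toy_string,
toy_alloc_string, toy_read_field and toy_free_string are hypothetical and
are not part of R or the foreign package; the real code works on CHARSXP
objects, not on a struct like this.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a CHARSXP: a recorded length plus a buffer. */
typedef struct {
    size_t length;  /* what LENGTH() would report */
    char  *data;    /* length + 1 bytes; the extra byte holds the final '\0' */
} toy_string;

/* Reserve length + 1 zeroed bytes, so the terminator always has room. */
static toy_string *toy_alloc_string(size_t length)
{
    toy_string *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->data = calloc(length + 1, 1);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    s->length = length;
    return s;
}

static void toy_free_string(toy_string *s)
{
    if (s != NULL) {
        free(s->data);
        free(s);
    }
}

/* Follows the quoted reader: allocate charlen + 1, copy charlen bytes,
 * then write a null at index charlen. The recorded length is therefore
 * charlen + 1, whatever the bytes contain. */
static toy_string *toy_read_field(const char *bytes, size_t charlen)
{
    assert(bytes != NULL);
    toy_string *s = toy_alloc_string(charlen + 1);
    if (s == NULL)
        return NULL;
    memcpy(s->data, bytes, charlen);
    s->data[charlen] = '\0';
    return s;
}
```

With the two-byte field 'A', '\0' from this thread, toy_read_field records
a length of 3 while strlen on the buffer reports 1. Code that walks the
recorded length (as paste is said to do with LENGTH) and code that uses
strlen (as nchar is said to do) will then disagree about the same string.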

luke

-- 
Luke Tierney
University of Minnesota                      Phone:           612-625-7843
School of Statistics                         Fax:             612-624-8868
313 Ford Hall, 224 Church St. S.E.           email:      luke at stat.umn.edu
Minneapolis, MN 55455 USA                    WWW:  http://www.stat.umn.edu