[R] more on paste and bug
Luke Tierney
luke at nokomis.stat.umn.edu
Wed Oct 10 21:28:39 CEST 2001
On Wed, Oct 10, 2001 at 02:14:33PM -0500, Saikat DebRoy wrote:
> >>>>> "Peter" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
>
> Peter> Ott Toomet <siim at obs.ee> writes:
> >> Hi,
> >>
> >> dput( ce0) gives a correct answer:
> >> > dput( ce0)
> >> c("1985", "9", "2", "2", "1", "A", "1", "", "NA", "5", "1999" )
> >>
> >> Plain print( ce0) gives the same:
> >> > print( ce0)
> >> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
> >> [11] "1999"
> >>
> >> However, if I make a new similar vector ce0a:
> >> > ce0a <- c( 1985,9,2,2,1,"A",1,"",NA,5,1999)
> >>
> >> Then the paste works correctly:
> >> > paste( ce0a, m, sep="", collapse="")
> >> [1] "1985<1>9<2>2<3>2<4>1<5>A<6>1<7><8>NA<9>5<0>1999END"
> >>
> >> I had m as
> >> > m
> >> [1] "<1>" "<2>" "<3>" "<4>" "<5>" "<6>" "<7>" "<8>" "<9>" "<0>" "END"
> >>
> >> So I have two apparently similar vectors which behave differently
> >> with paste:
> >> > paste( ce0a, m, sep="", collapse="")
> >> [1] "1985<1>9<2>2<3>2<4>1<5>A<6>1<7><8>NA<9>5<0>1999END"
> >> > paste( ce0, m, sep="", collapse="")
> >> [1] "1985<1>9<2>2<3>2<4>1<5>A1<7>NA<9>5<0>1999END"
> >> > ce0a
> >> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
> >> [11] "1999"
> >> > ce0
> >> [1] "1985" "9" "2" "2" "1" "A" "1" "" "NA" "5"
> >> [11] "1999"
> >>
> >> I suspect there may be some hidden attributes somewhere in ce0
> >> which I have not noticed (there seem to be no factors). The
> >> problem seems to arise with the non-numerical columns (ce0 is
> >> just part of one row of the big data frame). Is it possible to
> >> figure this out, and possibly change it? At least attributes()
> >> shows nothing:
> >> > attributes(ce0)
> >> NULL
> >> > attributes(ce0a)
> >> NULL
>
> Peter> Hmmm. The plot would seem to thicken around the entries in
> Peter> ce0 corresponding to <6> and <8>. If these accidentally
> Peter> contain \0 characters, much would be explained. Maybe also
> Peter> other weird characters.
>
> As it happens, I think the problem is in the read.dta code. The relevant
> piece of code is in foreign/src/stataread.c (lines 317-324):
>
>     default:
>         charlen=INTEGER(types)[j]-STATA_STRINGOFFSET;
>         PROTECT(tmp=allocString(charlen+1));
>         InStringBinary(fp,charlen,CHAR(tmp));
>         CHAR(tmp)[charlen]=0;
>         SET_STRING_ELT(VECTOR_ELT(df,j),i,tmp);
>         UNPROTECT(1);
>         break;
>
> As it happens, in this case the string "A" is written in the file
> as two bytes (I do not know why), the second byte being '\0'.
> So the above code creates a CHARSXP of length 3 whose last two bytes
> are '\0'.
>
> Peter> What happens if you do nchar(ce0) ? What if you omit the
> Peter> collapse= argument?
>
> nchar uses strlen - so it would return the length as 1.
>
> By the way, by looking at the code for mkChar and paste, it seems that
> R is _not_ storing null terminated strings - mkChar only allocates
> storage for strlen(name) and not strlen(name)+1 and paste uses LENGTH
> to get the string length. At the same time strlen is used in
> do_nchar. Could there be a potential problem here? Maybe you should
> use strnlen in do_nchar?
>
It's OK--mkChar calls allocString, which calls allocVector(CHARSXP, n)
which does size = BYTE2VEC(length + 1) -- that is where the space for
the null gets tacked on.
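
To make the two points in this thread concrete, here is a minimal sketch
in plain C. It is not R's source: counted_str and the cs_* helpers are
hypothetical stand-ins for a CHARSXP, allocString/mkChar, the reader code
quoted above, and a LENGTH-based paste. It only assumes what the thread
states: storage is sized as length + 1, and the reader stores a field of
charlen bytes in a string of recorded length charlen + 1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a CHARSXP: a recorded length plus bytes. */
typedef struct {
    size_t length;  /* what a LENGTH-style accessor would report */
    char  *bytes;   /* length + 1 bytes: payload plus terminating NUL */
} counted_str;

/* Allocate `length` payload bytes plus one extra byte for the NUL,
   mirroring the "length + 1" sizing described above. */
counted_str cs_alloc(size_t length)
{
    counted_str s;
    s.length = length;
    s.bytes = calloc(length + 1, 1);  /* zero-filled: bytes[length] == 0 */
    if (s.bytes == NULL)
        abort();
    return s;
}

/* mkChar-like path: recorded length is strlen(src); the NUL still fits. */
counted_str cs_from_cstring(const char *src)
{
    assert(src != NULL);
    counted_str s = cs_alloc(strlen(src));
    memcpy(s.bytes, src, s.length);
    return s;
}

/* Reader-like path: a fixed-width field of `width` bytes is stored in a
   string whose recorded length is width + 1. */
counted_str cs_from_field(const char *field, size_t width)
{
    assert(field != NULL);
    counted_str s = cs_alloc(width + 1);
    memcpy(s.bytes, field, width);
    return s;
}

/* Concatenate by recorded length, as a LENGTH-based paste would;
   embedded NUL bytes are copied along with everything else. */
counted_str cs_concat(counted_str a, counted_str b)
{
    counted_str s = cs_alloc(a.length + b.length);
    memcpy(s.bytes, a.bytes, a.length);
    memcpy(s.bytes + a.length, b.bytes, b.length);
    return s;
}
```

With the two-byte field "A\0", cs_from_field records a length of 3 while
strlen sees 1. Concatenating that with "<6>" records a length of 6, but
anything that stops at the first NUL still sees only "A", which would be
consistent with the "<6>" that goes missing in the paste output above.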
luke
--
Luke Tierney
University of Minnesota Phone: 612-625-7843
School of Statistics Fax: 612-624-8868
313 Ford Hall, 224 Church St. S.E. email: luke at stat.umn.edu
Minneapolis, MN 55455 USA WWW: http://www.stat.umn.edu