[Rd] special handling of row.names
Romain François
romain at r-enthusiasts.com
Wed Apr 2 08:52:22 CEST 2014
Hello,
I think there is an inconsistency in the handling of the compact form of the row.names attributes.
When n is the number of rows of a data.frame, the compact form is c(NA_integer_,-n), as in:
> d <- data.frame(x=1:10)
> .Internal(inspect(d))
@104f174a8 19 VECSXP g0c1 [OBJ,NAM(2),ATT] (len=1, tl=0)
@103a7dc60 13 INTSXP g0c4 [] (len=10, tl=0) 1,2,3,4,5,...
ATTRIB:
@104959380 02 LISTSXP g0c0 []
TAG: @100823078 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "names" (has value)
@104f17748 16 STRSXP g0c1 [NAM(2)] (len=1, tl=0)
@10085c678 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
TAG: @10082d060 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "row.names" (has value)
@104f0e898 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-10
TAG: @100823548 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "class" (has value)
@104f0e8c8 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
@1008a7e60 09 CHARSXP g1c2 [MARK,gp=0x61,ATT] [ASCII] [cached] "data.frame"
But then, -10 becomes 10:
> d2 <- structure( d, class = "data.frame" )
> .Internal(inspect(d2))
@104f08898 19 VECSXP g0c1 [OBJ,NAM(2),ATT] (len=1, tl=0)
@103a7dc60 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...
ATTRIB:
@104956150 02 LISTSXP g0c0 []
TAG: @100823078 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "names" (has value)
@104f087a8 16 STRSXP g0c1 [] (len=1, tl=0)
@10085c678 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "x"
TAG: @10082d060 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "row.names" (has value)
@104f088c8 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,10
TAG: @100823548 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "class" (has value)
@104f08838 16 STRSXP g0c1 [] (len=1, tl=0)
@1008a7e60 09 CHARSXP g1c2 [MARK,gp=0x61,ATT] [ASCII] [cached] "data.frame"
This happens in row_names_gets (attrib.c), here:
if(OK_compact) {
/* we hide the length in an impossible integer vector */
PROTECT(val = allocVector(INTSXP, 2));
INTEGER(val)[0] = NA_INTEGER;
INTEGER(val)[1] = n;
ans = installAttrib(vec, R_RowNamesSymbol, val);
UNPROTECT(1);
return ans;
}
I believe it should be INTEGER(val)[1] = -n; for consistency.
BTW, perhaps structure should be internalized to prevent special handling of row.names when it does not make sense. Here is structure:
structure
function (.Data, ...)
{
attrib <- list(...)
if (length(attrib)) {
specials <- c(".Dim", ".Dimnames", ".Names", ".Tsp",
".Label")
replace <- c("dim", "dimnames", "names", "tsp", "levels")
m <- match(names(attrib), specials)
ok <- (!is.na(m) & m)
names(attrib)[ok] <- replace[m[ok]]
if ("factor" %in% attrib[["class", exact = TRUE]] &&
typeof(.Data) == "double")
storage.mode(.Data) <- "integer"
attributes(.Data) <- c(attributes(.Data), attrib)
}
return(.Data)
}
When I do structure( d, class = "data.frame" ), eventually this line is executed:
attributes(.Data) <- c(attributes(.Data), attrib)
So, first attributes are retrieved, and because of R special handling of row.names, it gets promoted to 1:n in getAttrib.
Then, we want to set the row.names attribute, special handling again in setAttrib, leading to row_names_gets, R actually loops over the attribute to check if it is of the form 1:n, and if it is it brings back the compact form (making a mistake along the way).
This looks like a waste of resources.
Romain
More information about the R-devel
mailing list