[Rd] as.data.frame peculiarities
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Tue Mar 31 21:24:12 CEST 2009
Stavros Macrakis wrote:
> The documentation of as.data.frame is not explicit about how it generates
> column names for the simple vector case, but it seems to use the character
> form of the quoted argument, e.g.
>
> names(as.data.frame(1:3))
> [1] "1:3"
>
> But there is a strange case:
>
> names(as.data.frame(c("a")))
> [1] "if (stringsAsFactors) factor(x) else x"
>
>
gosh! you don't even need the c():
names(as.data.frame(''))
# same as above
i thought you don''t even need the '', but then you're served with the
following highly informative message:
names(as.data.frame())
# Error in as.data.frame() :
# element 1 is empty;
# the part of the args list of 'is.null' being evaluated was:
# (x)
which actually comes from as.data.frame().
> I feel fairly comfortable calling this a bug, though there is no explicit
> specification.
>
maybe there is none so that it can always be claimed that you deal with
an intentional, but not (yet) documented feature, rather than a bug.
let's investigate this feature. in
names(as.data.frame('a'))
as.data.frame is generic, 'a' is character, thus
as.data.frame.character(x, ...) is called with x = 'a'. here's the
code for as.data.frame.character:
function (x, ..., stringsAsFactors = default.stringsAsFactors())
as.data.frame.vector(if (stringsAsFactors) factor(x) else x, ...)
and the as.data.frame.vector it calls:
function (x, row.names = NULL, optional = FALSE, ...)
{
nrows <- length(x)
nm <- paste(deparse(substitute(x), width.cutoff = 500L),
collapse = " ")
if (is.null(row.names)) {
if (nrows == 0L)
row.names <- character(0L)
else if (length(row.names <- names(x)) == nrows &&
!any(duplicated(row.names))) {
}
else row.names <- .set_row_names(nrows)
}
names(x) <- NULL
value <- list(x)
if (!optional)
names(value) <- nm
attr(value, "row.names") <- row.names
class(value) <- "data.frame"
value
}
watch carefully: nm = paste(deparse(substitute(x)), width.cutoff=500L),
that is:
nm = paste("if (stringsAsFactors) factor(x) else x", width.cutoff=500L)
x = factor('a'), row.names==NULL, names(x)==NULL, and nrows = 1, and
thus row.names = .set_row_names(1) = c(NA, -1) (interesting; see
.set_row_names).
and then we have:
x = factor('a') # the input
names(x) = NULL
value = list(x) # value == list(factor('a'))
names(value) = "if (stringsAsFactors) factor(x) else x" # the value
of nm
attr(value, 'row.names') = c(NA, -1) # the value of row.names
class(value) = 'data.frame'
value
here you go: as some say, the answer is always in the code. that's how
ugly hacks with deparse/substitute lead r core developers to produce
ugly bugs. very useful, indeed.
> There is another strange case which I don't understand.
>
> The specification of 'optional' is:
>
> optional: logical. If 'TRUE', setting row names and converting column
> names (to syntactic names: see 'make.names') is optional.
>
> I am not sure what this means and why it is useful. In practice, it seems
> to produce a structure of class data.frame which exhibits some very odd
> behavior:
>
>
>> d <- as.data.frame(c("a"),optional=TRUE)
>> class(d)
>>
> [1] "data.frame"
>
>> d
>>
> structure("a", class = "AsIs") <<< where does this
> column name come from?
> 1 a'
>
gosh... rtfc, again; code as above, but this time optional=TRUE so
names(value) = nm does not apply:
x = factor('a') # the input
names(x) = NULL
value = list(x) # value == list(factor('a'))
attr(value, 'row.names') = c(NA, -1) # the value of row.names
class(value) = 'data.frame'
value
here you go.
>> names(d)
>>
> NULL <<< not from names()
>
yes, because it was explicitly set to NULL, second line above.
>> dput(d)
>>
> structure(list(structure(1L, .Label = "a", class = "factor")), row.names =
> c(NA,
> -1L), class = "data.frame") <<< and it doesn't show up in dput
>
yes, because there are no names there! it's format.data.frame, called
from print.data.frame, called from print(value), that makes up this
column name; rtfc.
seems like there's a need for post-implementation design.
for the desserts, here's another curious, somewhat related example:
data = data.frame(1)
row.names(data) = TRUE
data
# X1
# TRUE 1
as.data.frame(1, row.names=TRUE)
# Error in attr(value, "row.names") <- row.names :
# row names must be 'character' or 'integer', not 'logical'
probably not a bug, because ?as.data.frame says:
"
row.names: 'NULL' or a character vector giving the row names for the
data frame. Missing values are not allowed.
"
so it's rather a design flaw. much harder to fix in r.
best,
vQ
More information about the R-devel
mailing list