[Rd] as.data.frame peculiarities

Tue Mar 31 21:24:12 CEST 2009

Stavros Macrakis wrote:
> The documentation of as.data.frame is not explicit about how it generates
> column names for the simple vector case, but it seems to use the character
> form of the quoted argument, e.g.
>
> names(as.data.frame(1:3))
> [1] "1:3"
>
> But there is a strange case:
>
> names(as.data.frame(c("a")))
> [1] "if (stringsAsFactors) factor(x) else x"
>
>   

gosh!  you don't even need the c():

    names(as.data.frame(''))
    # same as above

i thought you don''t even need the '', but then you're served with the
following highly informative message:

    names(as.data.frame())
    # Error in as.data.frame() :
    #   element 1 is empty;
    #    the part of the args list of 'is.null' being evaluated was:
    #    (x)

which actually comes from as.data.frame().

> I feel fairly comfortable calling this a bug, though there is no explicit
> specification.
>   

maybe there is none so that it can always be claimed that you deal with
an intentional, but not (yet) documented feature, rather than a bug.

let's investigate this feature.  in

    names(as.data.frame('a'))

as.data.frame is generic, 'a' is character, thus
as.data.frame.character(x, ...)  is called with x = 'a'.  here's  the
code for as.data.frame.character:

    function (x, ..., stringsAsFactors = default.stringsAsFactors())
        as.data.frame.vector(if (stringsAsFactors) factor(x) else x, ...)

and the as.data.frame.vector it calls:

    function (x, row.names = NULL, optional = FALSE, ...)
    {
        nrows <- length(x)
        nm <- paste(deparse(substitute(x), width.cutoff = 500L),
            collapse = " ")
        if (is.null(row.names)) {
            if (nrows == 0L)
                row.names <- character(0L)
            else if (length(row.names <- names(x)) == nrows &&
!any(duplicated(row.names))) {
            }
            else row.names <- .set_row_names(nrows)
        }
        names(x) <- NULL
        value <- list(x)
        if (!optional)
            names(value) <- nm
        attr(value, "row.names") <- row.names
        class(value) <- "data.frame"
        value
    }

watch carefully:  nm = paste(deparse(substitute(x)), width.cutoff=500L),
that is:

    nm = paste("if (stringsAsFactors) factor(x) else x", width.cutoff=500L)

x = factor('a'), row.names==NULL, names(x)==NULL, and nrows = 1, and
thus row.names = .set_row_names(1) = c(NA, -1)  (interesting; see
.set_row_names).

and then we have:

    x = factor('a') # the input
    names(x) = NULL
    value = list(x) # value == list(factor('a'))
    names(value) = "if (stringsAsFactors) factor(x) else x" # the value
of nm
    attr(value, 'row.names') = c(NA, -1) # the value of row.names
    class(value) = 'data.frame'
    value

here you go:  as some say, the answer is always in the code.  that's how
ugly hacks with deparse/substitute lead r core developers to produce
ugly bugs.  very useful, indeed.

> There is another strange case which I don't understand.
>
> The specification of 'optional' is:
>
>    optional: logical. If 'TRUE', setting row names and converting column
>           names (to syntactic names: see 'make.names') is optional.
>
> I am not sure what this means and why it is useful.  In practice, it seems
> to produce a structure of class data.frame which exhibits some very odd
> behavior:
>
>   
>> d <- as.data.frame(c("a"),optional=TRUE)
>> class(d)
>>     
> [1] "data.frame"
>   
>> d
>>     
>   structure("a", class = "AsIs")                   <<< where does this
> column name come from?
> 1                              a'
>   

gosh...  rtfc, again; code as above, but this time optional=TRUE so
names(value) = nm does not apply:

    x = factor('a') # the input
    names(x) = NULL
    value = list(x) # value == list(factor('a'))
    attr(value, 'row.names') = c(NA, -1) # the value of row.names
    class(value) = 'data.frame'
    value

here you go.

>> names(d)
>>     
> NULL                                            <<< not from names()
>   

yes, because it was explicitly set to NULL, second line above.

>> dput(d)
>>     
> structure(list(structure(1L, .Label = "a", class = "factor")), row.names =
> c(NA,
> -1L), class = "data.frame")            <<< and it doesn't show up in dput
>   

yes, because there are no names there!  it's format.data.frame, called
from print.data.frame, called from print(value), that makes up this
column name;  rtfc.

seems like there's a need for post-implementation design.

for the desserts, here's another curious, somewhat related example:

    data = data.frame(1)
    row.names(data) = TRUE
    data
    #      X1
    # TRUE  1

    as.data.frame(1, row.names=TRUE)
    # Error in attr(value, "row.names") <- row.names :
    #   row names must be 'character' or 'integer', not 'logical'

probably not a bug, because ?as.data.frame says:

"
row.names: 'NULL' or a character vector giving the row names for the
          data frame.  Missing values are not allowed.
"

so it's rather a design flaw.  much harder to fix in r.

best,
vQ