[Rd] [R] "[.data.frame" and lapply

Sat Mar 28 19:47:20 CET 2009

Romain Francois wrote:
> Wacek Kusnierczyk wrote:
>> redirected to r-devel, because there are implementational details of
>> [.data.frame discussed here.  spoiler: at the bottom there is a fairly
>> interesting performance result.
>>
>> Romain Francois wrote:
>>  
>>> Hi,
>>>
>>> This is a bug I think. [.data.frame treats its arguments differently
>>> depending on the number of arguments.
>>>     
>>
>> you might want to hesitate a bit before you say that something in r is a
>> bug, if only because it drives certain people mad.  r is carefully
>> tested software, and [.data.frame is such a basic function that if what
>> you talk about were a bug, it wouldn't have persisted until now.
>>   
> I did hesitate, and would be prepared to look the other way if someone
> shows me proper evidence that this makes sense.
>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > d[ j=1 ]
>    x  y  z
> 1   1  1  1
> 2   2  2  2
> 3   3  3  3
> 4   4  4  4
> 5   5  5  5
> 6   6  6  6
> 7   7  7  7
> 8   8  8  8
> 9   9  9  9
> 10 10 10 10
>
> "If a single index is supplied, it is interpreted as indexing the list
> of columns". Clearly this does not happen here, and this is because
> NextMethod gets confused.

obviously.  it seems that there is a bug here, and that it results from
the lack of clear design specification.

>
> I have not looked at your implementation in detail, but it misses array
> indexing, as in:

yes;  i didn't take it into consideration, but (still without detailed
analysis) i guess it should not be difficult to extend the code to
handle this.

>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > m <- cbind( 5:7, 1:3 )
> > m
>     [,1] [,2]
> [1,]    5    1
> [2,]    6    2
> [3,]    7    3
> > d[m]
> [1] 5 6 7
> > subdf( d, m )
> Error in subdf(d, m) : undefined columns selected

this should be easy to handle by checking if i is a matrix and then
indexing by its first column as i and the second as j.

>
> "Matrix indexing using '[' is not recommended, and barely
>     supported.  For extraction, 'x' is first coerced to a matrix. For
>     replacement a logical matrix (only) can be used to select the
>     elements to be replaced in the same way as for a matrix."

yes, here's how it's done (original comment):

    if(is.matrix(i))
        return(as.matrix(x)[i])  # desperate measures

and i can easily add this to my code, at virtually no additional expense.

it's probably not a good idea to convert x to a matrix:  x would often hold
much more data than the index matrix m, so it's presumably much more
efficient, on average, to fiddle with i instead.
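
a minimal sketch of what "fiddling with i" could look like, assuming a
two-column (row, column) numeric index matrix.  the helper name
elements_by_matrix is hypothetical;  it is not part of subdf or of base r,
and classed columns such as factors would need more thought:

```r
# hypothetical helper (not part of subdf or base r): element-wise extraction
# from a data frame by a two-column (row, column) index matrix, without
# converting the whole data frame to a matrix
elements_by_matrix = function(x, m) {
    stopifnot(is.data.frame(x), is.matrix(m), ncol(m) == 2)
    # pick one element per row of m: column m[k, 2], then row m[k, 1]
    picked = lapply(seq_len(nrow(m)), function(k) x[[m[k, 2]]][m[k, 1]])
    # unlist coerces to a common type across the selected columns only
    unlist(picked)
}

d = data.frame(x = 1:10, y = 1:10, z = 1:10)
m = cbind(5:7, 1:3)
elements_by_matrix(d, m)
# 5 6 7, the same values as as.matrix(d)[m]
```

only the columns named in m are touched, so the cost depends on the size of
m and not on the size of d.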

there are some potentially confusing issues here:

    m = cbind(8:10, 1:3)

    d[m]
    # 3-element vector, as you could expect

    d[t(m)]
    # 6-element vector

t(m) has dimensionality inappropriate for matrix indexing (it has 3
columns), so it gets flattened into a vector;  however, it does not work
like in the case of a single vector index where columns would be selected:

    d[as.vector(t(m))]
    # error: undefined columns selected

i think it would be more appropriate to raise an error in a case like
d[t(m)].
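
one way such a check might look, as a sketch only.  check_matrix_index is a
hypothetical guard, not something [.data.frame does;  it rejects numeric
index matrices that do not have exactly two columns:

```r
# hypothetical guard, not current [.data.frame behaviour
check_matrix_index = function(i) {
    if (is.matrix(i) && is.numeric(i) && ncol(i) != 2)
        stop("a numeric matrix index must have exactly two columns (row, column)")
    invisible(i)
}

m = cbind(8:10, 1:3)
check_matrix_index(m)
# passes silently

res = tryCatch(check_matrix_index(t(m)), error = function(e) conditionMessage(e))
res
# the error message, since t(m) has 3 columns
```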

furthermore, if a matrix is used in a two-index form, the matrix is
flattened again and is used to select rows (not elements, as in
d[t(m)]).  note also that the help page says that "for extraction, 'x'
is first coerced to a matrix".  it fails to explain that if *two*
indices are used of which at least one is a matrix, no coercion is
done.  that is, the matrix is again flattened into a vector, but here
[.data.frame forgets that it was a matrix (unlike in d[t(m)]):

    is(d[m])
    # a character vector, matrix indexing

    is(d[t(m)])
    # a character vector, vector indexing of elements, not columns

    is(d[m,])
    # a data frame, row indexing

and finally, the fact that d[m] converts x (i.e., d) to a matrix
before the indexing means that the values in some columns of
d may get coerced to another type:

    d[,2] = as.character(d[,2])
    is(d[,1])
    # integer vector
    is(d[,2])
    # character vector

    is(d[1:2, 1])
    # integer vector
    is(d[cbind(1:2, 1)])
    # character vector

for what it's worth, i think matrix indexing of data frames should be
dropped:

    d[m]
    # error: ...

and if one needs it, it's as simple as

    as.matrix(d)[m]

where the conversion of d to a matrix is explicit.

on the side, [.data.frame is able to index matrices:

    '[.data.frame'(as.matrix(d), m)
    # same as as.matrix(d)[m]

which is, so to speak, nonsense, since '[.data.frame' is designed
specifically to handle data frames;  i'd expect an error to be raised
here (or a warning, at the very least).
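
a sketch of the kind of check i have in mind, written as a wrapper so as not
to touch [.data.frame itself.  strict_columns is a hypothetical name, and the
single-index form is assumed for simplicity:

```r
# hypothetical strict wrapper, not base r behaviour:
# refuse anything that is not a data frame before indexing
strict_columns = function(x, i) {
    if (!is.data.frame(x))
        stop("expected a data frame, got an object of class ", class(x)[1])
    x[i]
}

d = data.frame(x = 1:10, y = 1:10, z = 1:10)
m = cbind(8:10, 1:3)

names(strict_columns(d, 2))
# "y"

err = tryCatch(strict_columns(as.matrix(d), m), error = function(e) e)
inherits(err, "error")
# TRUE, the matrix is rejected instead of being indexed
```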

to summarize, the fact that subdf does not handle matrix indices is not
an issue.  anyway, thanks for the comment!

best,
vQ