[Rd] [R] "[.data.frame" and lapply
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Sat Mar 28 19:47:20 CET 2009
Romain Francois wrote:
> Wacek Kusnierczyk wrote:
>> redirected to r-devel, because there are implementational details of
>> [.data.frame discussed here. spoiler: at the bottom there is a fairly
>> interesting performance result.
>>
>> Romain Francois wrote:
>>
>>> Hi,
>>>
>>> This is a bug I think. [.data.frame treats its arguments differently
>>> depending on the number of arguments.
>>>
>>
>> you might want to hesitate a bit before you say that something in r is a
>> bug, if only because it drives certain people mad. r is a carefully
>> tested software, and [.data.frame is such a basic function that if what
>> you talk about were a bug, it wouldn't have persisted until now.
>>
> I did hesitate, and would be prepared to look the other way if someone
> shows me proper evidence that this makes sense.
>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > d[ j=1 ]
> x y z
> 1 1 1 1
> 2 2 2 2
> 3 3 3 3
> 4 4 4 4
> 5 5 5 5
> 6 6 6 6
> 7 7 7 7
> 8 8 8 8
> 9 9 9 9
> 10 10 10 10
>
> "If a single index is supplied, it is interpreted as indexing the list
> of columns". Clearly this does not happen here, and this is because
> NextMethod gets confused.
obviously. it seems that there is a bug here, and that it results from
the lack of clear design specification.
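
to illustrate how dispatching on the number of arguments can produce the
result above, here is a toy sketch. toy_index is a hypothetical stand-in,
not the actual '[.data.frame' code; it only assumes that the method branches
on nargs() and on whether i is missing:

```r
# toy stand-in for an indexing method; NOT the actual '[.data.frame' code
toy_index <- function(x, i, j) {
  # nargs() counts the arguments supplied in the call, including blank ones
  if (nargs() < 3L) {
    if (missing(i)) return("whole object")
    return("list-style: i selects columns")
  }
  "matrix-style: i selects rows, j selects columns"
}

d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
r1 <- toy_index(d, 1)      # two arguments, i supplied
r2 <- toy_index(d, j = 1)  # two arguments, i missing
r3 <- toy_index(d, , 1)    # three arguments, i left blank
```

under these assumptions a call like d[j=1] supplies only two arguments with
i missing, so it falls into the branch that hands back the whole object,
while d[,1] supplies three.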
>
> I have not looked at your implementation in detail, but it misses array
> indexing, as in:
yes; i didn't take it into consideration, but (still without detailed
analysis) i guess it should not be difficult to extend the code to
handle this.
>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > m <- cbind( 5:7, 1:3 )
> > m
> [,1] [,2]
> [1,] 5 1
> [2,] 6 2
> [3,] 7 3
> > d[m]
> [1] 5 6 7
> > subdf( d, m )
> Error in subdf(d, m) : undefined columns selected
this should be easy to handle by checking if i is a matrix and then
indexing by its first column as i and the second as j.
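
a minimal sketch of that idea, under stated assumptions: elements_by_matrix
is a hypothetical helper (it is not part of subdf), the index is a numeric
two-column matrix of (row, column) pairs, and the columns of the data frame
are plain atomic vectors (no factors):

```r
# hypothetical helper: pick single elements of a data frame by a two-column
# (row, column) index matrix, without converting the data frame to a matrix
elements_by_matrix <- function(x, m) {
  stopifnot(is.data.frame(x), is.matrix(m), is.numeric(m), ncol(m) == 2L)
  rows <- m[, 1L]   # first column of the index: row numbers
  cols <- m[, 2L]   # second column of the index: column numbers
  # extract x[[col]][row] for every pair; unlist() coerces only the
  # selected values to a common type
  unlist(Map(function(r, k) x[[k]][r], rows, cols), use.names = FALSE)
}

d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(5:7, 1:3)
res <- elements_by_matrix(d, m)
```

only the index is taken apart here; the data frame itself is never copied
into a matrix.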
>
> "Matrix indexing using '[' is not recommended, and barely
> supported. For extraction, 'x' is first coerced to a matrix. For
> replacement a logical matrix (only) can be used to select the
> elements to be replaced in the same way as for a matrix."
yes, here's how it's done (original comment):
if(is.matrix(i))
    return(as.matrix(x)[i]) # desperate measures
and i can easily add this to my code, at virtually no additional expense.
however, it's probably not a good idea to convert x to a matrix: x will
often hold much more data than the index matrix m, so it's presumably
much more efficient, on average, to fiddle with i instead.
there are some potentially confusing issues here:
m = cbind(8:10, 1:3)
d[m]
# 3-element vector, as you could expect
d[t(m)]
# 6-element vector
t(m) has dimensionality inappropriate for matrix indexing (it has 3
columns), so it gets flattened into a vector; however, it does not work
like in the case of a single vector index where columns would be selected:
d[as.vector(t(m))]
# error: undefined columns selected
i think it would be more appropriate to raise an error in a case like
d[t(m)].
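
a sketch of what such a check could look like, as a wrapper around the
existing method. strict_index is a hypothetical name, and the rule it
enforces (a numeric matrix index must have exactly two columns) is my
assumption about what "appropriate" should mean here:

```r
# hypothetical wrapper: reject numeric matrix indices that are not
# two-column (row, column) matrices instead of silently flattening them
strict_index <- function(x, i) {
  if (is.matrix(i) && is.numeric(i) && ncol(i) != 2L)
    stop("numeric matrix index must have exactly two columns (row, column)")
  x[i]
}

d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(8:10, 1:3)
ok <- strict_index(d, m)   # two columns: passed on to d[m]
bad <- tryCatch(strict_index(d, t(m)),
                error = function(e) conditionMessage(e))
```

logical matrices are deliberately left alone, since those are used as
masks of the same shape as the data rather than as (row, column) pairs.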
furthermore, if a matrix is used in a two-index form, the matrix is
flattened again and is used to select rows (not elements, as in
d[t(m)]). note also that the help page says that "for extraction, 'x'
is first coerced to a matrix". it fails to explain that if *two*
indices are used of which at least one is a matrix, no coercion is
done. that is, the matrix is again flattened into a vector, but here
[.data.frame forgets that it was a matrix (unlike in d[t(m)]):
is(d[m])
# an integer vector (all columns of d are still integer here), matrix indexing
is(d[t(m)])
# an integer vector, vector indexing of elements, not columns
is(d[m,])
# a data frame, row indexing
and finally, the fact that d[m] in fact converts x (i.e., d) to a matrix
before the indexing means that the values in some columns of d may get
coerced to another type:
d[,2] = as.character(d[,2])
is(d[,1])
# integer vector
is(d[,2])
# character vector
is(d[1:2, 1])
# integer vector
is(d[cbind(1:2, 1)])
# character vector
for what it's worth, i think matrix indexing of data frames should be
dropped:
d[m]
# error: ...
and if one needs it, it's as simple as
as.matrix(d)[m]
where the conversion of d to a matrix is explicit.
on the side, [.data.frame is able to index matrices:
'[.data.frame'(as.matrix(d), m)
# same as as.matrix(d)[m]
which is, so to speak, nonsense, since '[.data.frame' is designed
specifically to handle data frames; i'd expect an error to be raised
here (or a warning, at the very least).
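
a sketch of the kind of guard i have in mind. df_subset is a hypothetical
wrapper, not a proposed patch to '[.data.frame' itself; it just refuses
anything that is not a data frame before delegating to the usual indexing:

```r
# hypothetical wrapper: refuse to index objects that are not data frames
df_subset <- function(x, ...) {
  if (!is.data.frame(x))
    stop("'x' must be a data frame, got an object of class ", class(x)[1L])
  x[...]
}

d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(8:10, 1:3)
from_df <- df_subset(d, m)   # a data frame: indexed as usual
from_matrix <- tryCatch(df_subset(as.matrix(d), m),
                        error = function(e) "error raised")
```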
to summarize, the fact that subdf does not handle matrix indices is not
an issue. anyway, thanks for the comment!
best,
vQ