[Rd] [R] "[.data.frame" and lapply

Sat Mar 28 11:09:52 CET 2009

Wacek Kusnierczyk wrote:
> redirected to r-devel, because there are implementational details of
> [.data.frame discussed here.  spoiler: at the bottom there is a fairly
> interesting performance result.
>
> Romain Francois wrote:
>   
>> Hi,
>>
>> This is a bug I think. [.data.frame treats its arguments differently
>> depending on the number of arguments.
>>     
>
> you might want to hesitate a bit before you say that something in r is a
> bug, if only because it drives certain people mad.  r is a carefully
> tested software, and [.data.frame is such a basic function that if what
> you talk about were a bug, it wouldn't have persisted until now.
>   
I did hesitate, and would be prepared to look the other way if someone 
shows me proper evidence that this makes sense.

 > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
 > d[ j=1 ]
    x  y  z
1   1  1  1
2   2  2  2
3   3  3  3
4   4  4  4
5   5  5  5
6   6  6  6
7   7  7  7
8   8  8  8
9   9  9  9
10 10 10 10

"If a single index is supplied, it is interpreted as indexing the list 
of columns". Clearly this does not happen here, and this is because 
NextMethod gets confused.
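
The single-index rule quoted above can be checked side by side with the named-j call. What follows is only a minimal sketch, not a definitive test: the variable names are illustrative, and suppressWarnings() is an assumption added because some R versions warn about named arguments to '['.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)

# One unnamed index: treated as an index into the list of columns.
one_index <- d[1]
names(one_index)            # "x"

# One index named 'j': compare how many columns come back.
named_j <- suppressWarnings(d[j = 1])
ncol(named_j)               # 3 in the session shown above, not 1

# An unambiguous spelling of "first column, keep a data frame".
by_comma <- d[, 1, drop = FALSE]
names(by_comma)
```

If ncol(named_j) is larger than 1, the lone named j was not used to select columns, which is the behaviour reported here.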

I have not looked at your implementation in detail, but it misses array 
indexing, as in:

 > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
 > m <- cbind( 5:7, 1:3 )
 > m
     [,1] [,2]
[1,]    5    1
[2,]    6    2
[3,]    7    3
 > d[m]
[1] 5 6 7
 > subdf( d, m )
Error in subdf(d, m) : undefined columns selected

"Matrix indexing using '[' is not recommended, and barely
     supported.  For extraction, 'x' is first coerced to a matrix. For
     replacement a logical matrix (only) can be used to select the
     elements to be replaced in the same way as for a matrix."
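
The two halves of that help text can be tried out as follows. This is a small sketch under the assumption that the documented behaviour holds in the R version at hand; d2 and the threshold 8 are made up for illustration.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(5:7, 1:3)   # each row of m is a (row, column) pair

# Extraction: the data frame is coerced to a matrix first, so these
# two expressions are expected to agree.
via_df     <- d[m]
via_matrix <- as.matrix(d)[m]

# Replacement: a logical matrix of the same shape selects the cells.
d2 <- d
d2[d2 > 8] <- 0L
d2
```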

You might also want to look at `[<-.data.frame`.

 > d[j=2] <- 1:10
Error in `[<-.data.frame`(`*tmp*`, j = 2, value = 1:10) :
  element 1 is empty;
   the part of the args list of 'is.logical' being evaluated was:
   (i)
 > d[2] <- 10:1
 > d
    x  y  z
1   1 10  1
2   2  9  2
3   3  8  3
4   4  7  4
5   5  6  5
6   6  5  6
7   7  4  7
8   8  3  8
9   9  2  9
10 10  1 10

This is probably less of an issue, because there is very little chance 
for people to use this construct, but for the first one, if not used 
directly, it still has good chances to be used within some fooapply 
call, as in the original post. Although it might have been preferable to 
use subset as the applied function.
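
To make the fooapply scenario concrete, here is a hedged sketch; the list dfs is hypothetical and is not the data from the original post. It contrasts a positional index, subset() with 'select', and j passed by name through lapply.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
dfs <- list(a = d, b = d)   # hypothetical list of data frames

# Positional single index: selects the first column of each element.
first_cols <- lapply(dfs, "[", 1)

# subset() with 'select' states the intent explicitly.
first_cols_subset <- lapply(dfs, subset, select = 1)

# Passing j by name reaches '[' as a lone named index, i.e. the
# problematic case; suppressWarnings() guards against versions of R
# that warn about named arguments to '['.
named_j <- suppressWarnings(lapply(dfs, "[", j = 1))
sapply(named_j, ncol)
```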

Romain
> treating the arguments differently depending on their number is actually
> (if clearly...) documented:  if there is one index (the 'i'), it selects
> columns.  if there are two, 'i' selects rows.
>
> however, not all seems fine, there might be a design flaw:
>
>     # dummy data frame
>     d = structure(names=paste('col', 1:3, sep='.'),
>         data.frame(row.names=paste('row', 1:3, sep='.'),
>            matrix(1:9, 3, 3)))
>
>     d[1:2]
>     # correctly selects two first columns
>     # 1:2 passed to [.data.frame as i, no j given
>
>     d[,1:2]
>     # correctly selects two first columns
>     # 1:2 passed to [.data.frame as j, i given the missing argument
>     # value (note the comma)
>
>     d[,i=1:2]
>     # correctly selects two first rows
>     # 1:2 passed to [.data.frame as i, j given the missing argument
>     # value (note the comma)
>
>     d[j=1:2,]
>     # correctly selects two first columns
>     # 1:2 passed to [.data.frame as j, i given the missing argument
>     # value (note the comma)
>
>     d[i=1:2]
>     # correctly (arguably) selects the first two columns
>     # 1:2 passed to [.data.frame as i, no j given
>   
>     d[j=1:2]
>     # wrong: returns the whole data frame
>     # does not recognize the index as i because it is explicitly named 'j'
>     # does not recognize the index as j because there is only one index
>
> i say this *might* be a design flaw because it's hard to judge what the
> design really is.  the r language definition (!) [1, sec. 3.4.3 p. 18] says:
>
> "   The most important example of a class method for [ is that used for
> data frames. It is not be described in detail here (see the help page
> for [.data.frame, but in broad terms, if two indices are supplied (even
> if one is empty) it creates matrix-like indexing for a structure that is
> basically a list of vectors of the same length. If a single index is
> supplied, it is interpreted as
> indexing the list of columns—in that case the drop argument is ignored,
> with a warning."
>
> it does not say what happens when only one *named* index argument is
> given.  from the above, it would indeed seem that there is a *bug*
> here:  in the last example above only one index is given, and yet
> columns are not selected, even though the *language definition* says
> they should.  (so it's not a documented feature, it's a
> contra-definitional misfeature -- a bug?)
>
> somewhat on the side, the 'matrix-like indexing' above is fairly
> misleading;  just try the same patterns of indexing -- one index, two
> indices, named indices -- on a data frame and a matrix of the same shape:
>
>     m = matrix(1:9, 3, 3)
>     md = data.frame(m)
>
>     md[1]
>     # the first column
>     m[1]
>     # the first element (i.e., m[1,1])
>
>     md[,i=3]
>     # third row
>     m[,i=3]
>     # third column
>
>
> the quote above refers to the ?'[.data.frame' for details. 
> unfortunately, the help page is a lump of explanations for various
> '['-like operators, and it is *not* a definition of any sort.  it does
> not provide much more detail on '[.data.frame' -- it hardly serves as a
> design specification.  in particular, it does not explain the issue of
> named arguments to '[.data.frame' at all.
>
>
>> `[.data.frame` is only called with two arguments in the second case, so
>> the following condition is true:
>>
>> if(Narg < 3L) {  # list-like indexing or matrix indexing
>>
>> And then, the function assumes the argument it has been passed is i, and
>> eventually calls NextMethod("[") which I think calls
>> `[.listof`(x,i,...), since i is missing in `[.data.frame` it is not
>> passed to `[.listof`, so you have something equivalent to
>> as.list(d)[].
>>
>> I think we can replace the condition with this one:
>>
>> if(Narg < 3L && !has.j) {  # list-like indexing or matrix indexing
>>
>> or this:
>>
>> if(Narg < 3L) {  # list-like indexing or matrix indexing
>>        if(has.j) i <- j
>>
>>     
>
>
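
Why the Narg test alone cannot tell d[1] from d[j=1] can be seen with a toy function. The sketch below uses a hypothetical probe() that borrows the formals of `[.data.frame` and only reports how its arguments were supplied; it does no indexing and is not the actual base R code.

```r
# Hypothetical probe: same formals as `[.data.frame`, reporting only.
probe <- function(x, i, j, drop) {
  c(Narg  = nargs() - !missing(drop),
    has.i = !missing(i),
    has.j = !missing(j))
}

d <- data.frame(x = 1:10, y = 1:10, z = 1:10)

probe(d, 1)        # Narg 2, i supplied
probe(d, , 1)      # Narg 3, only j supplied (the empty slot counts)
probe(d, j = 1)    # Narg 2, only j supplied
```

The first and third calls both have Narg below 3, so a branch guarded only by Narg < 3L treats them alike; checking has.j as well, as both proposed conditions do, separates them.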
> indeed, for a moment i thought a trivial fix somewhere there would
> suffice.  unfortunately, the code for [.data.frame [2, lines 500-641] is
> so clean and readable that i had to give up reading it, forget fixing. 
> instead, i wrote a new version of '[.data.frame' from scratch.  it
> fixes (or at least seems to fix, as far as my quick assessment goes) the
> problem.  the function subdf (see the attached dataframe.r) is the new
> version of '[.data.frame':
>
>     # dummy data frame
>     d = structure(names=paste('col', 1:3, sep='.'),
>         data.frame(row.names=paste('row', 1:3, sep='.'),
>            matrix(1:9, 3, 3)))
>
>     d[j=1:2]
>     # incorrect: the whole data frame
>
>     subdf(d, j=1:2)
>     # correct, only the first two columns
>
> otherwise, subdf returns results equivalent (sensu all.equal;  see
> below) to those returned by [.data.frame on the same input, modulo some
> more or less minor details.  for example, i think the dropped-drop
> warnings go wrong in the original:
>
>     d[1, drop=FALSE]
>     # warning: drop argument will be ignored
>
> which suggests that dimensions will be dropped, while the intention is
> that the actual argument will be ignored and the value will be FALSE
> instead (while the default is TRUE, since i is specified).  well, it's
> just one more confusing bit in r.  the rewritten version warns about
> dropped drop only if it is explicitly TRUE:
>
>     subdf(d, 1, drop=FALSE)
>     # no warning
>     subdf(d, 1, drop=TRUE)
>     # warning
>
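
The warning produced by the stock '[.data.frame' can be captured for inspection rather than printed. This is a minimal sketch that assumes the warning is still raised in the R version in use; subdf is not available here, so only the original behaviour is shown.

```r
d <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# Capture the warning text instead of letting it print.
msg <- tryCatch(d[1, drop = FALSE],
                warning = function(w) conditionMessage(w))
msg

# The value itself is still a one-column data frame.
res <- suppressWarnings(d[1, drop = FALSE])
dim(res)
```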
> another issue that differs in my version is that i don't see much sense
> in being able to select rows by indexing with NA:
>
>     d[NA,1]
>     # one row filled with NAs
>
>     d[NA,]
>     # data frame of the shape of d, filled with NAs
>
> which is inconsistent with how NAs are treated in column indices (i.e.,
> they raise an error).  the rewritten version raises an error if any element
> of any index is an NA.
>
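
The asymmetry between NA row indices and NA column indices in the stock method can be sketched as follows. The names na_rows and na_cols are illustrative, and tryCatch() is used so that the error message is kept as a value instead of stopping the script.

```r
d <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# NA as a row index: accepted, yields rows filled with NAs.
na_rows <- d[NA, ]
na_rows

# NA as a column index: signals an error; keep the message.
na_cols <- tryCatch(d[, NA], error = function(e) conditionMessage(e))
na_cols
```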
> these minor differences are easily modifiable should compliance with the
> original 'design' be desirable.
>
> interestingly, there is a reduction in code by some 40 lines (~30%) wrt.
> the original, even though the new code is quite redundant (but so was
> the original, too).  with a little effort, it can be compressed further,
> but i felt it would become more convoluted and less readable, and also
> less efficient.  procedural abstraction could help, but would also
> negatively impact performance.  (presumably, an implementation in c
> would run faster.)
>
> incidentally (here's the best part!), my version seems to perform much
> better than the original, at least in a limited set of naive
> benchmarks.  here are some results, which you can (hopefully) reproduce
> using the code in the attached test.r.  the data is a dummy df with 1k
> rows and 1k columns, filled with rnorm;  each indexing was repeated 1000
> times for both the original and the modified version:
>
>    original patched ratio   test                                    
> 1  0.002    0.001      2.00 d[]                                     
> 2  0.027    0.001     27.00 d[drop = FALSE]                         
> 3  0.025    0.002     12.50 d[drop = TRUE]                          
> 4  0.026    0.002     13.00 d[, drop = FALSE]                       
> 5  0.026    0.003      8.67 d[, drop = TRUE]                        
> 6  1.274    0.002    637.00 d[, ]                                   
> 7  1.255    0.001   1255.00 d[, , ]                                 
> 8  1.183    0.001   1183.00 d[, , drop = FALSE]                     
> 9  1.183    0.003    394.33 d[, , drop = TRUE]                      
> 10 0.013    0.011      1.18 d[r]                                    
> 11 0.040    0.034      1.18 d[r, drop = TRUE]                       
> 12 0.037    0.010      3.70 d[r, drop = FALSE]                      
> 13 0.012    0.011      1.09 d[i = r]                                
> 14 0.036    0.034      1.06 d[i = r, drop = TRUE]                   
> 15 0.037    0.011      3.36 d[i = r, drop = FALSE]                  
> 16 0.222    0.163      1.36 d[rr]                                   
> 17 0.247    0.112      2.21 d[rr, drop = FALSE]                     
> 18 0.204    0.144      1.42 d[rr, drop = TRUE]                      
> 19 0.174    0.120      1.45 d[i = rr]                               
> 20 0.201    0.125      1.61 d[i = rr, drop = FALSE]                 
> 21 0.215    0.147      1.46 d[i = rr, drop = TRUE]                  
> 22 2.266    1.159      1.96 d[rr, ]                                 
> 23 2.236    1.164      1.92 d[rr, , drop = FALSE]                   
> 24 2.275    1.171      1.94 d[rr, , drop = TRUE]                    
> 25 2.269    1.165      1.95 d[i = rr, ]                             
> 26 2.264    1.155      1.96 d[i = rr, , drop = FALSE]               
> 27 2.290    1.189      1.93 d[i = rr, , drop = TRUE]                
> 28 2.301    1.198      1.92 d[, i = rr]                             
> 29 2.239    1.158      1.93 d[, i = rr, drop = FALSE]               
> 30 2.310    1.161      1.99 d[, i = rr, drop = TRUE]                
> 31 0.002    0.003      0.67 d[j = c]                                
> 32 0.026    0.011      2.36 d[j = c, drop = FALSE]                  
> 33 0.026    0.003      8.67 d[j = c, drop = TRUE]                   
> 34 0.001    0.111      0.01 d[j = cc]                               
> 35 0.025    0.110      0.23 d[j = cc, drop = FALSE]                 
> 36 0.025    0.111      0.23 d[j = cc, drop = TRUE]                  
> 37 0.243    0.051      4.76 d[rr, cc]                               
> 38 0.243    0.051      4.76 d[rr, cc, drop = FALSE]                 
> 39 0.244    0.050      4.88 d[rr, cc, drop = TRUE]                  
> 40 0.244    0.051      4.78 d[i = rr, cc]                           
> 41 0.243    0.050      4.86 d[i = rr, cc, drop = FALSE]             
> 42 0.244    0.051      4.78 d[i = rr, cc, drop = TRUE]              
> 43 0.243    0.052      4.67 d[cc, i = rr]                           
> 44 0.244    0.050      4.88 d[cc, i = rr, drop = FALSE]             
> 45 0.247    0.052      4.75 d[cc, i = rr, drop = TRUE]              
> 46 0.244    0.050      4.88 d[i = rr, j = cc]                       
> 47 0.244    0.051      4.78 d[i = rr, j = cc, drop = FALSE]         
> 48 0.244    0.051      4.78 d[i = rr, j = cc, drop = TRUE]          
> 49 0.244    0.051      4.78 d[j = cc, i = rr]                       
> 50 0.243    0.051      4.76 d[j = cc, i = rr, drop = FALSE]         
> 51 0.245    0.051      4.80 d[j = cc, i = rr, drop = TRUE]          
> 52 0.002    0.155      0.01 d[j = cn]                               
> 53 0.429    0.139      3.09 d[i = rn, j = cn]                       
> 54 1.791    0.690      2.60 d[i = c(TRUE, FALSE), j = c(FALSE, TRUE)]
>
> (note:  the benchmark relies on a feature of rbenchmark that i have just
> added, so you may need to download/update the package before trying.)
>
> in some tests, the difference is two orders of magnitude; in some it's a
> factor of 2-5;  in some there's no significant difference.  in only a
> few cases, the original is way faster (e.g., tests 34 and 52), but this
> is because the original is wrong there (it simply ignores the index, so
> no wonder).
>
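
For readers without the attached test.r, the timing approach can be approximated with base R alone. The sketch below deliberately swaps rbenchmark for system.time(), uses a smaller data frame than the 1k x 1k one described above, and times only the stock '[.data.frame'; time_expr, rr and cc are hypothetical names, and the numbers will not match the table.

```r
set.seed(1)
# Smaller than the 1k x 1k frame described above, to keep this quick.
d <- as.data.frame(matrix(rnorm(200 * 200), 200, 200))
rr <- sample(nrow(d), 50)   # hypothetical row index
cc <- sample(ncol(d), 50)   # hypothetical column index

# Elapsed time for `reps` evaluations of an unevaluated expression.
time_expr <- function(expr, reps = 100) {
  unname(system.time(for (k in seq_len(reps)) eval(expr))["elapsed"])
}

timings <- c(
  "d[rr, ]"   = time_expr(quote(d[rr, ])),
  "d[rr, cc]" = time_expr(quote(d[rr, cc])),
  "d[cc]"     = time_expr(quote(d[cc]))
)
timings
```

Timing a second implementation the same way and dividing the two vectors would give ratios comparable in spirit to the 'ratio' column.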
> all the expressions above used in benchmarking were also used to test
> the equivalence of output from the original and the new version (see
> test.r again), and all comparisons came out negative (no difference) --
> except for the cases where the original was wrong.
>
>
> i'd consider making a patch for src/library/base/R/dataframe.R, but
> there's a catch here:  it seems that some code relies on some part of the
> 'design' that differs between the rewrite and the original, and with the
> new code r does not build (dataframe.R itself does, but then other
> sources fail).  anyway, sourcing the attached dataframe.R suffices for
> testing. 
>
> i will be happy to learn where my implementation, benchmarking, and/or
> result checking are naive or wrong in any way, as they surely are.
>
>
> vQ
>
>
>
> [1] http://cran.r-project.org/doc/manuals/R-lang.pdf
> [2] http://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R
>   

-- 
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr