[Rd] [R] "[.data.frame" and lapply

Fri Mar 27 22:27:41 CET 2009

redirected to r-devel, because there are implementation details of
[.data.frame discussed here.  spoiler: at the bottom there is a fairly
interesting performance result.

Romain Francois wrote:
>
> Hi,
>
> This is a bug I think. [.data.frame treats its arguments differently
> depending on the number of arguments.

you might want to hesitate a bit before you say that something in r is a
bug, if only because it drives certain people mad.  r is carefully
tested software, and [.data.frame is such a basic function that if what
you talk about were a bug, it wouldn't have persisted until now.

treating the arguments differently depending on their number is actually
(if not clearly...) documented:  if there is one index (the 'i'), it selects
columns.  if there are two, 'i' selects rows.

however, not all seems fine, there might be a design flaw:

    # dummy data frame
    d = structure(names=paste('col', 1:3, sep='.'),
        data.frame(row.names=paste('row', 1:3, sep='.'),
           matrix(1:9, 3, 3)))

    d[1:2]
    # correctly selects two first columns
    # 1:2 passed to [.data.frame as i, no j given

    d[,1:2]
    # correctly selects two first columns
    # 1:2 passed to [.data.frame as j, i given the missing argument
    # value (note the comma)

    d[,i=1:2]
    # correctly selects two first rows
    # 1:2 passed to [.data.frame as i, j given the missing argument
    # value (note the comma)

    d[j=1:2,]
    # correctly selects two first columns
    # 1:2 passed to [.data.frame as j, i given the missing argument
    # value (note the comma)

    d[i=1:2]
    # correctly (arguably) selects the first two columns
    # 1:2 passed to [.data.frame as i, no j given

    d[j=1:2]
    # wrong: returns the whole data frame
    # does not recognize the index as i because it is explicitly named 'j'
    # does not recognize the index as j because there is only one index

i say this *might* be a design flaw because it's hard to judge what the
design really is.  the r language definition (!) [1, sec. 3.4.3 p. 18] says:

"   The most important example of a class method for [ is that used for
data frames. It is not be described in detail here (see the help page
for [.data.frame, but in broad terms, if two indices are supplied (even
if one is empty) it creates matrix-like indexing for a structure that is
basically a list of vectors of the same length. If a single index is
supplied, it is interpreted as indexing the list of columns—in that case
the drop argument is ignored, with a warning."

it does not say what happens when only one *named* index argument is
given.  from the above, it would indeed seem that there is a *bug*
here:  in the last example above only one index is given, and yet
columns are not selected, even though the *language definition* says
they should.  (so it's not a documented feature, it's a
contra-definitional misfeature -- a bug?)

somewhat on the side, the 'matrix-like indexing' above is fairly
misleading;  just try the same patterns of indexing -- one index, two
indices, named indices -- on a data frame and a matrix of the same shape:

    m = matrix(1:9, 3, 3)
    md = data.frame(m)

    md[1]
    # the first column
    m[1]
    # the first element (i.e., m[1,1])

    md[,i=3]
    # third row
    m[,i=3]
    # third column

the quote above refers to ?'[.data.frame' for details.
unfortunately, the help page is a lump of explanations for various
'['-like operators, and it is *not* a definition of any sort.  it does
not provide much more detail on '[.data.frame' -- it hardly serves as a
design specification.  in particular, it does not explain the issue of
named arguments to '[.data.frame' at all.

> `[.data.frame` only is called with two arguments in the second case, so
> the following condition is true:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>
> And then, the function assumes the argument it has been passed is i, and
> eventually calls NextMethod("[") which I think calls
> `[.listof`(x,i,...), since i is missing in `[.data.frame` it is not
> passed to `[.listof`, so you have something equivalent to as.list(d)[].
>
> I think we can replace the condition with this one:
>
> if(Narg < 3L && !has.j) {  # list-like indexing or matrix indexing
>
> or this:
>
> if(Narg < 3L) {  # list-like indexing or matrix indexing
>        if(has.j) i <- j
>
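to make the branch on the number of arguments concrete, here is a minimal
sketch of the second suggestion.  `pick` is a hypothetical toy function
written for illustration only; it is not the real [.data.frame and only
mimics the dispatch on nargs() and missing():

```r
# dummy data frame, same shape as in the examples above
d <- structure(names = paste('col', 1:3, sep = '.'),
    data.frame(row.names = paste('row', 1:3, sep = '.'),
        matrix(1:9, 3, 3)))

# hypothetical toy, not the real method: mimics the Narg branch only
pick <- function(x, i, j) {
    n <- nargs()                   # counts x plus every index slot, even empty ones
    if (n < 3L) {                  # list-like branch: at most one index supplied
        if (!missing(j)) i <- j    # the suggested remap: a lone j acts as i
        if (missing(i)) return(x)  # no index at all: return everything
        return(x[i])               # positional call, selects columns
    }
    x[i, j, drop = FALSE]          # two index slots: matrix-like indexing
}

names(pick(d, j = 1:2))   # a lone named j selects columns here
dim(pick(d, 1:2, 1))      # two indices: rows 1:2 of column 1
```

the point of the sketch is only that a lone named j still arrives with
nargs() == 2, so the list-like branch has to look at j explicitly if it
is not to be silently dropped.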

indeed, for a moment i thought a trivial fix somewhere there would
suffice.  unfortunately, the code for [.data.frame [2, lines 500-641] is
so clean and readable that i had to give up reading it, forget fixing. 
instead, i wrote a new version of '[.data.frame' from scratch.  it
fixes (or at least seems to fix, as far as my quick assessment goes) the
problem.  the function subdf (see the attached dataframe.r) is the new
version of '[.data.frame':

    # dummy data frame
    d = structure(names=paste('col', 1:3, sep='.'),
        data.frame(row.names=paste('row', 1:3, sep='.'),
           matrix(1:9, 3, 3)))

    d[j=1:2]
    # incorrect: the whole data frame

    subdf(d, j=1:2)
    # correct, only the first two columns

otherwise, subdf returns results equivalent (sensu all.equal;  see
below) to those returned by [.data.frame on the same input, modulo some
more or less minor details.  for example, i think the dropped-drop
warnings go wrong in the original:

    d[1, drop=FALSE]
    # warning: drop argument will be ignored

which suggests that dimensions will be dropped, while the intention is
that the actual argument will be ignored and the value will be FALSE
instead (while the default is TRUE, since i is specified).  well, it's
just one more confusing bit in r.  the rewritten version warns about
dropped drop only if it is explicitly TRUE:

    subdf(d, 1, drop=FALSE)
    # no warning
    subdf(d, 1, drop=TRUE)
    # warning

another thing that differs in my version is that i don't see much sense
in being able to select rows by indexing with NA:

    d[NA,1]
    # one row filled with NAs

    d[NA,]
    # data frame of the shape of d, filled with NAs

which is inconsistent with how NAs are treated in column indices (i.e.,
they raise an error).  the rewritten version raises an error if any element
of any index is an NA.
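
the asymmetry can be checked directly with the stock [.data.frame.  a
minimal sketch, using only positional indices and assuming the behaviour
described above (NA accepted as a row index, rejected as a column index):

```r
# dummy data frame, same shape as in the examples above
d <- structure(names = paste('col', 1:3, sep = '.'),
    data.frame(row.names = paste('row', 1:3, sep = '.'),
        matrix(1:9, 3, 3)))

rows <- d[NA, ]                                   # NA as a row index is accepted
cols <- tryCatch(d[, NA], error = function(e) e)  # NA as a column index is not

dim(rows)                # same shape as d
all(is.na(rows))         # every cell is NA
inherits(cols, "error")  # the column case raised an error
```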

these minor differences are easily modifiable should compliance with the
original 'design' be desirable.

interestingly, there is a reduction in code by some 40 lines (~30%) wrt.
the original, even though the new code is quite redundant (but so was
the original).  with a little effort, it can be compressed further,
but i felt it would become more convoluted and less readable, and also
less efficient.  procedural abstraction could help, but would also
negatively impact performance.  (presumably, an implementation in c
would run faster.)

incidentally (here's the best part!), my version seems to perform much
better than the original, at least in a limited set of naive
benchmarks.  here are some results, which you can (hopefully) reproduce
using the code in the attached test.r.  the data is a dummy df with 1k
rows and 1k columns, filled with rnorm;  each indexing was repeated 1000
times for both the original and the modified version:

   original patched ratio   test                                    
1  0.002    0.001      2.00 d[]                                     
2  0.027    0.001     27.00 d[drop = FALSE]                         
3  0.025    0.002     12.50 d[drop = TRUE]                          
4  0.026    0.002     13.00 d[, drop = FALSE]                       
5  0.026    0.003      8.67 d[, drop = TRUE]                        
6  1.274    0.002    637.00 d[, ]                                   
7  1.255    0.001   1255.00 d[, , ]                                 
8  1.183    0.001   1183.00 d[, , drop = FALSE]                     
9  1.183    0.003    394.33 d[, , drop = TRUE]                      
10 0.013    0.011      1.18 d[r]                                    
11 0.040    0.034      1.18 d[r, drop = TRUE]                       
12 0.037    0.010      3.70 d[r, drop = FALSE]                      
13 0.012    0.011      1.09 d[i = r]                                
14 0.036    0.034      1.06 d[i = r, drop = TRUE]                   
15 0.037    0.011      3.36 d[i = r, drop = FALSE]                  
16 0.222    0.163      1.36 d[rr]                                   
17 0.247    0.112      2.21 d[rr, drop = FALSE]                     
18 0.204    0.144      1.42 d[rr, drop = TRUE]                      
19 0.174    0.120      1.45 d[i = rr]                               
20 0.201    0.125      1.61 d[i = rr, drop = FALSE]                 
21 0.215    0.147      1.46 d[i = rr, drop = TRUE]                  
22 2.266    1.159      1.96 d[rr, ]                                 
23 2.236    1.164      1.92 d[rr, , drop = FALSE]                   
24 2.275    1.171      1.94 d[rr, , drop = TRUE]                    
25 2.269    1.165      1.95 d[i = rr, ]                             
26 2.264    1.155      1.96 d[i = rr, , drop = FALSE]               
27 2.290    1.189      1.93 d[i = rr, , drop = TRUE]                
28 2.301    1.198      1.92 d[, i = rr]                             
29 2.239    1.158      1.93 d[, i = rr, drop = FALSE]               
30 2.310    1.161      1.99 d[, i = rr, drop = TRUE]                
31 0.002    0.003      0.67 d[j = c]                                
32 0.026    0.011      2.36 d[j = c, drop = FALSE]                  
33 0.026    0.003      8.67 d[j = c, drop = TRUE]                   
34 0.001    0.111      0.01 d[j = cc]                               
35 0.025    0.110      0.23 d[j = cc, drop = FALSE]                 
36 0.025    0.111      0.23 d[j = cc, drop = TRUE]                  
37 0.243    0.051      4.76 d[rr, cc]                               
38 0.243    0.051      4.76 d[rr, cc, drop = FALSE]                 
39 0.244    0.050      4.88 d[rr, cc, drop = TRUE]                  
40 0.244    0.051      4.78 d[i = rr, cc]                           
41 0.243    0.050      4.86 d[i = rr, cc, drop = FALSE]             
42 0.244    0.051      4.78 d[i = rr, cc, drop = TRUE]              
43 0.243    0.052      4.67 d[cc, i = rr]                           
44 0.244    0.050      4.88 d[cc, i = rr, drop = FALSE]             
45 0.247    0.052      4.75 d[cc, i = rr, drop = TRUE]              
46 0.244    0.050      4.88 d[i = rr, j = cc]                       
47 0.244    0.051      4.78 d[i = rr, j = cc, drop = FALSE]         
48 0.244    0.051      4.78 d[i = rr, j = cc, drop = TRUE]          
49 0.244    0.051      4.78 d[j = cc, i = rr]                       
50 0.243    0.051      4.76 d[j = cc, i = rr, drop = FALSE]         
51 0.245    0.051      4.80 d[j = cc, i = rr, drop = TRUE]          
52 0.002    0.155      0.01 d[j = cn]                               
53 0.429    0.139      3.09 d[i = rn, j = cn]                       
54 1.791    0.690      2.60 d[i = c(TRUE, FALSE), j = c(FALSE, TRUE)]

(note:  the benchmark relies on a feature of rbenchmark that i have just
added, so you may need to download/update the package before trying.)
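
for readers without rbenchmark at hand, the same kind of measurement can
be sketched in base r with system.time.  this is not the code from
test.r: `time_expr` is a hypothetical helper, the data frame is smaller
than the 1k x 1k one used above, and the number of repetitions is reduced
so that it runs quickly:

```r
set.seed(1)
n <- 200                                   # smaller than the 1k x 1k used above
d <- as.data.frame(matrix(rnorm(n * n), n, n))

# hypothetical helper: elapsed seconds for `reps` evaluations of an expression
time_expr <- function(expr, reps = 100) {
    expr <- substitute(expr)               # capture the unevaluated call
    env <- parent.frame()                  # evaluate it where the caller lives
    unname(system.time(
        for (k in seq_len(reps)) eval(expr, env)
    )["elapsed"])
}

timings <- c("d[]" = time_expr(d[]), "d[, ]" = time_expr(d[, ]))
timings
```

absolute numbers will differ between machines and r versions; only the
ratio between two implementations measured the same way is informative.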

in some tests, the difference is two orders of magnitude; in some it's a
factor of 2-5;  in some there's no significant difference.  in only a
few cases, the original is way faster (e.g., tests 34 and 52), but this
is because the original is wrong there (it simply ignores the index, so
no wonder).

all the expressions above used in benchmarking were also used to test
the equivalence of output from the original and the new version (see
test.r again), and all of them were negative (no difference) -- except
for the cases where the original was wrong.
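
the idea of such an equivalence check (sensu all.equal) can be sketched
as follows.  this is not the code from test.r: `reference` and
`candidate` are hypothetical stand-ins, with `candidate` playing the role
of a rewrite, and the list of indices is made up for illustration:

```r
# dummy data frame, same shape as in the examples above
d <- structure(names = paste('col', 1:3, sep = '.'),
    data.frame(row.names = paste('row', 1:3, sep = '.'),
        matrix(1:9, 3, 3)))

reference <- function(x, cols) x[cols]                  # one-index column selection
candidate <- function(x, cols) x[, cols, drop = FALSE]  # stand-in for a rewrite

indices <- list(1, 1:2, c(3, 1), -1)       # made-up column indices to try

# TRUE where both implementations agree up to all.equal
same <- vapply(indices, function(k)
    isTRUE(all.equal(reference(d, k), candidate(d, k))), logical(1))
same
```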

i'd consider making a patch for src/library/base/R/dataframe.R, but
there's a catch here:  it seems that some code relies on some part of the
'design' that differs between the rewrite and the original, and r does
not build with the new code (dataframe.R itself does, but then other
sources fail).  anyway, sourcing the attached dataframe.r suffices for
testing.

i will be happy to learn where my implementation, benchmarking, and/or
result checking are naive or wrong in any way, as they surely are.

vQ

[1] http://cran.r-project.org/doc/manuals/R-lang.pdf
[2] http://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dataframe.r
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20090327/230f7872/attachment.pl>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: test.r
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20090327/230f7872/attachment-0001.pl>