[Rd] sum() (and similar methods) should work for zero row data.frames

Martin Maechler maechler at stat.math.ethz.ch
Tue Oct 20 18:36:37 CEST 2020


>>>>> mb706  
>>>>>     on Sun, 18 Oct 2020 22:14:55 +0200 writes:

    >> From my side: it would be great if you (or R core) could prepare a patch, it would probably take me quite a bit longer than you since I don't have experience creating patches for R.

    > Best, Martin

Basically, just

1.  svn co https://svn.r-project.org/R/trunk  R-devel

2.  inside the R-devel source tree, find  src/library/base/R/dataframe.R
    make the *minimal* changes there,

    (then also add some regression tests and update the help :-)

3.  inside R-devel, do

        svn diff -x -ubw  >  mb706.patch

4.  you've got the patch file  mb706.patch  which you could
    attach to a bug report  on R's bugzilla

    (once you've got an account there ...
     As you've asked for that *and* as you've proven your good
     judgment about "true bug" vs. "not what I expected",
     I'll create such an account for you now, in spite of the 
     fact that I'd still like to know a bit more than "Martin
     mb706" about you  ...)
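
For the regression tests mentioned in step 2, something along these lines
might do. It is only a sketch: the version guard assumes the change ends up
in R 4.1.0 and would not be needed inside R's own tests/ directory, and the
object name d0 is just for illustration.

```r
## Sketch of regression tests for zero-row and logical data frames.
## Assumption: the patched behaviour is available from R 4.1.0 on, so
## older versions skip the checks instead of failing.
d0 <- data.frame(x = numeric(0))
if (getRversion() >= "4.1.0") {
    stopifnot(
        sum(d0) == 0,                        # consistent with sum(numeric(0))
        sum(airquality[FALSE, ]) == 0,       # zero rows, several columns
        isTRUE(any(data.frame(x = TRUE))),   # logical data frame, no error
        isTRUE(all(data.frame(x = logical(0))))
    )
}
```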

The changes have been committed to R-devel a quarter of an hour ago.
We will keep them in R-devel (currently planned to become R 
4.1.0 in spring 2021), and not port to the R-4.0.z branch, as
the change is something like an API change, and also because
nobody had ever reported this as an issue to our knowledge.
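
For code that also has to run on R 4.0.z or older, a small wrapper can
sidestep the as.matrix() coercion that triggers the error. A minimal sketch
only; the name sum_df is made up here and is not part of base R, and it
accepts numeric and logical columns only:

```r
## Hypothetical helper (not in base R): sum all entries of a data frame
## column by column, so that zero-row data frames give 0 instead of an error.
sum_df <- function(df, na.rm = FALSE) {
    stopifnot(is.data.frame(df))
    ## accept numeric and logical columns only
    ok <- vapply(df, function(col) is.numeric(col) || is.logical(col),
                 logical(1))
    if (!all(ok))
        stop("only defined on a data frame with numeric or logical variables")
    ## per-column sums; an empty column contributes 0
    sum(vapply(df, sum, numeric(1), na.rm = na.rm))
}

a <- na.omit(airquality)
sum_df(a[FALSE, ])              # 0
sum_df(subset(a, Ozone > 200))  # 0, no row has Ozone > 200
```

For non-empty data frames this should agree with sum(as.matrix(a)), up to
floating point rounding.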

Thank you, Martin B706 for bringing the issue up,  and Gabe and Peter
for chiming in !!

Best regards,
Martin Maechler
ETH Zurich  and  R core team
    

    > On Sun, Oct 18, 2020, at 21:49, Gabriel Becker wrote:
    >> Peter et al,
    >> 
    >> I had the same thought, in particular for any() and all(), which, inasmuch
    >> as they should work on data.frames in the first place (which, to be 
    >> perfectly honest, I do find quite debatable myself), should certainly 
    >> work on "logical" data.frames if they are going to work on "numeric" 
    >> ones. 
    >> 
    >> I can volunteer to prepare a patch if Martin (the reporter) did not 
    >> want to take a crack at it, and further if it is not already being done 
    >> within R-core.
    >> 
    >> Best,
    >> ~G
    >> 
    >> On Sun, Oct 18, 2020 at 12:19 AM peter dalgaard <pdalgd using gmail.com> wrote:
    >> > Hmm, yes, this is probably wrong. E.g., we are likely to get inconsistencies out of boundary cases like this
    >> > 
    >> > > a <- na.omit(airquality)
    >> > > sum(a)
    >> > [1] 37495.3
    >> > > sum(a[FALSE,])
    >> > Error in FUN(X[[i]], ...) : 
    >> >   only defined on a data frame with all numeric variables
    >> > 
    >> > Or, closer to an actual use case:
    >> > 
    >> > > sum(subset(a, Ozone>100))
    >> > [1] 3330.5
    >> > > sum(subset(a, Ozone>200))
    >> > Error in FUN(X[[i]], ...) : 
    >> >   only defined on a data frame with all numeric variables
    >> > 
    >> > 
    >> > However, given that numeric summaries generally treat logicals as 0/1, wouldn't it be easiest just to extend the check inside Summary.data.frame with "&& !is.logical(x)"?
    >> > 
    >> > > sum(as.matrix(a[FALSE,]))
    >> > [1] 0
    >> > 
    >> > -pd
    >> > 
    >> > > On 17 Oct 2020, at 21:18 , Martin <rdev using mb706.com> wrote:
    >> > > 
    >> > > The "Summary" group generics always throw errors for a data.frame with zero rows, for example:
    >> > >> sum(data.frame(x = numeric(0)))
    >> > > #> Error in FUN(X[[i]], ...) : 
    >> > > #>   only defined on a data frame with all numeric variables
    >> > > Same behaviour for min, max, any, all, ... . I believe this is inconsistent with what these methods do for other empty objects (vectors, matrices), where the return value is chosen to ensure transitivity: sum(numeric(0)) == 0.
    >> > > 
    >> > > The reason for this is that the return type of as.matrix() for empty (no rows or no columns) data.frame objects is always a matrix of type "logical". The Summary method for data.frame, in turn, throws an error when the data.frame, converted to a matrix, is not of numeric type.
    >> > > 
    >> > > I suggest two ways that make sum, min, max, ... more consistent. IMHO it would be fitting to implement both of these fixes, because they also make other things more consistent.
    >> > > 
    >> > > 1. Make the return type of as.matrix() for zero-row data.frames consistent with the type that would have been returned had the data.frame had more than zero rows. "as.matrix(data.frame(x = numeric(0)))" should then be numeric; if there is an empty "character" column, the returned matrix should be a character matrix, etc. This would make subsetting by row and conversion to matrix commute (except for row names sometimes):
    >> > >> all.equal(as.matrix(df[rows, , drop = FALSE]), as.matrix(df)[rows, , drop = FALSE])
    >> > > Furthermore, this change would make as.matrix.data.frame obey the documentation, which indicates that the coercion hierarchy is used for the return type.
    >> > > 
    >> > > 2. Make the Summary.data.frame method accept data.frames that produce non-numeric matrices. Next to the main focus of this message, I believe it would e.g. be fitting to have any() and all() work on logical data.frame objects. The current behaviour is such that
    >> > >> any(data.frame(x = 1))
    >> > > #> [1] TRUE
    >> > > #> Warning message:
    >> > > #> In any(1, na.rm = FALSE) : coercing argument of type 'double' to logical
    >> > > and
    >> > >> any(data.frame(x = TRUE))
    >> > > #> Error in FUN(X[[i]], ...) : 
    >> > > #>   only defined on a data frame with all numeric variables
    >> > > So a numeric data.frame warns about implicit coercion, while a logical data.frame (which would not need coercion) does not work at all.
    >> > > 
    >> > > (I feel more strongly about fixing 1. than 2., because I don't know the discussion that led to the behaviour described in 2.)
    >> > > 
    >> > > Best,
    >> > > Martin
    >> > > 
    >> > > ______________________________________________
    >> > > R-devel using r-project.org mailing list
    >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > 
    >> > -- 
    >> > Peter Dalgaard, Professor,
    >> > Center for Statistics, Copenhagen Business School
    >> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
    >> > Phone: (+45)38153501
    >> > Office: A 4.23
    >> > Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com
    >> > 
