[Rd] sum() (and similar methods) should work for zero row data.frames

Pages, Herve hpages using fredhutch.org
Fri Oct 23 23:44:24 CEST 2020


Hi,

There are 2 bugs here. The proposed fix to Summary.data.frame() is fine
but it doesn't address the other problem reported by the OP: as.matrix()
on a zero-row data.frame doesn't respect the type of its columns, whereas
other column-combining operations do:

   df <- data.frame(a=numeric(0), b=numeric(0))

   typeof(as.matrix(df))
   # [1] "logical"

   typeof(unlist(df))
   # [1] "double"

   typeof(do.call(c, df))
   # [1] "double"
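
For what it's worth, here is a small sketch of why that logical type trips the numeric check the OP described in Summary.data.frame(). It assumes, per the OP's report, that as.matrix() hands back an all-NA logical matrix for a zero-row data.frame; the array() call below only mimics that result and m is just an illustrative name.

```r
# Stand-in for what as.matrix() was reported to return for a
# zero-row, two-column data.frame (an assumption based on the OP's report).
m <- array(NA, dim = c(0L, 2L))

typeof(m)
# [1] "logical"

is.numeric(m)
# [1] FALSE

# The same empty matrix stored as double passes a numeric check.
storage.mode(m) <- "double"
is.numeric(m)
# [1] TRUE

sum(m)
# [1] 0
```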

I've run into this myself on a couple of occasions (not in the context
of Summary methods) and worked around it with something like:

   as_matrix_data_frame <- function(df)
   {
     ans <- as.matrix(df)
     if (nrow(df) == 0L)
         storage.mode(ans) <- typeof(unlist(df))
     ans
   }

No reason as.matrix.data.frame() couldn't do something similar.
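
As a quick check of the workaround, here is a minimal sketch that applies it to two zero-row data.frames. The names df_num, df_chr, m_num and m_chr are made up for illustration, and the wrapper is repeated so the snippet runs on its own. Only the wrapper's result is inspected, so this should not depend on what as.matrix() itself returns in a given version of R.

```r
# Workaround from above, repeated so that this snippet is self-contained.
as_matrix_data_frame <- function(df)
{
  ans <- as.matrix(df)
  if (nrow(df) == 0L)
      storage.mode(ans) <- typeof(unlist(df))
  ans
}

# Hypothetical zero-row inputs: all numeric, and mixed numeric/character.
df_num <- data.frame(a=numeric(0), b=numeric(0))
df_chr <- data.frame(a=numeric(0), b=character(0), stringsAsFactors=FALSE)

m_num <- as_matrix_data_frame(df_num)
m_chr <- as_matrix_data_frame(df_chr)

typeof(m_num)
# [1] "double"

dim(m_num)
# [1] 0 2

typeof(m_chr)
# [1] "character"

# An empty double matrix sums to 0, like numeric(0) does.
sum(m_num)
# [1] 0
```

This relies on unlist() picking the combined type even when every column has length zero, which is the same behaviour shown with typeof(unlist(df)) earlier in this message.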

Cheers,
H.


On 10/20/20 09:36, Martin Maechler wrote:
>>>>>> mb706
>>>>>>      on Sun, 18 Oct 2020 22:14:55 +0200 writes:
> 
>      >> From my side: it would be great if you (or R core) could prepare a patch, it would probably take me quite a bit longer than you since I don't have experience creating patches for R.
> 
>      > Best, Martin
> 
> Basically, just
> 
> 1.  svn co https://svn.r-project.org/R/trunk   R-devel
> 
> 2.  inside the R-devel source tree, find  src/library/base/R/dataframe.R
>      make the *minimal* changes there,
> 
>      (then also add some regression tests and update the help :-)
> 
> 3.  inside R-devel, do
> 
>          svn diff -x -ubw  >  mb706.patch
> 
> 4.  you've got the patch file  mb706.patch  which you could
>      attach to a bug report  on R's bugzilla
> 
>      (once you've got an account there ...
>       As you've asked for that *and* as you've proven your good
>       judgment about "true bug" vs. "not what I expected",
>       I'll create such an account for you now, in spite of the
>       fact that I'd still like to know a bit more than "Martin
>       mb706" about you  ...)
> 
> The changes have been committed to R-devel a quarter of an hour ago.
> We will keep them in R-devel (currently planned to become R
> 4.1.0 in spring 2021), and not port to the R-4.0.z branch, as
> the change is something like an API change, and also because
> nobody had ever reported this as an issue to our knowledge.
> 
> Thank you, Martin B706 for bringing the issue up,  and Gabe and Peter
> for chiming in !!
> 
> Best regards,
> Martin Maechler
> ETH Zurich  and  R core team
>      
> 
>      > On Sun, Oct 18, 2020, at 21:49, Gabriel Becker wrote:
>      >> Peter et al,
>      >>
>      >> I had the same thought, in particular for any() and all(), which inasmuch
>      >> as they should work on data.frames in the first place (which to be
>      >> perfectly honest I do find quite debatable myself), should certainly
>      >> work on "logical" data.frames if they are going to work on "numeric"
>      >> ones.
>      >>
>      >> I can volunteer to prepare a patch if Martin (the reporter) did not
>      >> want to take a crack at it, and further if it is not already being done
>      >> within R-core.
>      >>
>      >> Best,
>      >> ~G
>      >>
>      >> On Sun, Oct 18, 2020 at 12:19 AM peter dalgaard <pdalgd using gmail.com> wrote:
>      >> > Hmm, yes, this is probably wrong. E.g., we are likely to get inconsistencies out of boundary cases like this
>      >> >
>      >> > > a <- na.omit(airquality)
>      >> > > sum(a)
>      >> > [1] 37495.3
>      >> > > sum(a[FALSE,])
>      >> > Error in FUN(X[[i]], ...) :
>      >> >   only defined on a data frame with all numeric variables
>      >> >
>      >> > Or, closer to an actual use case:
>      >> >
>      >> > > sum(subset(a, Ozone>100))
>      >> > [1] 3330.5
>      >> > > sum(subset(a, Ozone>200))
>      >> > Error in FUN(X[[i]], ...) :
>      >> >   only defined on a data frame with all numeric variables
>      >> >
>      >> >
>      >> > However, given that numeric summaries generally treat logicals as 0/1, wouldn't it be easiest just to extend the check inside Summary.data.frame with "&& !is.logical(x)"?
>      >> >
>      >> > > sum(as.matrix(a[FALSE,]))
>      >> > [1] 0
>      >> >
>      >> > -pd
>      >> >
>      >> > > On 17 Oct 2020, at 21:18 , Martin <rdev using mb706.com> wrote:
>      >> > >
>      >> > > The "Summary" group generics always throw errors for a data.frame with zero rows, for example:
>      >> > >> sum(data.frame(x = numeric(0)))
>      >> > > #> Error in FUN(X[[i]], ...) :
>      >> > > #>   only defined on a data frame with all numeric variables
>      >> > > Same behaviour for min, max, any, all, ... . I believe this is inconsistent with what these methods do for other empty objects (vectors, matrices), where the return value is chosen to ensure transitivity: sum(numeric(0)) == 0.
>      >> > >
>      >> > > The reason for this is that the return type of as.matrix() for empty (no rows or no columns) data.frame objects is always a matrix of type "logical". The Summary method for data.frame, in turn, throws an error when the data.frame, converted to a matrix, is not of numeric type.
>      >> > >
>      >> > > I suggest two ways that make sum, min, max, ... more consistent. IMHO it would be fitting to implement both of these fixes, because they also make other things more consistent.
>      >> > >
>      >> > > 1. Make the return type of as.matrix() for zero-row data.frames consistent with the type that would have been returned, had the data.frame had more than zero rows. "as.matrix(data.frame(x = numeric(0)))" should then be numeric; if there is an empty "character" column, the returned matrix should be of type character, etc. This would make subsetting by row and conversion to matrix commute (except for row names sometimes):
>      >> > >> all.equal(as.matrix(df[rows, , drop = FALSE]), as.matrix(df)[rows, , drop = FALSE])
>      >> > > Furthermore, this change would make as.matrix.data.frame obey the documentation, which indicates that the coercion hierarchy is used for the return type.
>      >> > >
>      >> > > 2. Make the Summary.data.frame method accept data.frames that produce non-numeric matrices. Next to the main focus of this message, I believe it would e.g. be fitting to have any() and all() work on logical data.frame objects. The current behaviour is such that
>      >> > >> any(data.frame(x = 1))
>      >> > > #> [1] TRUE
>      >> > > #> Warning message:
>      >> > > #> In any(1, na.rm = FALSE) : coercing argument of type 'double' to logical
>      >> > > and
>      >> > >> any(data.frame(x = TRUE))
>      >> > > #> Error in FUN(X[[i]], ...) :
>      >> > > #>   only defined on a data frame with all numeric variables
>      >> > > So a numeric data.frame warns about implicit coercion, while a logical data.frame (which would not need coercion) does not work at all.
>      >> > >
>      >> > > (I feel more strongly about fixing 1. than 2., because I don't know the discussion that led to the behaviour described in 2.)
>      >> > >
>      >> > > Best,
>      >> > > Martin
>      >> > >
>      >> > > ______________________________________________
>      >> > > R-devel using r-project.org mailing list
>      >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
>      >> >
>      >> > --
>      >> > Peter Dalgaard, Professor,
>      >> > Center for Statistics, Copenhagen Business School
>      >> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>      >> > Phone: (+45)38153501
>      >> > Office: A 4.23
>      >> > Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

