[Rd] sum() (and similar methods) should work for zero row data.frames
Pages, Herve
hpages using fredhutch.org
Fri Oct 23 23:44:24 CEST 2020
Hi,
There are 2 bugs here. The proposed fix to Summary.data.frame() is fine
but it doesn't address the other problem reported by the OP that
as.matrix() on a zero-row data.frame doesn't respect the type of its
columns, like other column-combining operations do:
df <- data.frame(a=numeric(0), b=numeric(0))
typeof(as.matrix(df))
# [1] "logical"
typeof(unlist(df))
# [1] "double"
typeof(do.call(c, df))
# [1] "double"
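To see why this matters for the Summary methods, here is a minimal sketch (not taken from the R sources). It assumes, as described in the original report quoted below, that the method looks at the type of the matrix returned by as.matrix(). The result of the first check may differ between R versions, so no output is shown for it:

```r
df <- data.frame(a = numeric(0), b = numeric(0))

# The matrix type is what an is.numeric()-style check would see.
# In the behaviour reported above this is a logical matrix.
m0 <- as.matrix(df)
is.numeric(m0)

# Column-combining operations keep the column type, even with zero rows.
is.numeric(unlist(df))
# [1] TRUE

# With at least one row, as.matrix() agrees with the column type.
m1 <- as.matrix(data.frame(a = 1, b = 2))
is.numeric(m1)
# [1] TRUE
```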
I've run into this myself on a couple of occasions (not in the context
of Summary methods) and worked around it with something like:
as_matrix_data_frame <- function(df)
{
    ans <- as.matrix(df)
    if (nrow(df) == 0L)
        storage.mode(ans) <- typeof(unlist(df))
    ans
}
No reason as.matrix.data.frame() couldn't do something similar.
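For illustration, a small usage sketch of the workaround. The data frames df and df2 are made up for this example, and stringsAsFactors = FALSE is passed explicitly (an assumption on my side) so that the character column is not turned into a factor on older R versions:

```r
# Workaround from above, repeated so that the example is self-contained.
as_matrix_data_frame <- function(df)
{
    ans <- as.matrix(df)
    if (nrow(df) == 0L)
        storage.mode(ans) <- typeof(unlist(df))
    ans
}

# Zero-row data.frame with two numeric columns.
df <- data.frame(a = numeric(0), b = numeric(0))
m <- as_matrix_data_frame(df)
typeof(m)
# [1] "double"
dim(m)
# [1] 0 2
sum(m)
# [1] 0

# Mixed numeric/character columns: the result follows the coercion hierarchy.
df2 <- data.frame(a = numeric(0), s = character(0), stringsAsFactors = FALSE)
typeof(as_matrix_data_frame(df2))
# [1] "character"
```

This relies on unlist() and as.matrix() agreeing on the result type, which is the case for plain logical, numeric and character columns. It may not hold for other column classes such as factors, which would need separate handling.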
Cheers,
H.
On 10/20/20 09:36, Martin Maechler wrote:
>>>>>> mb706
>>>>>> on Sun, 18 Oct 2020 22:14:55 +0200 writes:
>
> >> From my side: it would be great if you (or R core) could prepare a patch, it would probably take me quite a bit longer than you since I don't have experience creating patches for R.
>
> > Best, Martin
>
> Basically, just
>
> 1. svn co https://svn.r-project.org/R/trunk R-devel
>
> 2. inside the R-devel source tree, find src/library/base/R/dataframe.R
> make the *minimal* changes there,
>
> (then also add some regression tests and update the help :-)
>
> 3. inside R-devel, do
>
> svn diff -x -ubw > mb706.patch
>
> 4. you've got the patch file mb706.patch which you could
> attach to a bug report on R's bugzilla
>
> (once you've got an account there ...
> As you've asked for that *and* as you've proven your good
> judgment about "true bug" vs. "not what I expected",
> I'll create such an account for you now, in spite of the
> fact that I'd still like to know a bit more than "Martin
> mb706" about you ...)
>
> The changes have been committed to R-devel a quarter of an hour ago.
> We will keep them in R-devel (currently planned to become R
> 4.1.0 in spring 2021), and not port to the R-4.0.z branch, as
> the change is something like an API change, and also because
> nobody had ever reported this as an issue to our knowledge.
>
> Thank you, Martin B706 for bringing the issue up, and Gabe and Peter
> for chiming in !!
>
> Best regards,
> Martin Maechler
> ETH Zurich and R core team
>
>
> > On Sun, Oct 18, 2020, at 21:49, Gabriel Becker wrote:
> >> Peter et al,
> >>
> >> I had the same thought, in particular for any() and all(), which,
> >> inasmuch as they should work on data.frames in the first place (which,
> >> to be perfectly honest, I do find quite debatable myself), should
> >> certainly work on "logical" data.frames if they are going to work on
> >> "numeric" ones.
> >>
> >> I can volunteer to prepare a patch if Martin (the reporter) did not
> >> want to take a crack at it, and further if it is not already being done
> >> within R-core.
> >>
> >> Best,
> >> ~G
> >>
> >> On Sun, Oct 18, 2020 at 12:19 AM peter dalgaard <pdalgd using gmail.com> wrote:
> >> > Hmm, yes, this is probably wrong. E.g., we are likely to get inconsistencies out of boundary cases like this
> >> >
> >> > > a <- na.omit(airquality)
> >> > > sum(a)
> >> > [1] 37495.3
> >> > > sum(a[FALSE,])
> >> > Error in FUN(X[[i]], ...) :
> >> > only defined on a data frame with all numeric variables
> >> >
> >> > Or, closer to an actual use case:
> >> >
> >> > > sum(subset(a, Ozone>100))
> >> > [1] 3330.5
> >> > > sum(subset(a, Ozone>200))
> >> > Error in FUN(X[[i]], ...) :
> >> > only defined on a data frame with all numeric variables
> >> >
> >> >
> >> > However, given that numeric summaries generally treat logicals as 0/1, wouldn't it be easiest just to extend the check inside Summary.data.frame with "&& !is.logical(x)"?
> >> >
> >> > > sum(as.matrix(a[FALSE,]))
> >> > [1] 0
> >> >
> >> > -pd
> >> >
> >> > > On 17 Oct 2020, at 21:18 , Martin <rdev using mb706.com> wrote:
> >> > >
> >> > > The "Summary" group generics always throw errors for a data.frame with zero rows, for example:
> >> > >> sum(data.frame(x = numeric(0)))
> >> > > #> Error in FUN(X[[i]], ...) :
> >> > > #> only defined on a data frame with all numeric variables
> >> > > Same behaviour for min, max, any, all, ... . I believe this is inconsistent with what these methods do for other empty objects (vectors, matrices), where the return value is chosen to ensure transitivity: sum(numeric(0)) == 0.
> >> > >
> >> > > The reason for this is that the return type of as.matrix() for empty (no rows or no columns) data.frame objects is always a matrix of type "logical". The Summary method for data.frame, in turn, throws an error when the data.frame, converted to a matrix, is not of numeric type.
> >> > >
> >> > > I suggest two ways that make sum, min, max, ... more consistent. IMHO it would be fitting to implement both of these fixes, because they also make other things more consistent.
> >> > >
> >> > > 1. Make the return type of as.matrix() for zero-row data.frames consistent with the type that would have been returned, had the data.frame had more than zero rows. "as.matrix(data.frame(x = numeric(0)))" should then be numeric, if there is an empty "character" column the return matrix should be a character etc. This would make subsetting by row and conversion to matrix commute (except for row names sometimes):
> >> > >> all.equal(as.matrix(df[rows, , drop = FALSE]), as.matrix(df)[rows, , drop = FALSE])
> >> > > Furthermore, this change would make as.matrix.data.frame obey the documentation, which indicates that the coercion hierarchy is used for the return type.
> >> > >
> >> > > 2. Make the Summary.data.frame method accept data.frames that produce non-numeric matrices. Next to the main focus of this message, I believe it would e.g. be fitting to have any() and all() work on logical data.frame objects. The current behaviour is such that
> >> > >> any(data.frame(x = 1))
> >> > > #> [1] TRUE
> >> > > #> Warning message:
> >> > > #> In any(1, na.rm = FALSE) : coercing argument of type 'double' to logical
> >> > > and
> >> > >> any(data.frame(x = TRUE))
> >> > > #> Error in FUN(X[[i]], ...) :
> >> > > #> only defined on a data frame with all numeric variables
> >> > > So a numeric data.frame warns about implicit coercion, while a logical data.frame (which would not need coercion) does not work at all.
> >> > >
> >> > > (I feel more strongly about fixing 1. than 2., because I don't know the discussion that led to the behaviour described in 2.)
> >> > >
> >> > > Best,
> >> > > Martin
> >> > >
> >> > > ______________________________________________
> >> > > R-devel using r-project.org mailing list
> >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >
> >> > --
> >> > Peter Dalgaard, Professor,
> >> > Center for Statistics, Copenhagen Business School
> >> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> >> > Phone: (+45)38153501
> >> > Office: A 4.23
> >> > Email: pd.mes using cbs.dk Priv: PDalgd using gmail.com
> >> >
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319