[Rd] Inconsistent handling of data frames in min(), max(), and mean()

Fri Aug 22 20:51:15 CEST 2014

Thanks Martin

(sorry about the HTML - GMail and my incompetent use of it; hopefully I've
beaten it into submission this time).

I can see the point of view; however, the inconsistency remains. The
question is whether one patches the other summary stat functions to work as
if given a matrix, or squashes the Summary.data.frame methods as well.

More comments in-line

On 22 August 2014 02:23, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

> >>>>> Gavin Simpson <ucfagls at gmail.com>
> >>>>>     on Thu, 21 Aug 2014 12:32:31 -0600 writes:
>
 <snip/>

>     >> mean(df)
>     > [1] NA
>     > Warning message:
>     > In mean.default(df) : argument is not numeric or logical: returning NA
>
> I would tend to agree (:-) that mean() should rather give an error here
> (and read on).
>
>     > I recall the times where `mean(df)` would give
>     > `colMeans(df)` and this behaviour was deemed
>     > inconsistent.
>
>     > It seems though that the change has removed one
>     > inconsistency and replaced it with another.
>
> The whole idea of removing the mean method for data frames was
> that there are many more summary functions, e.g. median, and it
> seems wrong to write a data frame method for each of them; then
> why for *some* of them.
> So we *did* keep the  Summary.data.frame  group method,
> and that's why min(), max(), sum(),.. work  {though sum() will be
> slightly slower than colSums()}.
>

and gives a different answer, unless you meant sum(colSums(df)) == sum(df)?
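
To spell out the difference, a small sketch with a made-up all-numeric data
frame (the name `df` and its values are only for illustration):

```r
# Toy data frame with all-numeric columns (illustrative values only)
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

total_all  <- sum(df)      # Summary.data.frame: a single total over every cell
total_cols <- colSums(df)  # one total per column, returned as a named vector

total_all
total_cols
sum(total_cols) == total_all  # the two agree only after summing again
```

So `sum(df)` and `colSums(df)` are not interchangeable: the first returns one
number, the second one number per column.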

> When teaching R, the audience should learn to use  apply() or
> similar functions, e.g. from the hadleyverse,
> because that is the general approach of dealing with matrix-like
> objects, which is indeed how I think users should start thinking
> of data frames.

This actually came up because someone wanted the mean over all columns of a
dataset in which rows were patients and columns were repeated measures per
patient. `apply()` is not really suitable for that, so we've switched the
example to do `mean(as.matrix(df))` to get what they wanted.
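
Roughly what that example looks like, assuming a hypothetical layout with one
row per patient and one column per visit (the column names and values are
invented here):

```r
# Hypothetical repeated-measures data: rows are patients, columns are visits
df <- data.frame(visit1 = c(5.1, 4.8, 6.0),
                 visit2 = c(5.4, 4.9, 6.2),
                 visit3 = c(5.0, 5.1, 5.9))

grand_mean <- mean(as.matrix(df))  # mean over every cell of the data frame
grand_mean

# For comparison: apply() summarises along one margin, not over all cells
per_visit   <- apply(df, 2, mean)  # one mean per column
per_patient <- rowMeans(df)        # one mean per row
```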

I wasn't suggesting having `mean()` do anything like `colMeans()` or the
`mean.data.frame` of old.

I was wondering why we couldn't gain some semblance of consistency by
making *all* of these related functions (although I didn't mention them all)
work on a data frame (with all numeric columns) as if it were a matrix, just
like `min()`, `max()`, `range()` etc. do now.
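
For reference, a sketch of the current situation on a toy data frame (values
invented for illustration). The Summary group members accept the data frame
directly; for the others I'm showing the explicit `as.matrix()` route, which
is the behaviour I'd like them to share:

```r
# Toy all-numeric data frame (illustrative values only)
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Members of the Summary group treat the data frame like a matrix
lo <- min(df)
hi <- max(df)
rg <- range(df)

# Other summary statistics need the conversion done by hand
med <- median(as.matrix(df))
s   <- sd(as.matrix(df))
```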

>     > Am I missing good reasons why there couldn't be a
>     > `mean.data.frame()` method which worked like `max()` etc
>     > when given a data frame?
> yes, see above.
> [ There's no consistent end after that: Why is median() different, why
>  would sd(), var(), ... not work ?]

I don't see why they shouldn't if `max()` etc work *for an entirely numeric
data frame*.
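
The "entirely numeric" part matters: as far as I can tell the existing group
method refuses anything else. A quick sketch with an invented mixed data frame
(the exact wording of the error may differ between R versions, so I only check
that an error is raised):

```r
# Invented data frame with one numeric and one non-numeric column
mixed <- data.frame(a = c(1, 2, 3), g = c("x", "y", "z"))

# Capture the condition instead of stopping the script
res <- tryCatch(max(mixed), error = function(e) e)
failed <- inherits(res, "error")
failed
if (failed) conditionMessage(res)
```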

>     >  Namely that they return the
>     > required statistic *only* when presented with a data frame
>     > of all numeric variables? E.g.
>
<snip />

>     > I just can't see the sense in having `mean` work the way
>     > it does now?
>
> I agree. It would be better to give an error.
> E.g.,  mean.default could start with
>
>     if(is.object(x))
>        stop("there is no mean() method for ", class(x)[1], " objects")

That would give a nicer error message but wouldn't solve the deeper issue
of a lack of consistency, which *is* an issue for people when trying to
learn R.
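
To make that concrete, a rough sketch of what the check would look like from
the user's side. `mean_strict()` is a made-up wrapper so that base R is left
untouched; only the `is.object()` test and the message come from Martin's
suggestion:

```r
# Hypothetical wrapper for illustration only; it bypasses S3 dispatch, so
# classes that do have their own mean() method would be rejected here too.
mean_strict <- function(x, ...) {
  if (is.object(x))
    stop("there is no mean() method for ", class(x)[1], " objects")
  mean.default(x, ...)
}

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# A data frame now fails loudly instead of returning NA with a warning
msg <- tryCatch(mean_strict(df), error = function(e) conditionMessage(e))
msg

# Plain numeric vectors are unaffected
plain <- mean_strict(c(1, 2, 3))
```

The user gets a clearer message, but `max(df)` still works while `mean(df)`
does not.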

So, can't we either kill off the summary group method for data frames or
identify a set of functions which should work similarly to the existing
summary group method members? That assumes a patch, with documentation,
would be forthcoming rather than relying on R Core to do this manually.
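
As a very rough sketch of the second option, and not a proposed patch:
`summary_as_matrix()` below is a hypothetical helper that applies the same
"all numeric or logical columns" guard and then hands the data to any summary
function as a matrix.

```r
# Hypothetical helper, not part of base R: apply FUN to an all-numeric
# data frame as if it were a matrix, and refuse anything else.
summary_as_matrix <- function(df, FUN, ...) {
  stopifnot(is.data.frame(df))
  ok <- vapply(df, function(col) is.numeric(col) || is.logical(col),
               logical(1))
  if (!all(ok))
    stop("only defined on a data frame with all numeric or logical columns")
  FUN(as.matrix(df), ...)
}

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

m   <- summary_as_matrix(df, mean)
med <- summary_as_matrix(df, median)
s   <- summary_as_matrix(df, sd)
```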

>     > Thanks,
>     > Gavin
>
>     > --
>
>     > Gavin Simpson, PhD
>
>     >   [[alternative HTML version deleted]]
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  ( hmmm... and that on R-devel ... )
>

Yeah, sorry. Hopefully fixed now!

G

-- 
Gavin Simpson, PhD
