[Rd] Inconsistent handling of data frames in min(), max(), and mean()
Gavin Simpson
ucfagls at gmail.com
Fri Aug 22 20:51:15 CEST 2014
Thanks Martin
(sorry about the HTML - GMail and my incompetent use of it; hopefully I've
beaten it into submission this time).
I can see that point of view; however, the inconsistency remains, and the
question is whether one patches the other summary stat functions to work as
if given a matrix or squashes all the Summary.data.frame methods as well.
More comments in-line
On 22 August 2014 02:23, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> >>>>> Gavin Simpson <ucfagls at gmail.com>
> >>>>> on Thu, 21 Aug 2014 12:32:31 -0600 writes:
>
<snip/>
> > > mean(df)
> > [1] NA
> > Warning message:
> > In mean.default(df) : argument is not numeric or logical: returning NA
>
> I would tend to agree (:-) that mean() should rather give an error here
> (and read on).
>
> > I recall the times where `mean(df)` would give
> > `colMeans(df)` and this behaviour was deemed
> > inconsistent.
>
> > It seems though that the change has removed one
> > inconsistency and replaced it with another.
>
> The whole idea of removing the mean method for data frames was
> that there are many more summary functions, e.g. median, and it
> seems wrong to write a data frame method for each of them; then
> why for *some* of them.
> So we *did* keep the Summary.data.frame group method,
> and that's why min(), max(), sum(),.. work {though sum() will be
> slightly slower than colSums()}.
>
...and it gives a different answer (a single total rather than one sum per
column), unless you meant `sum(colSums(df)) == sum(df)`?
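
To make the difference concrete, here is a minimal sketch. The data frame `df` and its values are made up for illustration; they are not the data from the original report.

```r
# Hypothetical all-numeric data frame, for illustration only
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Summary group method: the data frame is treated as one block of numbers
total <- sum(df)

# colSums() keeps the columns separate and returns a named vector
per_column <- colSums(df)

# The two agree only once the column sums are themselves summed
agree <- isTRUE(all.equal(sum(per_column), total))
```

So `sum(df)` is a single number, while `colSums(df)` has one element per column.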
> When teaching R, the audience should learn to use apply() or
> similar functions, e.g. from the hadleyverse,
> because that is the general approach of dealing with matrix-like
> objects that is indeed how I think users should start thinking
> of data frames.
This actually came up because someone wanted the mean over all columns of a
dataset in which the columns were repeated measures and the rows were
patients. `apply()` is not really suitable for that, so we've switched the
example to use `mean(as.matrix(df))` to get what they wanted.
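
A small sketch of that workaround, assuming a made-up repeated-measures table (the column names and values are hypothetical):

```r
# Hypothetical data: one row per patient, one column per repeated measure
df <- data.frame(visit1 = c(5.1, 4.8, 6.0),
                 visit2 = c(5.4, 5.0, 6.3),
                 visit3 = c(5.0, 4.9, 6.1))

# Mean over every measurement in the table: convert to a matrix first
grand_mean <- mean(as.matrix(df))

# An equivalent route: unlist() flattens the numeric columns into one vector
grand_mean_alt <- mean(unlist(df))
```

Both routes give the same single number when every column is numeric.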
I wasn't suggesting having `mean()` do anything like `colMeans()` or the
`mean.data.frame` of old.
I was wondering why we couldn't gain some semblance of consistency by
making *all* of these related functions (although I didn't list them) work
on a data frame with all numeric columns as if it were a matrix, just as
`min()`, `max()`, `range()` etc. do now.
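
For reference, a minimal sketch of the contrast being described, using a made-up data frame. The `mean()` result is the behaviour reported earlier in this thread and may differ in other R versions.

```r
# Hypothetical all-numeric data frame, for illustration only
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Members of the Summary group treat the data frame like a numeric matrix
lo  <- min(df)
hi  <- max(df)
rng <- range(df)

# mean() is not in the Summary group, so the call reaches mean.default(),
# which returns NA (with a warning, suppressed here) for a data frame
m <- suppressWarnings(mean(df))
```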
> > Am I missing good reasons why there couldn't be a
> > `mean.data.frame()` method which worked like `max()` etc
> > when given a data frame?
>
> yes, see above.
> [ There's no consistent end after that: Why is median() different, why
> would sd(), var(), ... not work ?]
I don't see why they shouldn't if `max()` etc work *for an entirely numeric
data frame*.
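
As a rough sketch of what "work as if given a matrix" could mean, here is a helper that is not part of base R. The name `summarise_numeric_df` and its error message are assumptions made up for this example, not a proposed patch.

```r
# Hypothetical helper, not part of base R: apply a whole-object summary
# function to a data frame only when every column is numeric
summarise_numeric_df <- function(x, FUN, ...) {
  stopifnot(is.data.frame(x))
  if (!all(vapply(x, is.numeric, logical(1)))) {
    stop("only defined on a data frame with all numeric columns")
  }
  # Treat the data frame as one numeric matrix, as min() and max() do
  FUN(as.matrix(x), ...)
}

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
overall_mean   <- summarise_numeric_df(df, mean)
overall_median <- summarise_numeric_df(df, median)
```

A data frame with a factor or character column would raise the error instead of quietly returning `NA`.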
> > Namely that they return the
> > required statistic *only* when presented with a data frame
> > of all numeric variables? E.g.
>
<snip />
> > I just can't see the sense in having `mean` work the way
> > it does now?
>
> I agree. It would be better to give an error.
> E.g., mean.default could start with
>
>     if(is.object(x))
>         stop("there is no mean() method for ", class(x)[1], " objects")
That would give a nicer error message but wouldn't solve the deeper issue
of a lack of consistency, which *is* an issue for people when trying to
learn R.
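
To see what that guard would do without touching base R, here is a sketch using a stand-in generic. `my_mean` is a hypothetical name; only the `is.object()` check and the error message come from the suggestion quoted above.

```r
# Hypothetical generic standing in for mean(), so base R is left untouched
my_mean <- function(x, ...) UseMethod("my_mean")

my_mean.default <- function(x, ...) {
  # Suggested guard: classed objects without their own method get an error
  if (is.object(x))
    stop("there is no mean() method for ", class(x)[1], " objects")
  mean(x, ...)
}

df <- data.frame(a = 1:3)

# Capture the error message rather than stopping the script
msg <- tryCatch(my_mean(df), error = function(e) conditionMessage(e))

# Plain numeric vectors are unaffected by the guard
plain <- my_mean(c(1, 2, 3))
```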
So, can't we either kill off the summary group method for data frames or
identify a set of functions which should work similarly to the existing
summary group method members? That assumes a patch, with documentation,
would be forthcoming rather than relying on R Core to do this manually.
> > Thanks,
> > Gavin
>
> > --
>
> > Gavin Simpson, PhD
>
> > [[alternative HTML version deleted]]
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ( hmmm... and that on R-devel ... )
>
Yeah, sorry. Hopefully fixed now!
G
--
Gavin Simpson, PhD