[R] something missing in summary()

Douglas Bates bates at stat.wisc.edu
Fri Feb 16 14:58:01 CET 2007


On 2/16/07, Jari Oksanen <jarioksa at sun3.oulu.fi> wrote:
> Gerard Smits g_smits at verizon.net Fri Feb 16 00:46:09 CET 2007:
> > just noticed that two key pieces of information are not given by
> > the summary() command:  N and SD.  we are given the N missing, but
> > not the converse.  I know these summary value can be obtained easy,
> > but can't understand why these two pieces of information are not
> > provided with the other info.
> >
> I assume you mean summary.data.frame?

Given a data frame, df, I would use

str(df)

before

summary(df)

because I want to see, for example, which columns are factors or
ordered factors or ...  That information is present in the value of
summary(df) but in a more subtle way.  As pointed out below, the number
of rows in the data frame is the total number of observations for each
of the variables, so putting that information in the summary for each
variable is redundant.
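
As a rough sketch of the above, and of how the N and SD asked about in the original question might be obtained by hand: the data frame and its column names here are made up purely for illustration, and this is only one of several reasonable ways to do it.

```r
# Hypothetical data frame (names and values invented for illustration)
df <- data.frame(
  height = c(1.2, 3.4, NA, 5.6),
  group  = factor(c("a", "b", "a", "b")),
  grade  = factor(c("lo", "hi", "lo", "hi"),
                  levels = c("lo", "hi"), ordered = TRUE)
)

str(df)      # shows the class of each column (num, Factor, Ord.factor)
summary(df)  # per-column summaries; NA counts appear only where NAs exist

# Total number of rows: the same for every column, hence redundant per column
n_total <- nrow(df)

# Non-missing count per column (the "converse" of the NA count)
n_obs <- colSums(!is.na(df))

# Standard deviation of the numeric columns only, ignoring NAs
is_num <- sapply(df, is.numeric)
sds <- sapply(df[is_num], sd, na.rm = TRUE)

n_total
n_obs
sds
```

The non-missing count differs from nrow(df) only for columns that contain NAs, which is why summary() reports the NA count rather than N.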

> There has even been an "appeal" on this:
> http://tolstoy.newcastle.edu.au/R/help/06/02/20706.html
>
> However, I didn't find any petition you could sign (but I found many
> surprising petitions when googling on this). Perhaps somebody will set
> up a petition page some day.
>
> With time, I've learnt that if something obvious is missing in the base
> R, there is a reason. Probably the Core thinks that you shouldn't use sd
> in a summary, but it is a poor and misleading statistic (they neither
> have skewness and kurtosis). You may learn to live without sd if you
> survive over the first impact.

I don't think this was an explicit decision by R-core.  It was a case
of S compatibility, so the original decision was made at Bell Labs, and
that group was highly influenced by John Tukey, who worked with them. I
imagine that is why the summary of a numeric is a 'five-number'
summary plus the mean.  I would say the surprising and unconventional
part of that summary is the fact that it includes the mean.
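
To make the 'five-number summary plus the mean' concrete, here is a small sketch with made-up numbers.  Note that summary() on a numeric vector takes its quartiles from quantile(), whereas fivenum() returns Tukey's hinges, so the two can differ slightly for some sample sizes.

```r
# Made-up numeric vector with one large value
x <- c(2, 4, 4, 5, 7, 9, 32)

s <- summary(x)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s

fivenum(x)       # Tukey's five numbers: min, lower hinge, median,
                 # upper hinge, max
quantile(x)      # 0%, 25%, 50%, 75%, 100% quantiles

mean(x)          # the one entry of summary(x) that is not part of
                 # the five-number summary
```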

> On the other hand, there are things like R-squared and significance
> stars in summary.lm, which spoils the image of purity in the Core.

However, there is the option show.signif.stars, which can be set to
FALSE, as I always do.
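
For example, a minimal sketch using the built-in cars data set (the model itself is arbitrary and chosen only for illustration):

```r
# Arbitrary example model on a data set that ships with R
fit <- lm(dist ~ speed, data = cars)

# Turn the stars off for the session, keeping the old setting
old <- options(show.signif.stars = FALSE)
summary(fit)   # coefficient table is printed without stars
options(old)   # restore the previous setting

# Alternatively, suppress the stars for a single print call
print(summary(fit), signif.stars = FALSE)
```

Putting the options() call in a startup file such as .Rprofile would make the setting apply to every session.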

> Number of observations may not be very useful in summary.data.frame,
> because it varies so little among variables.
>
> The R-help message cited above and its follow-ups suggest some ways of
> locally modifying the code and maintaining the modifications over the
> upgrades of R.
>
> Best wishes, Jari Oksanen
