[R] sample function

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Fri Mar 11 11:59:55 CET 2005


On 11-Mar-05 Martin C. Martin wrote:
> "hist" is lumping things together.
> 
> Try:
> sum(temp == 0)
> 
> compare to the height of the left most bar.
> 
> Is this a bug in hist?
> 
> - Martin

Well, not a bug strictly speaking since "it works as documented",
but I do think it's not necessarily a happy choice.

The unsuspecting (like Martin) will step into holes even after
reading "?hist", since the truths are rather deeply (and I think
somewhat obliquely) hidden. "?hist" leads you to look up
"?nclass.Sturges", which in turn only mentions "Sturges' formula"
and invites you to read V&R's MASS book and other references
in the hope of further clarification. That is all a bit much when
you just want to draw a histogram, which ought to be kid's
stuff! Not to mention the parameters "include.lowest" and
"right", whose combined effect is not too obvious.
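
As a small illustration of how "right" and "include.lowest" act
together, here is a sketch using a made-up vector x (not Martin's
data) and explicit integer breaks. It only compares the bin counts
under the two settings of "right"; "include.lowest" is left at its
default of TRUE throughout.

```r
# Hypothetical toy data: one 0, two 1s, one 2.
x <- c(0, 1, 1, 2)

# Defaults: right = TRUE, include.lowest = TRUE.
# The bins are [0,1] and (1,2], so the 0 is counted with the two 1s.
counts_right <- hist(x, breaks = 0:2, plot = FALSE)$counts

# right = FALSE: the bins become [0,1) and [1,2]; with
# include.lowest = TRUE it is now the last bin that is closed.
counts_left <- hist(x, breaks = 0:2, right = FALSE, plot = FALSE)$counts

print(counts_right)  # [1] 3 1
print(counts_left)   # [1] 1 3
```

So the same data and the same breaks give different bar heights,
depending only on which end of each interval is closed.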

I'd like to repeat the sort of hint I occasionally give:

In using R, if there's any doubt it is best to spell out exactly
what you want rather than expecting the functions to agree with
what you want. R functions are often more complex and subtle
than you might suspect.

In this particular case,

  hist(temp, breaks = -0.5 + (0:14))

will produce the sort of thing which is wanted. One could
interpret the results which Martin reported as due to a
sort of "confusion" (but on whose part -- R or Martin?)
over the fact that "hist" is designed to deal with
"continuous" values, while his sample consists of integers.

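The lumping can be checked numerically. The sketch below does not
use Martin's actual data: "temp" is simulated here from 0:12 as a
stand-in, and integer breaks 0:12 are given explicitly on the
assumption that this is roughly what the default rule chose for him.

```r
# Hypothetical stand-in for Martin's sample of integers from 0:12.
set.seed(1)
temp <- sample(0:12, 1000, replace = TRUE)

# Integer breaks: the first bin is [0,1], so it holds the 0s AND the 1s.
h_int <- hist(temp, breaks = 0:12, plot = FALSE)

# Half-integer breaks: each integer sits alone in the middle of its bin.
h_half <- hist(temp, breaks = -0.5 + (0:14), plot = FALSE)

# Counts of each value 0..13, for comparison (13 never occurs).
tab <- as.vector(table(factor(temp, levels = 0:13)))

print(h_int$counts[1] == sum(temp == 0) + sum(temp == 1))  # lumped
print(h_half$counts[1] == sum(temp == 0))                  # not lumped
print(all(h_half$counts == tab))
```

With the half-integer breaks the bar heights agree with the simple
counts of each value, which is what Martin was expecting to see.
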
For that particular case, one could also use "table" or
"barchart", as has been suggested by David Scott, which
would produce a plot of similar appearance; but this is
not in the "histogram family" despite appearances, since
it is not primarily a "quantitative" plot (i.e. respecting
the numerical values and their numerical comparisons), but
more a "category count". In particular, natural variants
of the above "hist" command such as

  hist(temp,breaks= -0.5+2*(0:7) )

(which corresponds to binning by different intervals) do
not lie so easily in the "table" or "barchart" domain.
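
To make the comparison concrete, here is a sketch (again with a
simulated "temp" standing in for Martin's data) showing that the
bins of width 2 come directly from "hist", whereas with "table" one
has to add up neighbouring categories by hand.

```r
# Hypothetical stand-in for Martin's sample of integers from 0:12.
set.seed(1)
temp <- sample(0:12, 1000, replace = TRUE)

# Bins of width 2: (-0.5,1.5], (1.5,3.5], ..., (11.5,13.5].
h2 <- hist(temp, breaks = -0.5 + 2 * (0:7), plot = FALSE)

# The same totals via "table": count each value 0..13, then
# sum the counts in consecutive pairs (0,1), (2,3), ..., (12,13).
tab <- as.vector(table(factor(temp, levels = 0:13)))
pair_sums <- colSums(matrix(tab, nrow = 2))

print(all(h2$counts == pair_sums))
```

The "table" route gets there, but only after the numerical grouping
has been done separately; "hist" treats it as a change of breaks.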

And I don't agree with David's comment that "No, hist
is the wrong thing to use to display this data."

In so far as these data are considered to be numerical
values of which one wants a view of their distribution,
then "hist" is entirely appropriate, as for any other
numerical variable. The only question is how to get
this to happen appropriately.

Would David make the same comment about data sampled
from (0:5000) instead of (0:12)?

Best wishes to all,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 11-Mar-05                                       Time: 10:59:55
------------------------------ XFMail ------------------------------



