[R] sciplot question
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Mon May 25 21:06:18 CEST 2009
> Frank E Harrell Jr wrote:
>> spencerg wrote:
>>> Dear Frank, et al.:
>>> Frank E Harrell Jr wrote:
>>>> Yes; I do see a normal distribution about once every 10 years.
>>> To what do you attribute the nonnormality you see in most cases?
>>> (1) Unmodeled components of variance that can generate
>>> errors in interpretation if ignored, even with bootstrapping?
>>> (2) Honest outliers that do not relate to the phenomena of
>>> interest and would better be removed through improved checks on data
>>> quality, but where bootstrapping is appropriate (provided the data
>>> are not also contaminated with (1))?
>>> (3) Situations where the physical application dictates a
>>> different distribution such as binomial, lognormal, gamma, etc.,
>>> possibly also contaminated with (1) and (2)?
>>> I've fit mixtures of normals to data before, but one needs to be
>>> careful about not carrying that to extremes, as the mixture may be a
>>> result of (1) and therefore not replicable.
>>> George Box once remarked that he thought most designed
>>> experiments included split plotting that had been ignored in the
>>> analysis. That is only a special case of (1).
>>> Spencer Graves
>> Those are all important reasons for non-normality of marginal
>> distributions. But the biggest reason of all is that the underlying
>> process did not know about the normal distribution. Normality in raw
>> data is usually an accident.
> Might there be a difference between the physical and social
> sciences on this issue?
I doubt that the difference is large, but biological measurements seem
to be more of a problem.
> The central limit effect works pretty well with many kinds of
> manufacturing data, except that it is often masked by between-lot
> components of variance. The first differences in log(prices) are often
> long-tailed and negatively skewed. Standard GARCH and similar models
> handle the long tails well but miss the skewness, at least in what I've
> seen. I think that can be fixed, but I have not yet seen it done.
The central limit theorem in and of itself doesn't help because it
doesn't tell you how large N must be before normality holds well enough.
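
A minimal sketch of that point, under stated assumptions: the lognormal(0, 1) population, the sample sizes, and the helper names `skewness` and `skew_of_sample_means` are illustrative choices, not anything taken from this thread. For independent draws, the skewness of the sample mean shrinks only like 1/sqrt(N), so a strongly skewed population can leave the mean visibly non-normal at sample sizes often treated as "large".

```python
import random
import statistics


def skewness(xs):
    """Third standardized moment of xs (population form, no bias correction)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def skew_of_sample_means(n, reps=2000, seed=1):
    """Simulate `reps` sample means of size n and return their skewness.

    Assumption: the population is lognormal(0, 1), chosen only because it
    is strongly right-skewed.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.lognormvariate(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return skewness(means)


# Skewness of the simulated sampling distribution of the mean, by N.
for n in (5, 30, 200):
    print(n, round(skew_of_sample_means(n), 2))
```

How much residual skewness is tolerable depends on the use; the simulation only shows how slowly it decays for one assumed population.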
> Social science data, however, often involve discrete scales where
> the raters' interpretations of the scales rarely match any standard
> distribution. Transforming to latent variables, e.g., via factor
> analysis, may help but does not eliminate the problem.
Good example. Many of the scales I've seen are non-normal or even
> Thanks for your comments.
Thanks for yours
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University