[R] sciplot question

Frank E Harrell Jr f.harrell at vanderbilt.edu
Mon May 25 21:06:18 CEST 2009

spencerg wrote:
> Frank E Harrell Jr wrote:
>> spencerg wrote:
>>> Dear Frank, et al.:
>>> Frank E Harrell Jr wrote:
>>>> <snip>
>>>> Yes; I do see a normal distribution about once every 10 years.
>>>      To what do you attribute the nonnormality you see in most cases?
>>>           (1) Unmodeled components of variance that can generate 
>>> errors in interpretation if ignored, even with bootstrapping?
>>>           (2) Honest outliers that do not relate to the phenomena of 
>>> interest and would better be removed through improved checks on data 
>>> quality, but where bootstrapping is appropriate (provided the data 
>>> are not also contaminated with (1))?
>>>           (3) Situations where the physical application dictates a 
>>> different distribution such as binomial, lognormal, gamma, etc., 
>>> possibly also contaminated with (1) and (2)?
>>>      I've fit mixtures of normals to data before, but one needs to be 
>>> careful about not carrying that to extremes, as the mixture may be a 
>>> result of (1) and therefore not replicable.
>>>      George Box once remarked that he thought most designed 
>>> experiments included split plotting that had been ignored in the 
>>> analysis.  That is only a special case of (1).
>>>      Thanks,
>>>      Spencer Graves
>> Spencer,
>> Those are all important reasons for non-normality of marginal 
>> distributions.  But the biggest reason of all is that the underlying 
>> process did not know about the normal distribution.  Normality in raw 
>> data is usually an accident.
>      Frank:
>      Might there be a difference between the physical and social 
> sciences on this issue?

Hi Spencer,

I doubt that the difference is large, but biological measurements seem 
to be more of a problem.

>      The central limit effect works pretty well with many kinds of 
> manufacturing data, except that it is often masked by between-lot 
> components of variance.  The first differences in log(prices) are often 
> long-tailed and negatively skewed.  Standard GARCH and similar models 
> handle the long tails well but miss the skewness, at least in what I've 
> seen.  I think that can be fixed, but I have not yet seen it done.

The central limit theorem in and of itself doesn't help because it 
doesn't tell you how large N must be before normality holds well enough.
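
A rough simulation makes the point concrete. The sketch below is only an illustration under stated assumptions: it draws from a lognormal(0, 1) population (chosen arbitrarily as a skewed example), uses Python's standard library rather than R, and the function names are hypothetical. It estimates the skewness of the sampling distribution of the mean for several N and compares it with the exact value, which shrinks only like 1/sqrt(N).

```python
import math
import random
import statistics


def sample_skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)


def skewness_of_means(n, reps=4000, sigma=1.0, seed=1):
    """Skewness of `reps` simulated means, each from n lognormal draws.

    The lognormal(0, sigma) population is an assumption made for
    illustration; it is not taken from any data discussed here.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.lognormvariate(0.0, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return sample_skewness(means)


def theoretical_skewness_of_mean(n, sigma=1.0):
    """Exact skewness of the mean of n iid lognormal(0, sigma) values."""
    w = math.exp(sigma ** 2)
    # Population skewness is (w + 2) * sqrt(w - 1); averaging n iid
    # values divides it by sqrt(n).
    return (w + 2.0) * math.sqrt(w - 1.0) / math.sqrt(n)


if __name__ == "__main__":
    for n in (5, 25, 100):
        simulated = skewness_of_means(n)
        exact = theoretical_skewness_of_mean(n)
        print(f"N={n:4d}  simulated={simulated:6.2f}  exact={exact:6.2f}")
```

Under these assumptions the means are still visibly right-skewed at N = 100, so a normal approximation for the mean can remain poor at sample sizes often regarded as large. For such a heavy-tailed population the simulated skewness is itself a noisy estimate.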

>      Social science data, however, often involve discrete scales where 
> the raters' interpretations of the scales rarely match any standard 
> distribution.  Transforming to latent variables, e.g., via factor 
> analysis, may help but does not eliminate the problem.

Good example.  Many of the scales I've seen are non-normal or even 

>      Thanks for your comments.

Thanks for yours.

>      Spencer
>> Frank

Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
