[R] sciplot question

Mon May 25 21:06:18 CEST 2009

spencerg wrote:
> Frank E Harrell Jr wrote:
>> spencerg wrote:
>>> Dear Frank, et al.:
>>>
>>> Frank E Harrell Jr wrote:
>>>> <snip>
>>>> Yes; I do see a normal distribution about once every 10 years.
>>>
>>>      To what do you attribute the nonnormality you see in most cases?
>>>           (1) Unmodeled components of variance that can generate 
>>> errors in interpretation if ignored, even with bootstrapping?
>>>
>>>           (2) Honest outliers that do not relate to the phenomena of 
>>> interest and would better be removed through improved checks on data 
>>> quality, but where bootstrapping is appropriate (provided the data 
>>> are not also contaminated with (1))?
>>>
>>>           (3) Situations where the physical application dictates a 
>>> different distribution such as binomial, lognormal, gamma, etc., 
>>> possibly also contaminated with (1) and (2)?
>>>
>>>      I've fit mixtures of normals to data before, but one needs to be 
>>> careful about not carrying that to extremes, as the mixture may be a 
>>> result of (1) and therefore not replicable.
>>>
>>>      George Box once remarked that he thought most designed 
>>> experiments included split plotting that had been ignored in the 
>>> analysis.  That is only a special case of (1).
>>>
>>>      Thanks,
>>>      Spencer Graves
>>
>> Spencer,
>>
>> Those are all important reasons for non-normality of marginal 
>> distributions.  But the biggest reason of all is that the underlying 
>> process did not know about the normal distribution.  Normality in raw 
>> data is usually an accident.
> 
>      Frank:
> 
>      Might there be a difference between the physical and social 
> sciences on this issue?

Hi Spencer,

I doubt that the difference is large, but biological measurements seem 
to be more of a problem.

> 
>      The central limit effect works pretty well with many kinds of 
> manufacturing data, except that it is often masked by between-lot 
> components of variance.  The first differences in log(prices) are often 
> long-tailed and negatively skewed.  Standard GARCH and similar models 
> handle the long tails well but miss the skewness, at least in what I've 
> seen.  I think that can be fixed, but I have not yet seen it done.

The central limit theorem in and of itself doesn't help because it 
doesn't tell you how large N must be before normality holds well enough.
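
As a rough illustration of that point, here is a minimal simulation sketch, 
not a definitive analysis.  It assumes a lognormal(0, 1) parent distribution, 
an arbitrary choice, and the function names (skewness, skew_of_sample_means) 
are made up for this example.  It estimates how skewed the sampling 
distribution of the mean still is at a few sample sizes:

```python
import random
import statistics


def skewness(xs):
    """Moment-based sample skewness (population form, no bias correction)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3)


def skew_of_sample_means(n, reps=2000, seed=1):
    """Skewness of the means of `reps` samples of size `n`.

    Assumption: the parent distribution is lognormal(0, 1), chosen only
    because it is strongly right-skewed.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.lognormvariate(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return skewness(means)


# Skewness near 0 is what a normal sampling distribution would give.
results = {n: skew_of_sample_means(n) for n in (5, 30, 200)}
for n, s in results.items():
    print(f"N = {n:4d}   skewness of sample means = {s:.2f}")
```

The skewness shrinks as N grows, but how fast depends entirely on the 
parent distribution, which is exactly what the theorem leaves unspecified.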

> 
>      Social science data, however, often involve discrete scales where 
> the raters' interpretations of the scales rarely match any standard 
> distribution.  Transforming to latent variables, e.g., via factor 
> analysis, may help but does not eliminate the problem.

Good example.  Many of the scales I've seen are non-normal or even 
multi-modal.

> 
>      Thanks for your comments.

Thanks for yours,
Frank

>      Spencer
>>
>> Frank
>>
> 
> 

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University