[R] Normalization and missing values

Berton Gunter gunter.berton at gene.com
Wed Apr 13 19:07:56 CEST 2005


Normalization: see ?scale or, more usually, use an argument of the clustering
function itself (see package "cluster", where "stand" is the argument in the
various functions; other packages may have similar capabilities).
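
As a minimal sketch of the two routes just mentioned (the data values and
column names below are made up for illustration): scale() centers each column
and divides it by its standard deviation, while stand = TRUE in pam() asks the
function to standardize internally. Note that, per the "cluster"
documentation, stand = TRUE divides by the mean absolute deviation rather than
the standard deviation, so the two routes are similar in spirit but not
numerically identical.

```r
library(cluster)  # recommended package; assumed to be installed

# Toy data: two variables on very different scales (hypothetical values).
x <- cbind(height_cm = c(150, 160, 170, 180),
           income    = c(20000, 45000, 30000, 80000))

# Route 1: standardize explicitly, then compute the distance matrix.
z <- scale(x)        # column means become 0, column sds become 1
colMeans(z)
apply(z, 2, sd)
d <- dist(z)         # Euclidean distances on the standardized data

# Route 2: let the clustering function standardize via its 'stand' argument.
fit <- pam(x, k = 2, stand = TRUE)
fit$clustering
```

Without some form of standardization, the income column would dominate the
distances simply because its numbers are larger.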

Missing Values: A HUGE and COMPLEX issue. One Reference: ANALYSIS OF
INCOMPLETE MULTIVARIATE DATA by J.L. Schafer (Chapman and Hall); Donald
Rubin has published several books and many papers on this, so anything by
him is another good resource.

Setting missings to 0 will clearly produce nonsense: two cases with lots of
missings in corresponding coordinates will cluster together when there is no
reason for them to do so. Set them to NA instead. Since some clustering
routines work only with complete cases, this might leave you with a data set
of size 0, so you need clustering methods that can work with missing data,
e.g. pam, clara, etc. in package "cluster". Of course, one doesn't quite know
what to make of two cases that are deemed "close" on the basis of, say, 10%
shared nonmissing coordinates, as compared to cases that are close based on
all coordinates. You can't expect statistical procedures to rescue you from
poor data.
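
The point about zeros versus NA can be illustrated with a small sketch. The
four cases below are invented, already-standardized values chosen only to
make the effect visible: cases 1 and 2 are missing the same four coordinates
and disagree on the one they share. The code relies on the documented
behaviour of dist(), which drops missing coordinates pairwise and rescales
the sum, and on pam() accepting NAs as long as every pair of cases shares at
least one observed coordinate.

```r
library(cluster)  # recommended package; assumed to be installed

# Hypothetical standardized data; rows are cases, columns are coordinates.
x <- rbind(c( 1.0,  NA,  NA,  NA,  NA),
           c(-1.0,  NA,  NA,  NA,  NA),
           c( 1.1,   2,  -2,   2,  -2),
           c(-1.1,  -2,   2,  -2,   2))

# Zero-filling: the shared "0" coordinates pull cases 1 and 2 together.
x0 <- x
x0[is.na(x0)] <- 0
d0 <- as.matrix(dist(x0))
d0[1, 2]; d0[1, 3]     # case 1 looks closer to case 2 than to case 3

# Keeping NA: only jointly observed coordinates are compared.
dna <- as.matrix(dist(x))
dna[1, 2]; dna[1, 3]   # now case 1 is closer to case 3

fit0  <- pam(x0, k = 2)   # clustering after zero-filling
fitna <- pam(x,  k = 2)   # clustering with NAs left in place
fit0$clustering
fitna$clustering

# How many coordinates each pair of cases actually shares.
obs    <- ifelse(is.na(x), 0, 1)
shared <- tcrossprod(obs)
shared
```

The "shared" matrix is a quick way to see how thin the evidence behind a
given distance is: here cases 1 and 3 are "close" on the strength of a single
coordinate, which is exactly the interpretive problem described above.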


-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
 
 

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Chris 
> Bergstresser
> Sent: Wednesday, April 13, 2005 9:37 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Normalization and missing values
> 
> Hi all --
> 
>     I've got a large dataset which consists of a bunch of different
> scales, and I'm preparing to perform a cluster analysis.  I need to
> normalize the data so I can calculate the difference matrix.
>     First, I didn't see a function in R which does normalization -- did
> I miss it?  What's the best way to do it?
>     Second, what's the best way to deal with missing values?  Obviously,
> I could just set them to 0 (the mean of the normalized scales), but I'm
> not sure that's the best way.
> 
> -- Chris
> 



