[BioC] normalization of a microarray like dataframe and removing missing data by % missing
ALAN SMITH
alansmith2 at gmail.com
Tue Feb 13 22:10:50 CET 2007
Hello,
I have several questions about normalizing a large matrix of
intensity data (21269 rows x 72 columns; non-microarray data).
summary(MYdata) #### example of data, NOTE many NAs ########
       ID              a                  b                  c
 Min.   :    1   Min.   :    2003   Min.   :    2008   Min.   :    2001
 1st Qu.: 5318   1st Qu.:    4027   1st Qu.:    4155   1st Qu.:    4331
 Median :10635   Median :    7635   Median :    7570   Median :    8006
 Mean   :10635   Mean   :   57586   Mean   :   73246   Mean   :  101309
 3rd Qu.:15952   3rd Qu.:   17191   3rd Qu.:   18076   3rd Qu.:   18843
 Max.   :21269   Max.   :20335320   Max.   :30073282   Max.   :27649912
                 NA's   :   18323   NA's   :   18467   NA's   :   18471
##########################################################
What would be the best way to normalize or preprocess this type of
data (80%+ missing)? A log2 transformation creates nice, similarly
shaped distributions with different medians. Currently I do this to
normalize:
divide column data by (column median / minimum column median)
divide row data by (row median / minimum row median)
repeat both steps 1 more time
*the method I am using turns the distribution into a spike shape.*
Is the above method acceptable for future statistical
applications? Are there better normalization methods I can use?
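
A minimal sketch of the column/row median scaling described above, to make
the procedure concrete. It assumes "min column median" means the smallest of
the column medians (and likewise for rows). The toy matrix X and the helper
name scale_pass are hypothetical and stand in for the real 21269 x 72 data.

```r
# Toy stand-in for the real matrix; values are invented for illustration.
set.seed(1)
X <- matrix(2^rnorm(200, mean = 13, sd = 2), nrow = 20, ncol = 10)
X[sample(length(X), 60)] <- NA

# One pass: scale columns, then rows, so every median matches the smallest one.
scale_pass <- function(m) {
  col_med <- apply(m, 2, median, na.rm = TRUE)
  m <- sweep(m, 2, col_med / min(col_med, na.rm = TRUE), "/")
  row_med <- apply(m, 1, median, na.rm = TRUE)
  m <- sweep(m, 1, row_med / min(row_med, na.rm = TRUE), "/")
  m
}

# "Repeat 1 more time": apply the pass twice.
normalized <- scale_pass(scale_pass(X))
```

After the final row step every row has the same median, so comparing
apply(normalized, 1, median, na.rm = TRUE) before and after is a quick check
of what the scaling has done to the data.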
NA question
I would like to remove all of the rows with more than 30% missing
values before continuing with normalization, but I cannot figure out
how. Is there a way to remove all rows that have, say, more than 30%
missing data? If I could just count the number of NAs in a row and
divide it by the number of columns I would be in good shape, but I
can't figure out how to do this. na.omit is too harsh and removes most
of the rows.
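
The counting idea above (number of NAs in a row divided by the number of
columns) can be sketched with is.na() and rowMeans(). The small data frame
here is invented for illustration, including the extra column d; it assumes
the ID column should be left out of the count.

```r
# Toy data frame standing in for MYdata; values are made up.
MYdata <- data.frame(
  ID = 1:4,
  a  = c(2003, 4027, NA,   NA),
  b  = c(2008, 4155, NA,   NA),
  c  = c(2001, NA,   8006, NA),
  d  = c(2500, 4331, 7635, NA)
)

vals <- MYdata[, -1]               # drop the ID column before counting
frac_na <- rowMeans(is.na(vals))   # fraction of NAs per row
# equivalent: rowSums(is.na(vals)) / ncol(vals)

kept <- MYdata[frac_na <= 0.30, ]  # keep rows with at most 30% missing
```

Here rows 1 and 2 (0% and 25% missing) are kept, while rows 3 and 4 (50% and
100% missing) are dropped. The 0.30 cutoff is just the threshold mentioned
above and can be changed freely.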
Thanks much,
Alan