[BioC] normalization of a microarray like dataframe and removing missing data by % missing
michael watson (IAH-C)
michael.watson at bbsrc.ac.uk
Tue Feb 13 22:59:41 CET 2007
Here is a quick and dirty example of subsetting rows by their proportion of NAs:
# create some dummy data containing missing values:
# log() of a negative number gives NaN (with a warning), which is.na() counts as missing
mat <- matrix(rnorm(50), nrow=10, ncol=5)
lmat <- suppressWarnings(log(mat))
# is.na() returns a logical matrix with the same shape as the data;
# rowMeans() treats TRUE as 1, so this is the fraction of NAs in each row
na.frac <- rowMeans(is.na(lmat))
# if this is less than or equal to 0.3, we like the row
rows.want <- na.frac <= 0.3
# subset the data, keeping a matrix even if only one row survives
lmat[rows.want, , drop=FALSE]
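The same idea can be wrapped in a small function that also works on a data frame with an identifier column, as in the question below. This is only a sketch: the name `filter.rows.by.na`, its arguments, and the toy data frame are made up for illustration and are not part of the original data.

```r
# Hypothetical helper (illustrative name and arguments):
# keep rows whose fraction of missing values is at most max.frac.
# skip.cols lists columns (e.g. an ID column) to ignore when counting NAs.
filter.rows.by.na <- function(x, max.frac = 0.3, skip.cols = integer(0)) {
  vals <- if (length(skip.cols) > 0) x[, -skip.cols, drop = FALSE] else x
  na.frac <- rowMeans(is.na(vals))       # fraction of NAs per row
  x[na.frac <= max.frac, , drop = FALSE]
}

# Made-up data frame standing in for MYdata: an ID plus three intensity columns
toy.df <- data.frame(
  ID = 1:4,
  a  = c(2003, NA, 4027, NA),
  b  = c(2008, NA, NA, 7570),
  c  = c(2001, NA, 4331, 8006)
)

# rows 2, 3 and 4 each have at least one third of their values missing
kept <- filter.rows.by.na(toy.df, max.frac = 0.3, skip.cols = 1)
kept
```

Raising `max.frac` loosens the filter, so the threshold can be tuned without touching the counting logic.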
________________________________
From: bioconductor-bounces at stat.math.ethz.ch on behalf of ALAN SMITH
Sent: Tue 13/02/2007 9:10 PM
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] normalization of a microarray like dataframe and removing missing data by % missing
Hello,
I have several questions about data normalization of a large matrix of
intensity data (21269 rows by 72 columns; non-microarray data).
summary(MYdata) #### example of data NOTE many NAs ########
       ID              a                  b                  c
 Min.   :    1   Min.   :    2003   Min.   :    2008   Min.   :    2001
 1st Qu.: 5318   1st Qu.:    4027   1st Qu.:    4155   1st Qu.:    4331
 Median :10635   Median :    7635   Median :    7570   Median :    8006
 Mean   :10635   Mean   :   57586   Mean   :   73246   Mean   :  101309
 3rd Qu.:15952   3rd Qu.:   17191   3rd Qu.:   18076   3rd Qu.:   18843
 Max.   :21269   Max.   :20335320   Max.   :30073282   Max.   :27649912
                 NA's   :   18323   NA's   :   18467   NA's   :   18471
##########################################################
What would be the best way to normalize or preprocess this type of
data (80%+ missing)? A log2 transformation creates nice, similarly
shaped distributions with different medians. Currently I do this to
normalize:
  1. divide column data by (column median / min column median)
  2. divide row data by (row median / min row median)
  3. repeat 1 more time
*The method I am using turns the distribution into a spike shape.*
Is the above method acceptable for future statistical
applications? Are there better normalization methods I can use?
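For reference, the column/row median scaling described above could be sketched in R roughly as follows. This is an illustration under assumptions, not the poster's actual code: the function name `median.scale` and its `passes` argument are made up, medians are taken with `na.rm = TRUE`, and the input is assumed to be a numeric matrix of positive intensities with no row or column that is entirely NA. It says nothing about whether the method is statistically appropriate.

```r
# Hypothetical helper (illustrative name and arguments).
# Each pass scales columns to a common median, then rows to a common median.
median.scale <- function(x, passes = 2) {
  stopifnot(is.matrix(x), is.numeric(x))
  for (i in seq_len(passes)) {
    # divide each column by (column median / smallest column median)
    col.med <- apply(x, 2, median, na.rm = TRUE)
    x <- sweep(x, 2, col.med / min(col.med, na.rm = TRUE), "/")
    # divide each row by (row median / smallest row median)
    row.med <- apply(x, 1, median, na.rm = TRUE)
    x <- sweep(x, 1, row.med / min(row.med, na.rm = TRUE), "/")
  }
  x
}

# Made-up toy matrix with one missing value
toy.mat <- matrix(c(10, 20, 30, 40,
                    NA, 60, 70, 80,
                    90, 100, 110, 120), nrow = 4)
scaled <- median.scale(toy.mat, passes = 2)

# after the final row step every row shares the same median
apply(scaled, 1, median, na.rm = TRUE)
```

`passes = 2` corresponds to doing the two steps and then repeating them once more. Because the row step comes last, row medians end up identical while column medians are only approximately equal.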
NA question
I would like to remove all of the rows with more than 30% missing
values before continuing with normalization, but I cannot figure out
how. Is there a way to remove all rows that have, say, more than 30%
missing data? If I could just count the number of NAs in a row and
divide it by the number of columns I would be in good shape, but I
can't figure out how to do this. na.omit is too harsh and removes most
of the rows.
Thanks much,
Alan