[BioC] Filtering Affymetrix data towards class discovery

Tan, MinHan MinHan.Tan at vai.org
Sat Mar 27 19:49:33 CET 2004


Good afternoon,
 
I have a question about the best strategy for filtering Affymetrix data (human tumor tissue) when the goal is class discovery. (This does not seem to have been addressed directly in the archive.)
 
Since we are not correlating expression with any clinical outcomes or markers, I would not filter on association with any of these indices.
 
A recent paper in PNAS on class discovery of tumor tissue subtypes (spotted cDNA arrays) used the following strategy for filtering: "Full sample set using genes well measured in > 75% of samples and variably expressed > 3-fold from the mean in at least two samples (5,153 genes)."
 
Considering this strategy for Affy data: there are no NAs, so the first criterion ("well measured in > 75% of samples") does not seem necessary.
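
For concreteness, here is a minimal sketch of how the second criterion from the quoted paper could be applied to log2-scale (e.g. RMA) values. It assumes that "3-fold from the mean" means a deviation of at least log2(3) from the gene's mean log2 expression; the function name and the toy data are hypothetical and are not taken from the paper.

```python
import math

def passes_fold_filter(log2_values, fold=3.0, min_samples=2):
    """Return True if at least `min_samples` values lie at least `fold`-fold
    away from the gene's mean, with everything measured on the log2 scale.

    Assumption: the mean is taken over log2 values (a geometric mean on the
    raw intensity scale)."""
    mean = sum(log2_values) / len(log2_values)
    threshold = math.log2(fold)  # 3-fold is about 1.585 log2 units
    n_far = sum(1 for v in log2_values if abs(v - mean) >= threshold)
    return n_far >= min_samples

# Hypothetical toy genes (log2 expression across samples)
flat_gene = [8.0, 8.1, 7.9, 8.0, 8.2, 7.8]
subcluster_gene = [7.0] * 10 + [10.0, 10.0]  # two samples stand out

flat_result = passes_fold_filter(flat_gene)
subcluster_result = passes_fold_filter(subcluster_gene)
```

In this toy case the gene with two outlying samples is retained and the flat gene is dropped, because the criterion counts individual samples rather than summarizing the whole set.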
 
Would it make sense to use the second filter ("variably expressed > 3-fold from the mean in at least 2 samples") for RMA-normalized data, or would it be too noisy? (I suspect it would be too noisy for otherwise unfiltered MAS 5.0 data at low intensities.) I have been filtering Affy data on the coefficient of variation (sd/mean), combined with a requirement of at least 2 samples with an RMA expression value of 8 (2^8 = 256) or more, but I am not sure how best to validate such an approach. I am particularly concerned that the CV is a single value per gene, computed across the whole sample set, so I may not be able to capture small subclusters, especially with a large number of samples.
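
A minimal sketch of the CV-plus-minimum-expression filter described above, to illustrate the subcluster concern. The CV cutoff of 0.1, the choice to compute the CV on log2 values, and the toy data are all assumptions made for illustration, not values from my actual analysis.

```python
import statistics

def passes_cv_filter(log2_values, cv_cutoff=0.1, min_expr=8.0, min_samples=2):
    """Keep a gene if its coefficient of variation (sd/mean) reaches
    `cv_cutoff` and at least `min_samples` samples reach `min_expr`.

    Assumption: values are log2 expression (e.g. RMA) and strictly positive,
    so the mean is never zero."""
    mean = statistics.mean(log2_values)
    sd = statistics.stdev(log2_values)  # sample standard deviation
    cv = sd / mean
    n_expressed = sum(1 for v in log2_values if v >= min_expr)
    return cv >= cv_cutoff and n_expressed >= min_samples

# The same two-sample subcluster, embedded in a small and a large sample set
small_set = [8.0] * 8 + [11.0, 11.0]    # 10 samples
large_set = [8.0] * 98 + [11.0, 11.0]   # 100 samples

small_result = passes_cv_filter(small_set)
large_result = passes_cv_filter(large_set)
```

With these toy numbers the gene passes in the 10-sample set but fails in the 100-sample set, although the two outlying samples are identical: the CV is diluted as the number of unremarkable samples grows.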
 
I wonder if this makes sense - based on the assumption (which I have seen in a couple of sources) that Affymetrix CEL intensities below 150 are unreliable and indicate only that expression is low - I would aim to retain genes with at least 2 samples at an intensity of 200-300 (depending on the number of samples), in order to pick up at least a distinct downregulation without running into the issue of "reliability below 150". I guess the next problem is how to capture small changes in intensity (if that is even possible) - if I were to use a fold-change filter, I would miss genes that were expressed at, say, 10000 (13.3 on the log2 scale) in 50 samples and 15000 (13.9) in another 50 samples. If I were to use CV, I may miss a subcluster.
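
The arithmetic behind the 10000 versus 15000 example can be checked with a short sketch. It assumes the 3-fold-from-the-mean criterion is evaluated on the log2 scale; the variable names are purely illustrative.

```python
import math

low, high = 10000.0, 15000.0
log2_low = math.log2(low)    # about 13.29
log2_high = math.log2(high)  # about 13.87

# 50 samples at each level, on the log2 scale
values = [log2_low] * 50 + [log2_high] * 50
mean = sum(values) / len(values)

# Largest deviation of any sample from the gene's mean (log2 units)
max_dev = max(abs(v - mean) for v in values)

# Deviation a 3-fold-from-the-mean filter would require (log2 units)
needed = math.log2(3.0)

fold_between_groups = high / low  # 1.5-fold between the two groups
gene_is_missed = max_dev < needed
```

Every sample sits only about 0.29 log2 units from the mean, far short of the roughly 1.585 needed, so this gene is missed even though it splits the samples cleanly into two equal groups.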
 
Your advice would be greatly appreciated!

Min-Han Tan
 
