[BioC] What to do with multiple probes in GSEA?

Dick Beyer dbeyer at u.washington.edu
Fri Oct 6 19:01:06 CEST 2006


I'd like to get anyone's comments on how to select the best probeset from a group of probesets that represent the same gene.

In the Broad Institute's GSEA software, it is important to have only one entry per gene in the GSEA input files, so as to avoid inflating the enrichment score. The GSEA documentation (Subramanian et al., PNAS 2005, and the GSEA user guide, 
http://www.broad.mit.edu/gsea/doc/GSEAUserGuide.pdf) recommends the following, 
quoted from page 35 of the user guide:

"Collapsing mode for probe sets => 1 gene. Select the value to use for the 
single probe that will represent all probe sets for the gene: max_probe 
(default) to use the highest expression value or median_of_probes to use the 
median value."
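
To make the two collapsing modes concrete, here is a minimal Python sketch, not the GSEA implementation. It assumes one expression value per probeset (GSEA itself works on a full samples-by-probesets dataset), and the function name, argument names and the numbers in the demo are all made up for illustration.

```python
from statistics import median


def collapse_probesets(expr, probe_to_gene, mode="max_probe"):
    """Collapse probeset-level values to one value per gene.

    expr          -- dict: probeset id -> expression value (hypothetical input)
    probe_to_gene -- dict: probeset id -> gene symbol (hypothetical input)
    mode          -- "max_probe" or "median_of_probes", named after the
                     two options quoted from the user guide
    """
    # Group the expression values by gene; skip probesets with no gene.
    by_gene = {}
    for probe, value in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        by_gene.setdefault(gene, []).append(value)

    if mode == "max_probe":
        pick = max
    elif mode == "median_of_probes":
        pick = median
    else:
        raise ValueError("unknown collapsing mode: %r" % mode)

    return {gene: pick(values) for gene, values in by_gene.items()}


if __name__ == "__main__":
    # Made-up values, for illustration only.
    expr = {"214702_at": 5.0, "214701_s_at": 7.0, "212464_s_at": 12.0}
    genes = {probe: "FN1" for probe in expr}
    print(collapse_probesets(expr, genes, "max_probe"))
    print(collapse_probesets(expr, genes, "median_of_probes"))
```

Neither mode looks at anything other than the expression values, which is the limitation discussed next.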

I think that this procedure may not yield the best results in all cases. For example, I have several HG-U133A chips, and on this chip there are six probesets for the gene FN1: 214702_at, 214701_s_at, 212464_s_at, 210495_x_at, 211719_x_at, 216442_x_at. If I use the probeset with the highest expression value, I would choose 212464_s_at. However, on closer inspection, only two of these probesets, 214702_at and 214701_s_at, do not cross-hybridize, and only one of them, 214702_at, was originally designed to hybridize uniquely.

It seems reasonable to me to use this cross-hybridization information in the selection of the best probeset.

There is further information on Affy probeset transcript assignment, which results in each probeset being assigned a Grade (A, B, C, E, R). The document I received from Affy, "Transcript Assignment Whitepaper101205.doc", describes this process. Perhaps that information should be used as well.

Since the cross-hybridization information is available in electronic form (HG-U133A_annot.csv from the Affy website), it seems relatively easy to use it for some type of filtering.

In summary, in the context of GSEA, when you have to choose the best single probeset from a set of probesets that all represent the same gene, should those probesets that are known to cross-hybridize be rejected? Of the remaining probesets, should the ones with a higher Grade be preferred over those with a lower Grade?
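
The rule being asked about can be written down as a short Python sketch, for discussion only. Several things here are assumptions: that A is the best Grade and R the worst (check the Affy whitepaper), that ties are broken by the higher expression value, and that if every probeset cross-hybridizes we fall back to the full set rather than drop the gene. The record fields (`id`, `cross_hyb`, `grade`, `expr`) are hypothetical names, not columns of HG-U133A_annot.csv.

```python
# Assumption: grades are ordered A (best) to R (worst).
GRADE_RANK = {"A": 0, "B": 1, "C": 2, "E": 3, "R": 4}


def pick_probeset(candidates):
    """Return the id of the preferred probeset for one gene.

    candidates -- non-empty list of dicts with the hypothetical keys
                  "id", "cross_hyb" (bool), "grade" and "expr".
    """
    # Step 1: reject probesets that are known to cross-hybridize.
    clean = [c for c in candidates if not c["cross_hyb"]]
    # Assumption: if nothing survives, keep all probesets rather than
    # lose the gene from the GSEA input.
    pool = clean or candidates

    # Step 2: prefer the better Grade; unknown grades sort last.
    # Step 3 (assumption): break ties with the higher expression value.
    def sort_key(c):
        rank = GRADE_RANK.get(c["grade"], len(GRADE_RANK))
        return (rank, -c["expr"])

    return min(pool, key=sort_key)["id"]
```

With this ordering, a non-cross-hybridizing probeset always wins over a cross-hybridizing one regardless of expression level, which is the opposite priority from max_probe.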

Generally speaking, I would probably want to live with more false positives and discuss such cases with the investigator to see if further validation is necessary. In my particular case of FN1 on HG-U133A, the cross-hybing probesets show differential expression in the opposite direction from the non-cross-hybing probesets. I suppose you could also run more than one GSEA analysis, using different criteria for handling multiple probesets, but that seems a bit daunting.
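
Cases like FN1, where probesets for the same gene disagree in direction, could be flagged up front for discussion with the investigator. A minimal Python sketch follows, assuming a dict of per-probeset log ratios; the function and argument names are hypothetical.

```python
def discordant_genes(log_ratio, probe_to_gene):
    """List genes whose probesets show both up- and down-regulation.

    log_ratio     -- dict: probeset id -> log ratio (hypothetical input)
    probe_to_gene -- dict: probeset id -> gene symbol (hypothetical input)
    """
    directions = {}
    for probe, value in log_ratio.items():
        gene = probe_to_gene.get(probe)
        # Ignore unannotated probesets and exact zeros (no direction).
        if gene is None or value == 0:
            continue
        directions.setdefault(gene, set()).add(value > 0)

    # A gene is discordant when both True (up) and False (down) were seen.
    return sorted(g for g, seen in directions.items() if len(seen) == 2)
```

This only compares signs. In practice one would probably also require each probeset to pass some significance threshold before calling a gene discordant.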

If anyone has comments, I will collect them and report back.

Thanks very much,
Dick

*******************************************************************************
Richard P. Beyer, Ph.D.	University of Washington
Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
 			Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
http://staff.washington.edu/~dbeyer