[BioC] select one Affy probeset for one gene

James W. MacDonald jmacdon at med.umich.edu
Tue Mar 14 00:33:40 CET 2006

Robert Gentleman wrote:
> Hi,
> Sean Davis wrote:
>>On 3/13/06 3:38 PM, "Glazko, Galina" <Galina_Glazko at urmc.rochester.edu>
>>>Dear list,
>>>Is there a way to automatically select one probeset for one gene in Affy
>>>Say, if we have several probesets for a given gene, we select the one
>>>with the highest level of expression, or based on any other reasonable
>>>I am sorry if this question was answered before; it seems to be a very
>>>basic question and I hope there is a solution...
>>You can contrive a solution, I suppose.  However, I'm not sure this is a
>>good idea.  Whatever "reasonable criteria" you use are likely to lead to
>>bias.  Filtering on unmeasured probesets or other quality measures applied
>>equally to all probesets is probably reasonable, but applying such criteria
>>on a per-gene basis is not.  There have been related discussions in the
>>past, often centering around "averaging" expression values.
>>The more accepted way of dealing with multiple probesets is to do your
>>analysis based on the probeset; only after that is done do you then connect
>>your gene labels back to the probesets.
>   Unfortunately that approach does not always work and something needs 
> to be done a bit earlier in the process if a user wants to make use of 
> data such as GO, chromosomal location etc where the mapping is based on 
> Entrez Gene ID (for example, but other identifiers have very similar 
> issues). Not removing the duplicates often leads to quite different 
> results (in essence there is over-counting if all probes are accurate). 
> As users of GOstats know, you have to choose one candidate for each 
> Entrez gene id (and probably what I have been doing there is not ideal - 
> the suggestion below, due to Seth Falcon is, I think, better). But I 
> would be interested to hear other points of view.
>   I also do not like averaging, for several reasons. First, I now have 
> two kinds of measurements (averages and ordinary old probes) and that is 
> problematic for some uses. Second, if not all of the probes work (which 
> might be why there are several variants) then I am averaging the good 
> with the bad, which also seems like a less than ideal way to go.
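
To make the over-counting point above concrete, here is a minimal sketch
in Python (not BioC code). All probeset names, gene IDs and the category
are made up for illustration; the two counting functions are hypothetical
helpers, not part of GOstats or any other package.

```python
# Hypothetical probeset -> gene ID map; three probesets hit gene "g1".
probeset_to_gene = {
    "ps_a1": "g1", "ps_a2": "g1", "ps_a3": "g1",
    "ps_b1": "g2",
    "ps_c1": "g3", "ps_c2": "g3",
    "ps_d1": "g4",
}

# Hypothetical category (think of one GO term), defined on gene IDs.
category_genes = {"g1", "g3"}

# Hypothetical list of probesets flagged as interesting by some analysis.
selected_probesets = ["ps_a1", "ps_a2", "ps_a3", "ps_b1"]


def count_by_probeset(selected, mapping, category):
    """Count every selected probeset whose gene is in the category."""
    return sum(1 for ps in selected if mapping[ps] in category)


def count_by_gene(selected, mapping, category):
    """Collapse probesets to unique genes first, then count."""
    genes = {mapping[ps] for ps in selected}
    return len(genes & category)


hits_ps = count_by_probeset(selected_probesets, probeset_to_gene, category_genes)
hits_gene = count_by_gene(selected_probesets, probeset_to_gene, category_genes)
print(hits_ps, "of", len(selected_probesets), "probesets are in the category")
print(hits_gene, "of", len({probeset_to_gene[p] for p in selected_probesets}),
      "genes are in the category")
```

Counting by probeset, the single gene "g1" contributes three hits; counting
by gene, it contributes one. That difference is what feeds into any
gene-based category test.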

One inherent problem with using the Affy probesets is that there are 
known issues with many of the probes; some measure related transcripts 
and others measure unrelated transcripts, so what you are measuring is 
not always clear. The re-mapped MBNI CDFs may help with at least two of 
these problems. First, all probes that no longer BLAST to the transcript 
of interest are removed from consideration. Second, all probes that do 
BLAST to the transcript of interest are pooled into one probeset (I 
guess you could argue this is bad since the expression measures are now 
based on variable numbers of probes, but that is already true 
anyway...). Note that these CDFs are planned to be part of the new 
release of BioC, but currently are only available from the MBNI website.


Since you now have only one probeset per gene (based on Entrez Gene, 
UniGene, RefSeq, or Ensembl) you no longer have to decide which one to 
use. The biggest downside to using these CDFs is the lack of 
infrastructure in BioC tailored to their use, so working with them 
requires a higher level of understanding of R than one would need to use 
a 'stock' CDF (which reminds me - I should be doing something about 
that ;-D).
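
The two re-mapping steps described above (drop probes that no longer
match, then pool the rest per gene) can be sketched roughly as follows.
This is a toy illustration in Python, not the MBNI procedure itself; the
probe records, the `remap` helper and all IDs are assumptions made up for
the example.

```python
from collections import defaultdict

# Hypothetical probe records:
# (probe id, original probeset, gene the probe matches now, or None).
probes = [
    ("p1", "ps_a", "g1"),
    ("p2", "ps_a", "g1"),
    ("p3", "ps_a", None),   # no longer matches the transcript of interest
    ("p4", "ps_b", "g1"),   # a second original probeset for the same gene
    ("p5", "ps_c", "g2"),
    ("p6", "ps_c", None),
]


def remap(probe_records):
    """Drop unmatched probes, then pool the rest into one set per gene."""
    pooled = defaultdict(list)
    for probe_id, _old_probeset, gene in probe_records:
        if gene is None:
            continue  # step 1: remove probes that no longer match
        pooled[gene].append(probe_id)  # step 2: one probeset per gene
    return dict(pooled)


remapped = remap(probes)
for gene, members in remapped.items():
    print(gene, members)
```

Note that the resulting per-gene probesets have different numbers of
probes (three for "g1", one for "g2" here), which is the caveat mentioned
above.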



>    One suggestion is to do non-specific filtering (say on variation, or 
> for expressed versus not, or something of that ilk) and to then select, 
> for each gene, the probe set that has the highest value of the filtering 
> statistic. Thus, you are selecting the probe set with the most 
> information (but do be careful not to use any phenotypic information, 
> as this could cause problems). Your (Galina's) 
> suggestion was to use level of expression, but that is generally a bad 
> idea because that would involve a between-probe, within-array comparison, 
> and these are not ideal; just because one spot is brighter does not mean 
> it works better, or that there is more mRNA than for a less bright spot.
>   HTH
>    Robert
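
A minimal sketch of the selection Robert describes, written in Python
rather than R. It assumes that the statistic being maximised is the
across-array variance (one possible non-specific measure; IQR would work
the same way) and that a probeset-to-gene map is available. The data, the
IDs and the `pick_most_variable` helper are all hypothetical.

```python
from statistics import variance

# Hypothetical expression values: one list of per-array values per probeset.
expr = {
    "ps_a1": [7.0, 7.1, 6.9, 7.0],   # nearly flat across arrays
    "ps_a2": [5.0, 8.0, 6.0, 9.0],   # varies a lot across arrays
    "ps_b1": [4.0, 4.5, 5.0, 4.2],
}

# Hypothetical probeset -> gene ID map.
probeset_to_gene = {"ps_a1": "g1", "ps_a2": "g1", "ps_b1": "g2"}


def pick_most_variable(expr_values, mapping):
    """For each gene, keep the probeset with the largest sample variance.

    No phenotype or group labels are used, so the choice is non-specific.
    """
    best = {}
    for probeset, values in expr_values.items():
        gene = mapping.get(probeset)
        if gene is None:
            continue  # unannotated probesets are skipped
        score = variance(values)
        if gene not in best or score > best[gene][1]:
            best[gene] = (probeset, score)
    return {gene: probeset for gene, (probeset, _score) in best.items()}


chosen = pick_most_variable(expr, probeset_to_gene)
print(chosen)
```

The comparison is made within a gene on variability across arrays, not on
raw intensity, so it avoids the between-probe, within-array comparison
that makes "brightest probeset" a poor criterion.
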

James W. MacDonald
University of Michigan
Affymetrix and cDNA Microarray Core
1500 E Medical Center Drive
Ann Arbor MI 48109

