[BioC] select one Affy probeset for one gene
Robert Gentleman
rgentlem at fhcrc.org
Mon Mar 13 22:55:49 CET 2006
Hi,
Sean Davis wrote:
>
>
> On 3/13/06 3:38 PM, "Glazko, Galina" <Galina_Glazko at urmc.rochester.edu>
> wrote:
>
>
>>Dear list,
>>
>>
>>
>>Is there a way to automatically select one probeset for one gene in Affy
>>arrays?
>>
>>Say, if we have several probesets for a given gene, we select the one
>>with the highest level of expression, or based on any other reasonable
>>criteria...?
>>
>>I am sorry if this question was answered before, it seems to be very
>>basic question and I hope there is the solution...
>
>
> Galina,
>
> You can contrive a solution, I suppose. However, I'm not sure this is a
> good idea. Whatever "reasonable criteria" you use are likely to lead to
> bias. Filtering on unmeasured probesets or other quality measures applied
> equally to all probesets is probably reasonable, but not applying on a
> per-gene basis. There have been related discussions in the past, often
> centering around "averaging" expression values.
>
> The more accepted way of dealing with multiple probesets is to do your
> analysis based on the probeset; only after that is done do you then connect
> your gene labels back to the probesets.
Unfortunately that approach does not always work, and something needs
to be done a bit earlier in the process if a user wants to make use of
data such as GO, chromosomal location, etc., where the mapping is based on
Entrez Gene ID (for example; other identifiers have very similar
issues). Not removing the duplicates often leads to quite different
results (in essence, there is over-counting if all of the probes are accurate).
As users of GOstats know, you have to choose one candidate for each
Entrez Gene ID (and what I have been doing there is probably not ideal;
the suggestion below, due to Seth Falcon, is, I think, better). But I
would be interested to hear other points of view.
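To make the over-counting concrete, here is a minimal Python sketch. The probeset IDs, the probeset-to-gene mapping, and the category are all invented for illustration, and this is not how GOstats itself does the counting; it only shows that a category defined on genes gets a different count depending on whether duplicates are collapsed.

```python
# Hypothetical probeset -> Entrez Gene ID mapping (all IDs invented).
probe_to_gene = {
    "1000_at": "5595",
    "1001_at": "5595",
    "1002_s_at": "5595",
    "1003_at": "7157",
    "1004_at": "1956",
}

# Probesets flagged as interesting by some upstream analysis (assumed).
selected_probes = ["1000_at", "1001_at", "1002_s_at", "1003_at"]

# Hypothetical category (e.g. one GO term), defined on genes, not probesets.
category_genes = {"5595", "1956"}


def count_in_category(probes):
    """Return (number of probeset hits, number of distinct gene hits)."""
    hits = [p for p in probes if probe_to_gene[p] in category_genes]
    genes = {probe_to_gene[p] for p in hits}
    return len(hits), len(genes)


probe_hits, gene_hits = count_in_category(selected_probes)
print("probeset hits:", probe_hits, "distinct gene hits:", gene_hits)
```

In this toy data one gene contributes every probeset hit, so a test that counts probesets would treat a single gene as several independent observations.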
I also do not like averaging, for two reasons. First, I then have two
kinds of measurements (averages and ordinary old probes), and that is
problematic for some uses. Second, if not all of the probes work (which
might be why there are several variants), then I am averaging the good
with the bad, which also seems like a less than ideal way to go.
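The second point can be illustrated with a small Python sketch. The numbers are invented log-scale values for two probesets assumed to target the same gene, one that responds and one that is assumed not to work; nothing here comes from real array data.

```python
from statistics import mean

# Invented values for one gene across six samples.
working = [6.0, 6.2, 5.9, 9.1, 9.0, 8.8]  # tracks a change between samples
broken = [7.0, 7.1, 6.9, 7.0, 7.1, 6.9]   # flat: assumed not to hybridize well

# Per-sample average of the two probesets.
averaged = [mean(pair) for pair in zip(working, broken)]


def spread(values):
    """Range of the values, used here as a crude measure of signal."""
    return max(values) - min(values)


print("working :", spread(working))
print("averaged:", spread(averaged))
```

Averaging with the flat probeset roughly halves the range of the working one, which is the sense in which the good gets mixed with the bad.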
One suggestion is to do non-specific filtering (say on variation, or
on expressed versus not, or something of that ilk) and then to select,
for each gene, the probe set that has the highest value of the filtering
statistic. Thus, you are selecting the probe set with the most
information (but do be careful not to use any phenotypic information,
as this could cause problems). Your (Galina's) suggestion was to use
level of expression, but that is generally a bad idea because it would
involve a between-probe, within-array comparison, and these are not
ideal; just because one spot is brighter does not mean that it works
better, or that there is more mRNA than for a less bright spot.
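The filter-then-select idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression values, the probeset-to-gene mapping, the helper name select_one_probeset_per_gene, the choice of sample variance as the statistic, and the min_var cutoff are all hypothetical. Note that only the expression values are used, never the sample phenotypes.

```python
from statistics import variance

# Invented expression values: probeset -> values across six samples.
expr = {
    "1000_at": [6.0, 6.2, 5.9, 9.1, 9.0, 8.8],
    "1001_at": [7.0, 7.1, 6.9, 7.0, 7.1, 6.9],
    "1002_s_at": [5.0, 5.5, 5.2, 6.4, 6.6, 6.1],
    "1003_at": [8.0, 8.9, 7.5, 8.2, 9.4, 7.7],
    "1004_at": [4.0, 4.0, 4.1, 4.0, 4.1, 4.0],
}

# Hypothetical probeset -> Entrez Gene ID mapping (all IDs invented).
probe_to_gene = {
    "1000_at": "5595",
    "1001_at": "5595",
    "1002_s_at": "5595",
    "1003_at": "7157",
    "1004_at": "1956",
}


def select_one_probeset_per_gene(expr, probe_to_gene, min_var=0.05):
    """Drop low-variance probesets, then keep the most variable one per gene."""
    best = {}  # gene -> (variance, probeset)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # unmapped probesets cannot be used in gene-based analyses
        v = variance(values)
        if v < min_var:
            continue  # non-specific filter: too little variation across samples
        if gene not in best or v > best[gene][0]:
            best[gene] = (v, probe)
    return {gene: probe for gene, (_, probe) in best.items()}


chosen = select_one_probeset_per_gene(expr, probe_to_gene)
print(chosen)
```

The comparison is made within a probeset across arrays (its variance), not between probesets on brightness, which is the distinction drawn above. A gene whose probesets all fail the filter simply drops out.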
HTH
Robert
>
> Sean
>
--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org