[BioC] select one Affy probeset for one gene
Robert Gentleman
rgentlem at fhcrc.org
Tue Mar 14 06:45:56 CET 2006
James W. MacDonald wrote:
> Robert Gentleman wrote:
>
>>Hi,
>>
>>Sean Davis wrote:
>>
>>
>>>On 3/13/06 3:38 PM, "Glazko, Galina" <Galina_Glazko at urmc.rochester.edu>
>>>wrote:
>>>
>>>
>>>
>>>
>>>>Dear list,
>>>>
>>>>
>>>>
>>>>Is there a way to automatically select one probeset for one gene in Affy
>>>>arrays?
>>>>
>>>>Say, if we have several probesets for a given gene, we select the one
>>>>with the highest level of expression, or based on any other reasonable
>>>>criteria...?
>>>>
>>>>I am sorry if this question was answered before; it seems to be a very
>>>>basic question and I hope there is a solution...
>>>
>>>
>>>Galina,
>>>
>>>You can contrive a solution, I suppose. However, I'm not sure this is a
>>>good idea. Whatever "reasonable criteria" you use are likely to lead to
>>>bias. Filtering on unmeasured probesets or other quality measures applied
>>>equally to all probesets is probably reasonable, but applying such criteria
>>>on a per-gene basis is not. There have been related discussions in the past,
>>>often centering around "averaging" expression values.
>>>
>>>The more accepted way of dealing with multiple probesets is to do your
>>>analysis based on the probeset; only after that is done do you then connect
>>>your gene labels back to the probesets.
>>
>>
>>
>> Unfortunately that approach does not always work and something needs
>>to be done a bit earlier in the process if a user wants to make use of
>>data such as GO, chromosomal location etc where the mapping is based on
>>Entrez Gene ID (for example, but other identifiers have very similar
>>issues). Not removing the duplicates often leads to quite different
>>results (in essence there is over-counting if all probes are accurate).
>>As users of GOstats know, you have to choose one candidate for each
>>Entrez gene id (and probably what I have been doing there is not ideal -
>>the suggestion below, due to Seth Falcon is, I think, better). But I
>>would be interested to hear other points of view.
>>
>> I also do not like averaging for several reasons. First, I then have two
>>kinds of measurements (averages and ordinary old probes) and that is
>>problematic for some uses. Second, if not all of the probes work (which
>>might be why there are several variants) then I am averaging the good
>>with the bad, which also seems like a less than ideal way to go.
>
>
> One inherent problem with using the Affy probesets is that there are
> known issues with many of the probes; some measure related transcripts
> and others measure unrelated transcripts, so what you are measuring is
> not always clear. The MBNI cdfs which have been re-mapped may help with
> at least two of these problems. First, all probes that no longer blast
> to the transcript of interest are removed from consideration. Second,
> all probes that do blast to the transcript of interest are piled
> together into one probeset (I guess you could argue this is bad since
> the expression measures are now based on variable numbers of probes, but
> that is already true anyway...). Note that these cdfs are planned to be
> part of the new release of BioC, but currently are only available from
> the MBNI website
>
> http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp
>
> Since you now have only one probeset per gene (based on Entrez Gene,
> UniGene, RefSeq, or Ensembl) you no longer have to decide which one to
> use. The biggest downside to using these cdfs is the lack of
> infrastructure in BioC that is tailored to their use, which requires a
> higher level of understanding of R than one would need to use a 'stock'
> cdf (which reminds me - I should be doing something about that ;-D).
Hi,
These are good points, but I think that they are complementary rather
than a strict replacement. First, I might just have expression data, not
CEL files, so this approach would not be an option. Second, I might
decide to map to UniGene or RefSeq, and then would still have the same
problem, since these do not necessarily have a 1-1 correspondence with
Entrez Gene. And finally, I might be working with cDNA arrays where there
is no clear way to take this same approach. That is not to say that this
is not a viable approach; it certainly does solve some problems.
best wishes
Robert
>
> HTH,
>
> Jim
>
>
>
>> One suggestion is to do non-specific filtering (say on variation, or
>>for expressed versus not, or something of that ilk) and to then select,
>>for each gene, the probe set that has the highest value of the filtering
>>statistic. Thus, you are selecting the probe set with the most information
>>(but do be careful not to use any phenotypic information as this could
>>cause problems). Your (Galina's) suggestion was to use level of expression,
>>but that is generally a bad idea because that would involve a
>>between-probe, within-array comparison, and these are not ideal; just
>>because one spot is brighter does not mean it works better, or that there
>>is more mRNA than a less bright spot.
>>
>> HTH
>> Robert
>>
>>
>>
>>
>>>Sean
>>>
>>>_______________________________________________
>>>Bioconductor mailing list
>>>Bioconductor at stat.math.ethz.ch
>>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>
>>
>>
>
>
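
To make the suggestion quoted above concrete (non-specific filtering, then
keeping one probe set per gene), here is a minimal sketch in Python. It is
not Bioconductor code and is only one possible reading of the idea. It
assumes the expression values are already summarised per probe set, that
the interquartile range (IQR) across arrays is the variation statistic, and
that a probe-set-to-gene mapping is at hand. The function names, the
min_iqr cutoff of 0.5, and all probe set and gene identifiers are
hypothetical and chosen purely for illustration.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range of one probe set's values across arrays."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def select_one_probeset_per_gene(expr, probe_to_gene, min_iqr=0.5):
    """Return {gene_id: probeset_id}, keeping the most variable probe set.

    expr          : {probeset_id: [expression value per array]}
    probe_to_gene : {probeset_id: gene_id}
    min_iqr       : hypothetical cutoff for the non-specific filter

    Only the expression values are used; no phenotype or sample labels
    enter the selection.
    """
    best = {}  # gene_id -> (iqr, probeset_id)
    for probeset, values in expr.items():
        gene = probe_to_gene.get(probeset)
        if gene is None:
            continue  # probe set has no gene annotation
        spread = iqr(values)
        if spread < min_iqr:
            continue  # non-specific filter: too little variation
        if gene not in best or spread > best[gene][0]:
            best[gene] = (spread, probeset)
    return {gene: probeset for gene, (_, probeset) in best.items()}


# Made-up data: four probe sets, four arrays, three genes.
expr = {
    "ps_A1": [7.0, 7.1, 6.9, 7.0],  # gene_1, nearly constant
    "ps_A2": [5.0, 8.0, 6.0, 9.0],  # gene_1, variable
    "ps_B1": [4.0, 6.0, 5.0, 7.0],  # gene_2, only probe set
    "ps_C1": [3.0, 3.1, 3.0, 3.1],  # gene_3, nearly constant
}
probe_to_gene = {
    "ps_A1": "gene_1",
    "ps_A2": "gene_1",
    "ps_B1": "gene_2",
    "ps_C1": "gene_3",
}

selected = select_one_probeset_per_gene(expr, probe_to_gene)
print(selected)
```

With this made-up data, gene_1 is represented by ps_A2, gene_2 by ps_B1,
and gene_3 drops out because its only probe set fails the filter. Note that
ps_A1 has the higher average level for gene_1 but is not chosen, since the
choice rests on variation across arrays and not on brightness.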
--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org