[BioC] Duplicate probes

Wed Mar 21 18:34:07 CET 2012

Hi Ed,

On 3/21/2012 12:20 PM, Ed Siefker wrote:
> I am analyzing affymetrix hgu133plus2 arrays with limma.
> These arrays sometimes contain multiple probes for a
> single gene.  I would like to combine the readings so that
> I get exactly one estimate of fold change per (entrez) gene.
>
> I looked at the duplicateCorrelation() function, but that
> doesn't seem to apply.  If I understand correctly, it's for
> averaging duplicate spots per probe, not duplicate probes
> per gene. It requires the same number of duplicates across
> the chip anyway, which I don't have.
>
> Just for illustration, here's a sample of some normalized
> expression data:
>
>
> gene   control average  test average  test-control  Linear fold change
> GENE1          2.38127       4.00571       1.62444             3.08322
> GENE1          12.1182       13.5405       1.42224             2.68001
> GENE1          9.85812       11.4534       1.59533             3.02163
> GENE2          12.9662       12.7992       -0.1670             0.89070
> GENE3          12.9649       12.9777       0.01275             1.00887
> GENE3          2.23400       2.22957       -0.0044             0.99693
> GENE3          11.8682       11.6099       -0.2583             0.83606
>
>
> So it's pretty obvious that I can't just average the expression
> values, as they range from around 2 to around 12 for the same
> gene.  It's also clear that I can't just filter out the probes with
> the least fold change, because that would lead to GENE1 and
> GENE3 being called as differentially expressed, when the data
> appears to support differential expression of GENE1 much
> more strongly than it does GENE3.
>
> For GENE1, 3 of 3 probes show a fold change near 3.  For
> GENE3, 2 of 3 probes show no fold change at all.  How do I use
> this information to adjust the estimation of confidence in differential
> expression?

Depends on what assumptions you want to make.

You could assume that some of the probesets don't do a good job of 
measuring the transcript of interest, and just select the one probeset 
with the largest difference in a given comparison. See findLargest() in 
the genefilter package.
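
As a rough illustration of that selection rule, here is a minimal sketch
that keeps, for each gene, the one probeset with the largest absolute
test-minus-control difference. It is not the genefilter code, and it is
written in plain Python only so that it runs without any Bioconductor
packages. The function name select_largest, the probeset labels, and the
use of the absolute difference as the ranking statistic are all
assumptions made for the example; the differences are the sample values
quoted above.

```python
# Sample data from the quoted table: (gene, probeset, test - control).
# The probeset labels are hypothetical; the post gives only gene names.
rows = [
    ("GENE1", "probeset_a", 1.62444),
    ("GENE1", "probeset_b", 1.42224),
    ("GENE1", "probeset_c", 1.59533),
    ("GENE2", "probeset_d", -0.1670),
    ("GENE3", "probeset_e", 0.01275),
    ("GENE3", "probeset_f", -0.0044),
    ("GENE3", "probeset_g", -0.2583),
]


def select_largest(rows):
    """Keep, per gene, the probeset with the largest absolute difference.

    Assumption: the absolute difference is the ranking statistic. Any
    other per-probeset statistic could be substituted.
    """
    best = {}
    for gene, probeset, diff in rows:
        # Replace the stored probeset only when this one is more extreme.
        if gene not in best or abs(diff) > abs(best[gene][1]):
            best[gene] = (probeset, diff)
    return best


selected = select_largest(rows)
for gene, (probeset, diff) in sorted(selected.items()):
    print(gene, probeset, diff)
```

Note that this keeps exactly one row per gene and discards the
information about how well the other probesets agree, which is the
trade-off that comes with this assumption.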

You could assume that some of the probes/probesets don't really measure 
the transcript of interest and use an alternative probe to probeset 
mapping that only uses those probes shown to actually be complementary 
to the transcript of interest. See e.g. 
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/15.0.0/entrezg.asp.

These re-mapped cdfs can be installed using 
biocLite("hgu133plus2hsentrezgcdf") and then used with e.g., 
ReadAffy(cdfname="hgu133plus2hsentrezgcdf").

Or you could assume that some of the duplicate probesets measure 
differentially spliced transcripts, and leave them all in, and deal with 
duplicates on the back end (validation, etc).

I don't know of any other readily accessible ways to deal with these 
probesets, but others may chime in with suggestions.

Best,

Jim


-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099