[BioC] Duplicate probes

Wed Mar 21 17:20:17 CET 2012

I am analyzing Affymetrix hgu133plus2 arrays with limma.
These arrays sometimes contain multiple probes for a
single gene.  I would like to combine the readings so that
I get exactly one estimate of fold change per (entrez) gene.

I looked at the duplicateCorrelation() function, but that
doesn't seem to apply.  If I understand correctly, it's for
averaging duplicate spots per probe, not duplicate probes
per gene. It requires the same number of duplicates across
the chip anyway, which I don't have.

Just for illustration, here's a sample of some normalized
expression data:

gene   control average  test average  test-control  linear fold change
GENE1  2.38127          4.00571        1.62444      3.08322
GENE1  12.1182          13.5405        1.42224      2.68001
GENE1  9.85812          11.4534        1.59533      3.02163
GENE2  12.9662          12.7992       -0.1670       0.89070
GENE3  12.9649          12.9777        0.01275      1.00887
GENE3  2.23400          2.22957       -0.0044       0.99693
GENE3  11.8682          11.6099       -0.2583       0.83606
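
For anyone checking the columns, here is a minimal sketch of how the last two appear to be derived from the first two. It assumes the averages are on a log2 scale, which the post does not state but which the numbers are consistent with. It is written in plain Python only for illustration, and the function names are hypothetical.

```python
# Probe-level averages copied from the sample rows: (gene, control, test).
PROBES = [
    ("GENE1", 2.38127, 4.00571),
    ("GENE1", 12.1182, 13.5405),
    ("GENE1", 9.85812, 11.4534),
    ("GENE2", 12.9662, 12.7992),
    ("GENE3", 12.9649, 12.9777),
    ("GENE3", 2.23400, 2.22957),
    ("GENE3", 11.8682, 11.6099),
]


def log_difference(control, test):
    """The 'test-control' column: difference of the two averages."""
    return test - control


def linear_fold_change(control, test):
    """The 'linear fold change' column, ASSUMING log2-scale inputs."""
    return 2.0 ** (test - control)


for gene, control, test in PROBES:
    diff = log_difference(control, test)
    fold = linear_fold_change(control, test)
    print(f"{gene}  {diff:+.5f}  {fold:.5f}")
```

Under that assumption a difference of 1 corresponds to a 2-fold change, and a negative difference gives a fold change below 1.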

So it's pretty obvious that I can't just average the expression
values, as they range from around 2 to around 12 for the same
gene.  It's also clear that I can't just filter out the probes with
the least fold change, because that would lead to GENE1 and
GENE3 being called as differentially expressed, when the data
appears to support differential expression of GENE1 much
more strongly than it does GENE3.

For GENE1, 3 of 3 probes show a fold change near 3.  For
GENE3, 2 of 3 probes show no fold change at all.  How do I use
this information to adjust the estimation of confidence in differential
expression?
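
To make the "k of n probes" reasoning concrete, here is a rough sketch that groups the sample rows by gene and reports three descriptive numbers: the spread of the control averages, the single largest test-control difference, and how many probes pass a cutoff. The cutoff of 1 is an arbitrary assumption (2-fold, if the values are log2), and the function and field names are hypothetical. This is plain Python for illustration only; it is not a limma feature and it does not produce a confidence estimate.

```python
from collections import defaultdict

# Assumed cutoff, not taken from the post: a probe "shows a change" when
# |test-control| >= 1, i.e. at least 2-fold if the values are log2.
THRESHOLD = 1.0

# (gene, control average, test-control) copied from the sample rows.
ROWS = [
    ("GENE1", 2.38127, 1.62444),
    ("GENE1", 12.1182, 1.42224),
    ("GENE1", 9.85812, 1.59533),
    ("GENE2", 12.9662, -0.1670),
    ("GENE3", 12.9649, 0.01275),
    ("GENE3", 2.23400, -0.0044),
    ("GENE3", 11.8682, -0.2583),
]


def summarize(rows, threshold=THRESHOLD):
    """Group probes by gene and return simple descriptive summaries."""
    by_gene = defaultdict(list)
    for gene, control, diff in rows:
        by_gene[gene].append((control, diff))

    summary = {}
    for gene, probes in by_gene.items():
        controls = [c for c, _ in probes]
        diffs = [d for _, d in probes]
        summary[gene] = {
            # How far apart the probes' expression levels are.
            "control_spread": max(controls) - min(controls),
            # What "keep only the most-changed probe" would retain.
            "largest_diff": max(diffs, key=abs),
            # The "k of n probes" tally.
            "probes_changed": sum(abs(d) >= threshold for d in diffs),
            "probes_total": len(diffs),
        }
    return summary


for gene, s in sorted(summarize(ROWS).items()):
    print(
        f"{gene}: {s['probes_changed']} of {s['probes_total']} probes past the cutoff, "
        f"largest difference {s['largest_diff']:+.5f}, "
        f"control spread {s['control_spread']:.2f}"
    )
```

The spread column shows why averaging raw expression values is unattractive, and the tally is the piece of information that keeping only the most-changed probe would throw away.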