[BioC] Duplicate probes
J.delasHeras at ed.ac.uk
Wed Mar 21 18:38:51 CET 2012
Quoting Ed Siefker <ebs15242 at gmail.com> on Wed, 21 Mar 2012 11:20:17 -0500:
> I am analyzing affymetrix hgu133plus2 arrays with limma.
> These arrays sometimes contain multiple probes for a
> single gene. I would like to combine the readings so that
> I get exactly one estimate of fold change per (entrez) gene.
> I looked at the duplicateCorrelation() function, but that
> doesn't seem to apply. If I understand correctly, it's for
> averaging duplicate spots per probe, not duplicate probes
> per gene. It requires the same number of duplicates across
> the chip anyway, which I don't have.
> Just for illustration, here's a sample of some normalized
> expression data:
>         control average   test average   test-control   Linear fold change
> GENE1   2.38127           4.00571        1.62444        3.08322
> GENE1   12.1182           13.5405        1.42224        2.68001
> GENE1   9.85812           11.4534        1.59533        3.02163
> GENE2   12.9662           12.7992        -0.1670        0.89070
> GENE3   12.9649           12.9777        0.01275        1.00887
> GENE3   2.23400           2.22957        -0.0044        0.99693
> GENE3   11.8682           11.6099        -0.2583        0.83606
> So it's pretty obvious that I can't just average the expression
> values, as they range from around 2 to around 12 for the same
> gene. It's also clear that I can't just filter out the probes with
> the least fold change, because that would lead to GENE1 and
> GENE3 being called as differentially expressed, when the data
> appears to support differential expression of GENE1 much
> more strongly than it does GENE3.
> For GENE1, 3 of 3 probes show a fold change near 3. For
> GENE3, 2 of 3 probes show no fold change at all. How do I use
> this information to adjust the estimation of confidence in differential
It's not trivial to decide what to do with multiple probes.
There are methods to summarise probeset data, using some kind of
weighted median algorithm and other ways. But the truth is that
sometimes probes "misbehave": they do not provide the signal we
expect. Perhaps they cross-hybridise with other RNAs that we do not,
in principle, know about, for instance transcript variants that are not
I personally have decided to keep each probe separate in my analyses.
When I look at my list of DE genes, I would expect that if a
transcript is represented on my arrays 3 times, by 3 different probes,
all three would appear in my DE list. If I get only 2, I can then ask
why the third probe did not behave the same way... that information is
sometimes interesting, as you have the sequence information and can
check where it matches. Sometimes you just can't figure it out... but
I think 2 out of 3 is decent, so I keep it in my list. Even if you
only get one hit, it can be a good hit... The bottom line is it will
be hard to decide which to discount and which to trust without
detailed investigation... and possibly experimentation. That's ok if
you decide to focus on a handful of transcripts after seeing your
results, but it is not practical on a large scale.
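
As a rough illustration of keeping probes separate and then asking how many probes per gene agree, here is a minimal Python sketch. It is not a limma workflow: the probe IDs and the log2 fold-change cutoff of 1.0 are assumptions made up for the example, and a real analysis would filter on adjusted p-values as well. The fold changes are the test-control values from the quoted table.

```python
from collections import defaultdict

# Probe-level results as (probe_id, gene, log2 fold change).
# The log2 fold changes come from the quoted table; the probe IDs
# are invented for illustration.
probes = [
    ("p1", "GENE1", 1.62444),
    ("p2", "GENE1", 1.42224),
    ("p3", "GENE1", 1.59533),
    ("p4", "GENE2", -0.1670),
    ("p5", "GENE3", 0.01275),
    ("p6", "GENE3", -0.0044),
    ("p7", "GENE3", -0.2583),
]

LFC_CUTOFF = 1.0  # assumed threshold, chosen only for this example


def probe_support(rows, cutoff=LFC_CUTOFF):
    """Return {gene: (probes passing the cutoff, total probes)}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for _probe, gene, lfc in rows:
        total[gene] += 1
        if abs(lfc) >= cutoff:
            passed[gene] += 1
    return {gene: (passed[gene], total[gene]) for gene in total}


support = probe_support(probes)
for gene, (hits, n) in sorted(support.items()):
    print(f"{gene}: {hits} of {n} probes pass the cutoff")
```

A gene where only some probes pass is then a candidate for a closer look at the probe sequences, rather than something to average away.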
So if you must provide just one number, I would choose one probe and
display that information, but I would never average across probes. How
to choose which one... it's up to you ;)
If they behave similarly... pick one randomly.
If there are two different behaviours... you can display a
representative of each, or pick the most common one, or the one that
displays a behaviour most interesting for your purposes. There is no
general rule. I favour showing a representative for each behaviour:
the fact that I do not understand why I get different behaviours does
not necessarily mean one of them is an artifact, so I like to avoid
discarding any information I might find useful later as I learn more
about the system.
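
One possible way to sketch "a representative for each behaviour" in Python is shown below. Everything in it is an assumption for illustration: the gene and probe names are invented, behaviour is reduced to up/down/flat using a log2 fold-change cutoff of 1.0, and the representative is the probe with the largest absolute change (picking at random within a group would be equally defensible).

```python
from collections import defaultdict


def behaviour(lfc, cutoff=1.0):
    """Classify a log2 fold change as 'up', 'down' or 'flat' (assumed cutoff)."""
    if lfc >= cutoff:
        return "up"
    if lfc <= -cutoff:
        return "down"
    return "flat"


def representatives(rows, cutoff=1.0):
    """Return {gene: {behaviour: probe_id}}, one probe per observed behaviour."""
    groups = defaultdict(lambda: defaultdict(list))
    for probe, gene, lfc in rows:
        groups[gene][behaviour(lfc, cutoff)].append((probe, lfc))
    return {
        gene: {
            # Representative = largest absolute change within the group.
            label: max(members, key=lambda m: abs(m[1]))[0]
            for label, members in by_behaviour.items()
        }
        for gene, by_behaviour in groups.items()
    }


# Hypothetical probe-level data: (probe_id, gene, log2 fold change).
rows = [
    ("pA", "GENEX", 1.6),
    ("pB", "GENEX", 1.4),
    ("pC", "GENEX", 0.1),
    ("pD", "GENEY", -1.2),
]

reps = representatives(rows)
for gene, by_behaviour in sorted(reps.items()):
    print(gene, by_behaviour)
```

Here GENEX keeps two rows, one for the probes that go up and one for the probe that stays flat, so no behaviour is discarded.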
When you then summarise and count genes/transcripts/probes, just state
what it is that you are counting. I don't think there is anything
wrong with saying you identified 10 genes, and showing a table with 12
rows, where two of the genes have two entries each. But it all really
depends on your goal.
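
To make the "state what you are counting" point concrete, here is a small Python sketch that reports rows and genes separately. The gene and probe names are hypothetical and simply reproduce the 10 genes / 12 rows situation described above.

```python
from collections import Counter

# Hypothetical result table as (gene, probe_id) rows:
# 10 genes, two of which (G1 and G2) are represented by two probes each.
rows = [(f"G{i}", f"probe{i}a") for i in range(1, 11)]
rows += [("G1", "probe1b"), ("G2", "probe2b")]


def summarise(table):
    """Count table rows and distinct genes, and list genes with several rows."""
    per_gene = Counter(gene for gene, _probe in table)
    multi = sorted(gene for gene, n in per_gene.items() if n > 1)
    return {"rows": len(table), "genes": len(per_gene), "multi_probe_genes": multi}


summary = summarise(rows)
print(f"{summary['genes']} genes in {summary['rows']} rows; "
      f"genes with more than one probe: {', '.join(summary['multi_probe_genes'])}")
```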
Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6507090
Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR