[BioC] Duplicate probes

Wed Mar 21 18:38:51 CET 2012

Quoting Ed Siefker <ebs15242 at gmail.com> on Wed, 21 Mar 2012 11:20:17 -0500:

> I am analyzing affymetrix hgu133plus2 arrays with limma.
> These arrays sometimes contain multiple probes for a
> single gene.  I would like to combine the readings so that
> I get exactly one estimate of fold change per (entrez) gene.
>
> I looked at the duplicateCorrelation() function, but that
> doesn't seem to apply.  If I understand correctly, it's for
> averaging duplicate spots per probe, not duplicate probes
> per gene. It requires the same number of duplicates across
> the chip anyway, which I don't have.
>
> Just for illustration, here's a sample of some normalized
> expression data:
>
>
>        control average  test average  test-control  Linear fold change
> GENE1          2.38127       4.00571       1.62444             3.08322
> GENE1          12.1182       13.5405       1.42224             2.68001
> GENE1          9.85812       11.4534       1.59533             3.02163
> GENE2          12.9662       12.7992       -0.1670             0.89070
> GENE3          12.9649       12.9777       0.01275             1.00887
> GENE3          2.23400       2.22957       -0.0044             0.99693
> GENE3          11.8682       11.6099       -0.2583             0.83606
>
>
> So it's pretty obvious that I can't just average the expression
> values, as they range from around 2 to around 12 for the same
> gene.  It's also clear that I can't just filter out the probes with
> the least fold change, because that would lead to GENE1 and
> GENE3 being called as differentially expressed, when the data
> appears to support differential expression of GENE3 much
> more strongly than it does GENE1.
>
> For GENE3, 3 of 3 probes show a fold change near 3.  For
> GENE1, 2 of 3 probes show no fold change at all.  How do I use
> this information to adjust the estimation of confidence in differential
> expression?

Hi Ed,

it's not trivial to decide what to do with multiple probes.
There are methods to summarise probeset data, using some kind of
weighted median algorithm, among other approaches. But the truth is that
sometimes probes "misbehave": they do not provide the signal we
expect. Perhaps they cross-hybridise with other RNAs that we do not
know about, for instance transcript variants that are not
annotated.
I personally have decided to keep each probe separate in my analyses.
When I look at my list of DE genes, I would expect that if a
transcript is represented on my arrays 3 times, by 3 different probes,
all three would appear in my DE list. If I get only 2, I can then ask
why the third probe did not behave the same way... that information is
sometimes interesting, as you have the sequence information and can
check where it matches. Sometimes you just can't figure it out... but
I think 2 out of 3 is decent, so I keep it in my list. Even if you
only get one hit, it can be a good hit... The bottom line is it will
be hard to decide which to discount and which to trust without
detailed investigation... and possibly experimentation. That's ok if
you decide to focus on a handful of transcripts after seeing your
results, but it is not practical on a large scale.
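
A minimal sketch of that bookkeeping, written in Python for illustration rather than in R/limma. It uses the log2 differences (the "test-control" column) from the sample table in the question. The cutoff of 1.0 (a two-fold change) and the function name tally_de_probes are assumptions made for this example; a real analysis would more likely filter on limma's adjusted p-values.

```python
from collections import defaultdict

# (gene, log2 fold change) for each probe, copied from the sample table
# in the question ("test-control" column).
probes = [
    ("GENE1", 1.62444), ("GENE1", 1.42224), ("GENE1", 1.59533),
    ("GENE2", -0.1670),
    ("GENE3", 0.01275), ("GENE3", -0.0044), ("GENE3", -0.2583),
]

# Assumption for illustration: call a probe "DE" if |log2 FC| >= 1.
LOG2FC_CUTOFF = 1.0


def tally_de_probes(rows, cutoff=LOG2FC_CUTOFF):
    """Return {gene: (number of probes passing the cutoff, total probes)}."""
    counts = defaultdict(lambda: [0, 0])
    for gene, lfc in rows:
        counts[gene][1] += 1
        if abs(lfc) >= cutoff:
            counts[gene][0] += 1
    return {gene: tuple(c) for gene, c in counts.items()}


tally = tally_de_probes(probes)
for gene, (n_de, n_total) in sorted(tally.items()):
    print(f"{gene}: {n_de} of {n_total} probes pass the cutoff")
```

Genes where only some of the probes pass are the ones worth a closer look at the probe sequences.
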
So if you must provide just one number, I would choose one probe and  
display that information, but I would never average across probes. How  
to choose which one... it's up to you ;)
If they behave similarly... pick one randomly.
If there are two different behaviours... you can display a  
representative of each, or pick the most common one, or the one that  
displays a behaviour most interesting for your purposes. There is no  
general rule. I favour showing a representative for each behaviour:  
the fact that I do not understand why I get different behaviours does  
not necessarily mean one of them is an artifact, so I like to avoid  
discarding any information I might find useful later as I learn more  
about the system.
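
One way to sketch "a representative for each behaviour" in Python. Everything here is hypothetical: the probe IDs, gene names and values are made up, the three behaviour classes (up, down, flat) and the cutoff of 1.0 are assumptions, and taking the probe with the largest absolute change as the representative is just one deterministic choice among many (a random pick would do as well).

```python
def behaviour(lfc, cutoff=1.0):
    """Classify a log2 fold change as 'up', 'down' or 'flat' (assumed classes)."""
    if lfc >= cutoff:
        return "up"
    if lfc <= -cutoff:
        return "down"
    return "flat"


def representatives(rows, cutoff=1.0):
    """Keep one probe per (gene, behaviour): the one with the largest |log2 FC|."""
    best = {}
    for probe_id, gene, lfc in rows:
        key = (gene, behaviour(lfc, cutoff))
        if key not in best or abs(lfc) > abs(best[key][1]):
            best[key] = (probe_id, lfc)
    return best


# Hypothetical data: (probe id, gene, log2 fold change).
rows = [
    ("p1", "GENE_A", 1.6),
    ("p2", "GENE_A", 1.4),
    ("p3", "GENE_A", 0.1),   # same gene, different behaviour
    ("p4", "GENE_B", -1.2),
]

reps = representatives(rows)
for (gene, kind), (probe_id, lfc) in sorted(reps.items()):
    print(f"{gene} [{kind}]: {probe_id} (log2 FC {lfc:+.2f})")
```

GENE_A ends up with two rows, one per behaviour, so nothing is thrown away just because it is not yet understood.
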
When you then summarise and count genes/transcripts/probes, just state
what it is that you are counting. I don't think there is anything
wrong with saying you identified 10 genes, and showing a table with 12
rows, where two of the genes have two entries each. But it all really
depends on your goal.
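
The 10-genes-in-12-rows case can be reported explicitly with a few lines of Python. The gene labels below are hypothetical placeholders chosen only to reproduce those numbers.

```python
from collections import Counter

# Hypothetical result table: one gene label per reported row.
# Eight genes appear once; GENE_I and GENE_J each appear twice.
rows = [f"GENE_{c}" for c in "ABCDEFGH"] + ["GENE_I", "GENE_I", "GENE_J", "GENE_J"]

per_gene = Counter(rows)
n_rows = len(rows)        # what the table shows
n_genes = len(per_gene)   # what the text should claim
multi = sorted(g for g, n in per_gene.items() if n > 1)

print(f"{n_genes} genes reported in {n_rows} rows; "
      f"multiple entries for: {', '.join(multi)}")
```
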

Jose

-- 
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6507090
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
