[BioC] Question about inconsistent gene expression
Chunyan Liu
Chunyan.Liu at cchmc.org
Fri Aug 24 15:33:06 CEST 2007
Dear James MacDonald and all from the list,
I posted my question on the list last Wednesday and James MacDonald
gave me profound insight on the issue. The following is a quick review
and then another follow-up question.
>Liu: I'm doing gene expression comparisons between two groups of
subjects using affymetrix single-channel hgu133plus2 microarray chips.
After using limma, I get a list of up- and down-regulated probeset
(1,000 up and 2,000 down regulated probesets). When I translate these
into unique gene symbols, I found 200 gene symbols that appear in both
lists. Is this plausible? Interpretable?
>MacDonald: Ah, now that is the problem, isn't it? Another problem is
the case where 10 probesets are supposed to interrogate a particular
gene and one is significant, but the other nine are not. In that case is
the gene differentialy expressed or not? What you have to understand is
that Affy designed the probesets for this chip based on the UniGene
build 133, which was the best information at the time, but which is
really outdated now (we are on build 203 currently). Even when they
designed the chip, there were three levels of probesets. Those with an
_at suffix, which indicated that the probes all blast exclusively to the
transcript in question, those with an _s_at (or _a_at, I forget what
they used for the 133), that indicates that some of the probes bind to
related transcripts (whatever 'related' means), and _x_at, which
indicates that some probes bind to completely unrelated transcripts. So
even when the chip was designed, some of the probesets were not nearly
as reliable as others. If you take the probe sequences and blast them
today, you can find _at probesets with probes that bind to unrelated
sequences, so time has not always been kind to the probe mappings. What
can you do about this problem? There are a couple of things you can do,
but any 'fix' has its own problems. First, you can use the remapped cdfs
that are made available by the MBNI at the University of Michigan (via
BioC). These remapped cdfs discard the original probesets and only use
those probes that are known to map to unique sequences in the genome
(based on the current UniGene build), and then map to transcripts or
genes based on Entrez Gene, GenBank, UniGene, Ensembl, etc. The upside
to these cdfs is that you will have only one probeset per
transcript/gene, so it will be impossible to have a gene symbol
appearing in both the up and down regulated groups. In addition, the
assumptions of say RMA or GCRMA (or any probe-level models in affyPLM)
will again hold true; in other words, the intensity of a given probe
will be due only to the level of the transcript it is supposed to
measure plus the probe-specific binding. The downside of these cdfs is
that the number of probes per probeset will vary from something like 3 -
150, so the standard error of your estimate will also vary widely. If
you simply take the expression values for these probesets and analyze
using limma, you will be ignoring this extra level of error (which you
can safely ignore using the 'stock' affy cdfs, since most of those
probesets have 11 probes per). Second, you can just use the 'stock' affy
cdfs, and do some ad hoc method to decide which of the probesets to
believe. You can simply choose to believe only the _at probesets. Or you
can decide to blast (or blat, which is much faster and AFAICT nearly as
accurate) each of the disagreeing probesets to see which one appears to
actually measure the gene transcript in question. The upside here is you
don't have the extra level of variability introduced by the MBNI cdfs,
but the downside is the amount of extra work it will entail.
Follow-up question: Now, my thought is, given the noisy nature of
microarray gene expression, I do believe for any mild to moderate gene
effect, maybe even with strong gene effect like in cancer, this is a
common but unrecognized problem. Does anyone notice this in their
microarray analysis? If you do, please share your experience and
thoughts with us because I really think this is something that deserves
attention.
I did some analyses based on James response and here is what I found:
Among the 3000 probesets list from limma that were significantly
expressed, some genes had multiple probesets (but only about 500), the
following table summarizes how the disagreement varies by the number of
signigicant probesets within a gene. For example, there were 400 genes
that were represented by 2 significant probesets and of the 80 genes
were inconsistent! That is, one probeset was up-regulated and the other
was down-regulated. Of course, the more probesets, the greater the
chance of one disagreement.
probesets #of genes #of disagree % with at
least one disagreement
2 400 80
20%
3 100 40
40%
4 30 15
50%
5 5 4
80%
7 1 1
100%
Then I picked out the probesets only with the suffix "_at" and did the
same as above. Here are what I found. There were a total of about 2000
"_at" probesets.
probesets #of genes #of disagree %
2 120 40
33%
3 40 20
50%
4 6 3
50%
5 2 2
100%
An easy fix to not deal with the problem is to select the probesets
with the largest IQR (or any other statistic) to get one probeset per
gene as in GeneSet Enrichment Analysis by Robert Gentlemen. Any thoughts
on doing this?
So, has anyone else observed this result? is this common? So please
share your thoughts with me.
Thank you,
Chunyan Liu
Cincinnati Children's Hospital Medical Center
More information about the Bioconductor
mailing list