[BioC] Questions about gene identifiers and probesets regulation
James W. MacDonald
jmacdon at med.umich.edu
Thu Aug 16 17:02:47 CEST 2007
Hi Chunyan,
Chunyan Liu wrote:
> Dear all,
>
> I'm doing gene expression comparisons between two groups of subjects
> using affymetrix single-channel hgu133plus2 microarray chips and I have
> two questions.
>
> 1) Relationship among manufacturer ID, EntrezID, GenBank ID and gene
> SYMBOL: Is there any one-to-one mapping?
>
> I noticed that the hgu133plus2 environment gives annotations through
> Entrez ID. Is this always the case? It seems to me that one EntrezID
> corresponds to multiple manufacturer IDs (probe name), but is this the
> case between manufacturer ID and GenBank ID? Is it true that one
> EntrezID maps to one gene symbol?
I'm not sure if there is a one-to-one mapping from probeset ID to
GenBank ID, but there certainly isn't a one-to-one mapping of GenBank ID
to gene symbol (as GenBank IDs map things at the transcript level), so I
am not sure that would help.
I think there is a one-to-one mapping from Entrez Gene to symbol, but I
am not 100% sure about that.
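
One way to settle the one-to-one question for yourself is to count how many distinct values each identifier maps to. The following is only a minimal sketch in Python: the rows are made-up placeholders, not real hgu133plus2 annotation, and the helper name max_fanout is hypothetical.

```python
from collections import defaultdict

# Hypothetical (probeset ID, Entrez Gene ID, gene symbol) rows.
# These are placeholders for illustration, not real annotation data.
rows = [
    ("ps_1_at", "1001", "GENEA"),
    ("ps_2_s_at", "1001", "GENEA"),
    ("ps_3_at", "1002", "GENEB"),
]


def max_fanout(pairs):
    """Return the largest number of distinct right-hand values seen for any left-hand key."""
    seen = defaultdict(set)
    for left, right in pairs:
        seen[left].add(right)
    return max(len(values) for values in seen.values())


# A result of 1 means no key maps to more than one value in this direction.
entrez_to_probesets = max_fanout((entrez, probeset) for probeset, entrez, symbol in rows)
entrez_to_symbols = max_fanout((entrez, symbol) for probeset, entrez, symbol in rows)
print(entrez_to_probesets, entrez_to_symbols)
```

A mapping is one-to-one only if the fan-out is 1 in both directions, so the same check would need to be run with the columns swapped.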
>
> 2) Probesets: Another question is after using limma, I get a list of
> up- and down-regulated probeset when comparing two groups (1,000 up and
> 2,000 down regulated probesets). When I translate these into unique gene
> symbols, I find 200 gene symbols that appear in both lists. Is this
> plausible? Interpretable?
Ah, now that is the problem, isn't it? Another problem is the case where
10 probesets are supposed to interrogate a particular gene and one is
significant, but the other nine are not. In that case, is the gene
differentially expressed or not?
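
Both situations can at least be flagged mechanically before deciding what to do about them. This is a rough sketch in Python with invented probeset names, symbols and calls (+1 up, -1 down, 0 not significant); it is not tied to limma's actual output format.

```python
# Hypothetical probeset -> gene symbol map and per-probeset calls.
# +1 = up, -1 = down, 0 = not significant. All values are invented.
symbol_of = {
    "p1_at": "GENEA",
    "p2_x_at": "GENEA",
    "p3_at": "GENEB",
    "p4_s_at": "GENEB",
    "p5_at": "GENEC",
}
call = {"p1_at": 1, "p2_x_at": -1, "p3_at": 1, "p4_s_at": 0, "p5_at": -1}

# Symbols that show up in both the up and the down list.
up = {symbol_of[p] for p, c in call.items() if c > 0}
down = {symbol_of[p] for p, c in call.items() if c < 0}
conflicting = sorted(up & down)

# Symbols where some probesets are significant and others are not.
calls_by_gene = {}
for p, c in call.items():
    calls_by_gene.setdefault(symbol_of[p], []).append(c)
partial = sorted(g for g, cs in calls_by_gene.items() if any(cs) and not all(cs))

print(conflicting, partial)
```

Flagging these genes does not say which probeset to believe; it only tells you where to look.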
What you have to understand is that Affy designed the probesets for this
chip based on the UniGene build 133, which was the best information at
the time, but which is really outdated now (we are on build 203 currently).
Even when they designed the chip, there were three levels of probesets:
those with an _at suffix, which indicates that the probes all blast
exclusively to the transcript in question; those with an _s_at (or
_a_at, I forget what they used for the 133) suffix, which indicates that
some of the probes bind to related transcripts (whatever 'related'
means); and those with an _x_at suffix, which indicates that some probes
bind to completely unrelated transcripts.
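
The suffix convention can be turned into a quick classifier. This is a sketch in Python under the assumption that the suffix alone is what you want to key on; the function name and the labels are hypothetical paraphrases of the three levels described above, and the IDs are just illustrative strings. Note that the longer suffixes have to be tested first, since every one of them also ends in _at.

```python
def probeset_level(probeset_id):
    """Classify a probeset ID by its suffix; labels paraphrase the three levels above."""
    # Check the longer suffixes first: "_s_at" and "_x_at" also end in "_at".
    longer_suffixes = (
        ("_x_at", "unrelated transcripts"),
        ("_s_at", "related transcripts"),
        ("_a_at", "related transcripts"),
    )
    for suffix, label in longer_suffixes:
        if probeset_id.endswith(suffix):
            return label
    if probeset_id.endswith("_at"):
        return "exclusive"
    return "unknown"


# Illustrative ID strings only.
for pid in ("1053_at", "1007_s_at", "0000_x_at"):
    print(pid, probeset_level(pid))
```

Keep in mind that a plain _at suffix only reflects what was known when the chip was designed, so this is a first-pass filter at best.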
So even when the chip was designed, some of the probesets were not
nearly as reliable as others. If you take the probe sequences and blast
them today, you can find _at probesets with probes that bind to
unrelated sequences, so time has not always been kind to the probe mappings.
What can you do about this problem? There are a couple of things you can
do, but any 'fix' has its own problems.
First, you can use the remapped cdfs that are made available by the MBNI
at the University of Michigan (via BioC). These remapped cdfs discard
the original probesets and only use those probes that are known to map
to unique sequences in the genome (based on the current UniGene build),
and then map to transcripts or genes based on Entrez Gene, GenBank,
UniGene, Ensembl, etc.
The upside to these cdfs is that you will have only one probeset per
transcript/gene, so it will be impossible to have a gene symbol
appearing in both the up and down regulated groups. In addition, the
assumptions of, say, RMA or GCRMA (or any probe-level models in affyPLM)
will again hold true; in other words, the intensity of a given probe
will be due only to the level of the transcript it is supposed to
measure plus the probe-specific binding.
The downside of these cdfs is that the number of probes per probeset
will vary from something like 3 to 150, so the standard error of your
estimate will also vary widely. If you simply take the expression values
for these probesets and analyze using limma, you will be ignoring this
extra level of error (which you can safely ignore using the 'stock' affy
cdfs, since most of those probesets have 11 probes each).
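
To get a feel for how much the precision can differ, here is a back-of-the-envelope sketch in Python. It assumes the probeset summary behaves like a simple mean of independent probes with a common standard deviation sigma, which is a simplification (RMA, for instance, does not summarize with a plain mean), so treat the numbers as illustrative only. The function name is hypothetical.

```python
import math


def se_of_mean(sigma, n_probes):
    """Standard error of a mean of n_probes independent values with common SD sigma."""
    return sigma / math.sqrt(n_probes)


# Compare a small remapped probeset, a stock 11-probe probeset, and a large remapped one.
for n in (3, 11, 150):
    print(n, round(se_of_mean(1.0, n), 3))
```

Under that assumption the standard error scales as 1/sqrt(n), so a 3-probe probeset is roughly seven times noisier than a 150-probe one.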
Second, you can just use the 'stock' affy cdfs, and use some ad hoc
method to decide which of the probesets to believe. You can simply
choose to believe only the _at probesets. Or you can decide to blast (or
blat, which is much faster and AFAICT nearly as accurate) each of the
disagreeing probesets to see which one appears to actually measure the
gene transcript in question. The upside here is you don't have the extra
level of variability introduced by the MBNI cdfs, but the downside is
the amount of extra work it will entail.
HTH,
Jim
>
> Thank you very much for any input.
>
> Chunyan Liu
> Cincinnati Children's Hospital
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623