[BioC] Queries about getting annotation post-Limma analysis
James W. MacDonald
jmacdon at uw.edu
Wed May 15 17:02:29 CEST 2013
Hi Jeremy,
On 5/14/2013 8:35 PM, Jeremy Ng wrote:
> Hi James and Paul,
>
> Thanks really much for the advice (and thanks James for the package, it
> works like a charm)
>
> Just want to clarify (about the Affymetrix Human Exon array as a whole); it
> seems to be that there are no available db packages that can be used? (the
> closest to an annotation package for use is the pd.huex.1.0.st.v2 package,
> which is used for Oligo RMA normalization).
Not exactly. What you have to understand about the exon arrays is that
they are really complex, and any analysis is not likely to be simple or
straightforward. This is, IMO, why they now have the ability to
summarize at the transcript level, so people can do fairly naive analyses.
The reason the exon arrays are complex has to do with a couple of
factors. First, there is a ton of speculative content on the chip, where
the evidence for transcription at a given site is fairly speculative.
Back in the day, you could only summarize at the probeset level, and the
only difference between core, full, and extended was how much
speculative content was included.
Second, unlike the 3'-biased arrays, the Exon (and Gene ST) arrays share
probes between probesets, and the probes may target multiple regions. In
other words, for the 3'-biased arrays Affy tried to ensure that a given
probe and/or probeset only bound to a single region, so when you got
binding on the array you could attribute that binding to a single
transcript. In addition, a given probe was only part of a single
probeset. For the Exon and Gene arrays, a single probe may be part of
many different probesets. In addition, there may be multiple probesets
that target a single exon.
It wouldn't be difficult to make an annotation package for the Exon
arrays, but I don't think it is really advisable. This would
intentionally mask the underlying complexity of the arrays, and allow
naive users to just Do Something(tm) and then report the results.
Instead, if you are going to analyze your data at the probeset level,
you should really be using the annmap package. See for example, the
annmap cookbook:
http://bioconductor.org/packages/2.12/bioc/vignettes/annmap/inst/doc/cookbook.pdf
in particular, starting on the bottom of page 10.
You will see right away that analyzing the Exon arrays at the probeset
level is really complicated.
If you still want to do something really naive, building the .db package
isn't that difficult. You need the probeset.csv annotation file from
Affy, and the AnnotationForge package. You will then need to parse
whatever column has the correct annotation (it is the mrna_assignment
column in the transcript.csv file, so maybe the same column?).
That column will likely have something like this:
ENST00001 // some // cruft // here /// NM_0329091 // different // cruft
// here /// etc etc
In general you want the NM_ or NR_ (RefSeq) or GenBank IDs, but those
are sort of hard to get as they can be lots of different things. You can
also use the Ensembl transcript IDs, but will have to map. I usually
parse like so:
dat <- read.csv("probesest filename goes here", comment.char = "#",
stringsAsFactors=FALSE, na.string = "---")
mrna <- sapply(strsplit(dat$mrna_assignment, " // | /// "), function(x)
grep("^NM|^NR|^[A-E][A-Z]+[0-9]+", x, value = TRUE)[1])
So the mrna vector will contain RefSeq IDs, or GenBank if there were no
RefSeq, or Ensembl Transcript IDs if none of the former two, or NA if
there wasn't anything. You can then extract the Ensembl Transcript IDs
and map to RefSeq or GenBank
ens <- grep("^ENST", mrna, value = TRUE)
library(org.Hs.eg.db)
ens <- select(org.Hs.eg.db, ens, c("REFSEQ","ACCNUM"), "ENSEMBLTRANS")
ens <- ens[!duplicated(ens[,1]),]
## use accnum if refseq is NA
ens[is.na(ens[,2]),2] <- ens[is.na(ens[,2]),3]
## put mapped data back in mrna vector
mrna[match(ens[,1], mrna)] <- ens[,2]
## write out
write.table(cbind(dat[,1], mrna), "huex_mapper.txt", sep = "\t", quote =
FALSE, row.names = FALSE, col.names = FALSE)
Then use this file as input to makeDBPackage() from AnnotationForge.
Best,
Jim
Best,
Jim
>
> I asked because other than summarization at the transcript level, I would
> also be doing summarization at the probeset level, and then again,
> post-limma, I would need to retrieve the annotations for the probes for
> further analysis.
>
> Thanks for your time, and thanks James for the package!
>
> Jeremy
>
>
> On Wed, May 15, 2013 at 3:28 AM, James W. MacDonald<jmacdon at uw.edu> wrote:
>
>> Hi Paul (and Jeremy),
>>
>> On 5/14/2013 3:11 PM, Paul Shannon wrote:
>>
>>> Hi Jeremy,
>>>
>>> how can I map the Affy ID which is found in the results from topTable to
>>>> an ENSEMBL and an ENTREZ gene ID
>>>>
>>> The bioc annotation package "hugene10stprobeset.db" and the "select"
>>> interface should provide all of you need.
>>>
>> Jeremy is using the Human Exon ST array, and summarizing at the transcript
>> level. So he needs the (non-existant) huex10sttranscriptcluster.db package
>> to do what you suggest.
>>
>> I am building such a beast as we speak, but this array has almost 340K
>> probesets when summarized at the transcript level, so this is going
>> sloooowly.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>> biocLite("hugene10stprobeset.**db")
>>> library(hugene10stprobeset.db)
>>>
>>> # what kinds of data (what columns) are store in this annotation
>>> package?
>>> keytypes(hugene10stprobeset.**db)
>>>
>>> # do a quick survey of each column
>>> for(key in keytypes(hugene10stprobeset.**db)){
>>> print(paste("---", key))
>>> print(head(keys(**hugene10stprobeset.db, keytype=key)))
>>> }
>>>
>>> # get a random sample of probe ids to use for testing
>>> sample.probe.ids<- sample(keys(**hugene10stprobeset.db,keytype=**"PROBEID"),
>>> size=10)
>>>
>>> # look these up using the select command. a data.frame is
>>> returned
>>> select(hugene10stprobeset.db, keys=sample.probe.ids,
>>> cols=c("ENTREZID", "ENSEMBL", "SYMBOL"))
>>> PROBEID ENTREZID ENSEMBL SYMBOL
>>> 1 8165444 29952 ENSG00000176978 DPP7
>>> 2 7970230 23263 ENSG00000126217 MCF2L
>>> 3 8045081 2840 ENSG00000144230 GPR17
>>> 4 7989809 54878 ENSG00000074603 DPP8
>>> 5 8015557 23415 ENSG00000089558 KCNH4
>>> 6 7930787 5406 ENSG00000175535 PNLIP
>>> 7 7984142 9960 ENSG00000140455 USP3
>>> 8 7894624<NA> <NA> <NA>
>>> 9 8170511 79057 ENSG00000130032 PRRG3
>>> 10 8105007 55100 ENSG00000082068 WDR70
>>>
>>>
>>> - Paul
>>>
>>>
>>>
>>> On May 14, 2013, at 8:42 AM, Jeremy Ng wrote:
>>>
>>> Dear all,
>>>> Following the RMA normalization of data from Affy Human Exon ST1.0 array
>>>> using the package Oligo (at the transcript level using target="core"), I
>>>> then conducted a limma analysis.
>>>>
>>>> The topTable argument in limma would then retrieve the top genes (in my
>>>> case, cause I am interested in subsequently doing GSEA analysis, I set
>>>> number=100) which are differentially addressed.
>>>>
>>>> The question I have is how can I map the Affy ID which is found in the
>>>> results from topTable to an ENSEMBL and an ENTREZ gene ID. Intuitively,
>>>> biomaRt comes to mind, and I did a biomaRt query for the list of top 100
>>>> genes which I had gotten, but I get only 14 hgnc symbols. I'd like to
>>>> think
>>>> that it's due to a lack of annotations, but I highly doubt so (14 in a
>>>> list
>>>> of 100 seems too little to me).
>>>>
>>>> My code for biomaRt is as follows:
>>>> mart<- useMart("ensembl", "hsapiens_gene_ensembl")
>>>>
>>>> hgnc<- getBM(attributes=c("hgnc_**symbol",
>>>> "ensembl_gene_id"),values=**top100$ID, filters="affy_huex_1_0_st_v2",
>>>> mart=mart)
>>>>
>>>> I was wondering 2 things:
>>>> 1. Is there any plausible explanation to why the query only returns 14
>>>> IDs;
>>>> and
>>>> 2. Are there other ways that I can use to fetch annotations from a
>>>> post-Limma analysis?
>>>>
>>>> My session info is as follows:
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> Thanks for any advice!
>>>>
>>>> Jeremy
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives: http://news.gmane.org/gmane.**
>>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list