[BioC] getting a list of genomic sequences from a list of gene names
Paul Shannon
pshannon at fhcrc.org
Tue Jul 16 19:48:14 CEST 2013
To summarize our off-list discussion, and present our final suggested answer, using a sample entrez geneID to demonstrate, we offer the code shown below.
A somewhat lengthy exposition of the 9 steps involved will be found below, following the 18 lines of excutable code.
Thanks for posing this question, Eric, and for your patience while I worked up a reply.
- Paul
library(Biostrings)
library(org.Hs.eg.db)
library (BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
trak2 <- "66008" # sample gene
transcripts.grL <- transcriptsBy(txdb, by="gene")[trak2]
transcript.names <- mcols(unlist(transcripts.grL))$tx_name # "uc002uyb.4" "uc002uyc.2"
intronsByTx.grL <- intronsByTranscript(txdb, use.names=TRUE)[transcript.names]
print(elementLengths(intronsByTx.grL))
intron.seqs <- getSeq(Hsapiens, unlist(intronsByTx.grL))
intron.seqs.by.transcript <- relist(intron.seqs, skeleton=intronsByTx.grL)
print(elementLengths(intron.seqs.by.transcript))
# uc002uyb.4 uc002uyc.2
# 15 7
exonsByTx.grL <- exonsBy(txdb, by="tx", use.names=TRUE)[transcript.names]
print(elementLengths(exonsByTx.grL))
exons.seq <- getSeq(Hsapiens, unlist(exonsByTx.grL))
exon.seqs.by.transcript <- relist(exons.seq, skeleton=exonsByTx.grL)
print(elementLengths(exon.seqs.by.transcript))
# uc002uyb.4 uc002uyc.2
# 16 8
---- The strategy
1) DNA sequence for the introns and exons of a gene, is best understood as
"sequence for the introns and exons for each known transcript of a gene"
2) So the first task is to identify all the transcripts associated with the gene of interest
This will be represented as a GRangesList, where each element in the list is
a) named by the geneID
b) contains a GRanges of the coordinates, name and id of each associated transcript
All of this information comes from a Bioc "TxDB" object, which in this case is our representation
of the UCSC "knownGene" table.
3) Once you have such a GRangesList, for one or more geneIDs, you want to ignore (for now)
that gene information, and just extract the names of all of the transcripts. To do
this extraction, you must flatted (or "unlist") the GRangesList.
4) Now you have a simple character vector containing the names of all the
transcripts associated with your gene of interest.
5) Use those names to obtain, as a GRangesList structured by transcript name, the
coordinates of all introns, then of all exons. Don't be confused or put off
that we provide similar methods with slighly dissimilar names:
intronsByTranscript()
exonsBy()
6) You are almost ready to call getSeq () with the coordinates returned at step 5.
But getSeq will not accept a GRangesList. It does accept a GRanges. So you
must unlist the intron and exon coordinates, shedding the "by transcript" structure.
7) getSeq returns a DNAStringSet corresponding to the GRanges (actually, the unlisted GRangesList)
you supplied.
8) This DNAstringSet now needs to be relisted, so that the by-transcript organization is
imposed, producing a DNAStringSetList whose elements each have a transcript name.
9) All of these intron and exon sequences can associated with the geneID/s you started with
by referring to the "transcripts.grL" you created at the very start.
On Jun 19, 2013, at 5:33 PM, Eric Foss [guest] wrote:
>
> Is there a way for me to use Bioconductor to take a list of gene names and give me back a list of genomic sequences, preferably with the exons and introns easily differentiable (e.g. exons in upper case and introns in lower case)?
>
> Thank you.
>
> Eric
>
>
> -- output of sessionInfo():
>
> not relevant
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list