[BioC] extract introns
Hervé Pagès
hpages at fhcrc.org
Sat Nov 19 00:02:08 CET 2011
Hi Yating,
On 11-11-16 09:26 AM, Yating Cheng wrote:
> Hi, It is lucky that I finally I got the sequences. But it is strange that
> I got two different results with two scripts. And another problem is that
> I can only get the sequences but how can I match them to the geneID.
> Because I am not sure whether these sequences are from the same gene. I
> have to get the intron sequences from a whole gene.
>
> The first one I tried
>
> R> bee.txdb<- makeTranscriptDbFromUCSC(genome="apiMel2",
> tablename="ensGene")
> R> introns<- intronsByTranscript(bee.txdb, use.names=TRUE)
> R> library(BSgenome.Amellifera.UCSC.apiMel2)
> R> all.introns<- unlist(introns)
> R> all.seqs<- getSeq(Amellifera, all.introns)
>
> With this script I got 148077 sequeces
>
> The second one was
> library(GenomicFeatures)
>
>
> txdb<- makeTranscriptDbFromUCSC("apiMel2", "genscan")
> introns<-intronsByTranscript(txdb)
> introns<- unlist(introns)
>
> intron.seqs<-(getSeq(Amellifera,introns,as.character=FALSE))
>
> intron_seqs<as.character(intron.seqs)
>
> With this script I got 80053 sequences.
>
> I think the first one makes sense, They should be the seperated introns,
> but I need the attached gene Names, so that I can do further calculation.
1. First extract the mapping between transcript ids and gene ids:
tx <- transcripts(bee.txdb, columns=c("tx_id", "tx_name", "gene_id"))
> elementMetadata(tx)[1:3, ]
DataFrame with 3 rows and 3 columns
tx_id tx_name gene_id
<integer> <character> <CompressedCharacterList>
1 1 ENSAPMT00000020635 ENSAPMG00000012500
2 5 ENSAPMT00000025701 ENSAPMG00000016403
3 3 ENSAPMT00000028256 ENSAPMG00000016403
Let's make sure that every transcript is mapped to exactly 1 gene:
> anyDuplicated(elementMetadata(tx)$tx_name)
[1] 0
> table(elementLengths(elementMetadata(tx)$gene))
1
27755
Good! So:
tx2gene <- unlist(elementMetadata(tx)$gene_id)
names(tx2gene) <- elementMetadata(tx)$tx_name
The tx-to-gene mapping is a named character vector:
> tx2gene[1:3]
ENSAPMT00000020635 ENSAPMT00000025701 ENSAPMT00000028256
"ENSAPMG00000012500" "ENSAPMG00000016403" "ENSAPMG00000016403"
2. Use this mapping to replace the names of your 'introns' object:
introns <- intronsByTranscript(bee.txdb, use.names=TRUE)
names(introns) <- tx2gene[names(introns)]
3. Unlist 'introns" and propagate the names "by hand":
all.introns <- unlist(introns, use.names=FALSE)
names(all.introns) <- rep(names(introns), elementLengths(introns))
4. When you extract the introns sequences, propagate the names again:
all.seqs <- getSeq(Amellifera, all.introns)
names(all.seqs) <- names(all.introns)
>
> Regarding the second one, I think I may did something wrong, but I could
> not figure it out.
I don't think you are doing anything wrong. You are using 2 different
UCSC tracks i.e. 2 different sets of annotations. Each of them is
coming from a different organization, that is, Ensembl for the "ensGene"
track, and the Genscan program (written by Chris Burge) for the
"genscan" track.
UCSC provides details about those tracks:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=223923545&c=GroupUn&g=ensGene
http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=223923545&c=GroupUn&g=genscan
That's life with annotations, I suppose. If there was 1 authoritative
set of annotations for a given organism that everybody agrees on,
Bioinformatics would be a very different world...
Another thing to keep in mind if you're doing sequence extraction from
a genome based on a set of annotations is that the reference genome
and the annotations must *match*, which was probably not the case in
your first attempts (when you started this thread) i.e. it seems that
you were using annotations you got from Ensembl (thru BiomaRt) without
checking which reference genome they were based on and then using a
BSgenome package for Bee that was not necessarily the correct one.
As Steve mentioned, by using makeTranscriptDbFromUCSC() and the
corresponding BSgenome package, you're safe.
Cheers,
H.
>
> Thanks
>
> Yating
>
>
>> Hi,
>>
>> On Tue, Nov 15, 2011 at 5:57 PM, Yating Cheng<yating.cheng at charite.de>
>> wrote:
>>> Hi Steve,
>>>
>>> I am working with Apis mellifera.
>>>
>>> Actually I am also trying to use the alignment function to get the
>>> intron
>>> sequence.which is much easier for me to understand. But there are some
>>> problem with that also.
>>
>> I'm not sure why you want to align sequences to find them on the
>> chromosome, when you already know where they are on the chromosome?
>>
>>> rm(list=ls(all=TRUE))
>>>
>>> # set working directory
>>> setwd("F:/Yating/introns")
>>>
>>> # load Biostrings library
>>> library(Biostrings)
>>>
>>> # load data
>>> genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE)
>>> exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE)
>>>
>>> # create object for storage of extracted introns
>>> introns<-data.frame()
>>>
>>> # loop over all genes
>>> for(i in 1:length(genebody[,1])){
>>>
>>> tmp.id<-as.vector(genebody[i,1])
>>> # get gene id
>>> tmp.subject<-as.vector(genebody[i,2])
>>> # get gene sequence
>>> tmp.exons<-exons[which(exons[,1]==tmp.id),]
>>> # get exons of the
>>> selected genes
>>>
>>> tmp.pattern<-as.vector(tmp.exons[,3])
>>> # define exons as patterns
>>> for alignment
>>> tmp.align<-pairwiseAlignment(pattern=tmp.pattern,
>>> subject=tmp.subject,type="local") # align all
>>> exons pairwise to gene
>>> sequence
>>> tmp.start<-start(subject(tmp.align))
>>> # vector of all alignment
>>> starts
>>> tmp.end<-end(subject(tmp.align))
>>> # vector of all
>>> alignment ends
>>>
>>>
>>>
>>> for(ex in 1:(length(tmp.end)-1)){
>>> # extract introns
>>>
>>>
>>> tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1)
>>>
>>> introns<-rbind(introns,cbind(tmp.id,tmp.end[ex]+1,tmp.start[ex+1]-1,tmp.intron))
>>>
>>> }
>>>
>>> }
>>>
>>> }
>>>
>>> # define useful names for columns
>>> colnames(introns)<-c("gene.id","pos.start","pos.end","intron")
>>>
>>> # write output
>>> write.table(introns,"Introns.txt", sep="\t", quote=FALSE,
>>> row.names=FALSE)
>>>
>>> The problem is the unvalid arguments in the substr....
>>
>> Well I can't tell for sure as you haven't told us what errors you are
>> getting, but the names of your variables suggest that you are using
>> substr with a number in its 2nd arg that is larger than the number in
>> its 3rd arg.
>>
>> I still think your life will be easier if you use GenomicFeatures to
>> download your gene annotations and then play with them that way. You
>> will have to invest sometime to learn a bit more about the
>> functionality provided by the IRanges, GRanges, GenomicFeatures
>> packages, but it will be time well spent since you will likely finding
>> yourself reinventing the functionality it provides anyway.
>>
>> For instance:
>>
>> R> library(GenomicFeatures)
>> R> bee.txdb<- makeTranscriptDbFromUCSC(genome="apiMel2",
>> tablename="ensGene")
>> R> introns<- intronsByTranscript(bee.txdb, use.names=TRUE)
>> R> head(introns)
>> GRangesList of length 6:
>> $ENSAPMT00000020635
>> GRanges with 1 range and 0 elementMetadata values:
>> seqnames ranges strand
>> <Rle> <IRanges> <Rle>
>> [1] Group1 [745, 821] +
>>
>> $ENSAPMT00000003962
>> GRanges with 2 ranges and 0 elementMetadata values:
>> seqnames ranges strand
>> [1] Group1 [2946, 3009] -
>> [2] Group1 [3155, 3331] -
>>
>> And to get the intronic sequence from "ENSAPMT00000020635" you can do:
>>
>> R> library(BSgenome.Amellifera.UCSC.apiMel2)
>> R> getSeq(Amellifera, introns[[1]])
>>
>> You can even get the sequences of all introns at once:
>>
>> R> all.introns<- unlist(introns)
>> R> all.seqs<- getSeq(Amellifera, all.introns)
>>
>> HTH,
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list