[BioC] extract introns
Yating Cheng
yating.cheng at charite.de
Wed Nov 16 18:26:14 CET 2011
Hi, It is lucky that I finally I got the sequences. But it is strange that
I got two different results with two scripts. And another problem is that
I can only get the sequences but how can I match them to the geneID.
Because I am not sure whether these sequences are from the same gene. I
have to get the intron sequences from a whole gene.
The first one I tried
R> bee.txdb <- makeTranscriptDbFromUCSC(genome="apiMel2",
tablename="ensGene")
R> introns <- intronsByTranscript(bee.txdb, use.names=TRUE)
R> library(BSgenome.Amellifera.UCSC.apiMel2)
R> all.introns <- unlist(introns)
R> all.seqs <- getSeq(Amellifera, all.introns)
With this script I got 148077 sequeces
The second one was
library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC("apiMel2", "genscan")
introns<-intronsByTranscript(txdb)
introns <- unlist(introns)
intron.seqs <-(getSeq(Amellifera,introns,as.character=FALSE))
intron_seqs<as.character(intron.seqs)
With this script I got 80053 sequences.
I think the first one makes sense, They should be the seperated introns,
but I need the attached gene Names, so that I can do further calculation.
Regarding the second one, I think I may did something wrong, but I could
not figure it out.
Thanks
Yating
> Hi,
>
> On Tue, Nov 15, 2011 at 5:57 PM, Yating Cheng <yating.cheng at charite.de>
> wrote:
>> Hi Steve,
>>
>> I am working with Apis mellifera.
>>
>> Actually I am also trying to use the alignment function to get the
>> intron
>> sequence.which is much easier for me to understand. But there are some
>> problem with that also.
>
> I'm not sure why you want to align sequences to find them on the
> chromosome, when you already know where they are on the chromosome?
>
>> rm(list=ls(all=TRUE))
>>
>> # set working directory
>> setwd("F:/Yating/introns")
>>
>> # load Biostrings library
>> library(Biostrings)
>>
>> # load data
>> genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE)
>> exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE)
>>
>> # create object for storage of extracted introns
>> introns<-data.frame()
>>
>> # loop over all genes
>> for(i in 1:length(genebody[,1])){
>>
>> tmp.id<-as.vector(genebody[i,1])
>> # get gene id
>> tmp.subject<-as.vector(genebody[i,2])
>> # get gene sequence
>> tmp.exons<-exons[which(exons[,1]==tmp.id),]
>> # get exons of the
>> selected genes
>>
>> tmp.pattern<-as.vector(tmp.exons[,3])
>> # define exons as patterns
>> for alignment
>> tmp.align<-pairwiseAlignment(pattern=tmp.pattern,
>> subject=tmp.subject,type="local") # align all
>> exons pairwise to gene
>> sequence
>> tmp.start<-start(subject(tmp.align))
>> # vector of all alignment
>> starts
>> tmp.end<-end(subject(tmp.align))
>> # vector of all
>> alignment ends
>>
>>
>>
>> for(ex in 1:(length(tmp.end)-1)){
>> # extract introns
>>
>>
>> tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1)
>>
>> introns<-rbind(introns,cbind(tmp.id,tmp.end[ex]+1,tmp.start[ex+1]-1,tmp.intron))
>>
>> }
>>
>> }
>>
>> }
>>
>> # define useful names for columns
>> colnames(introns)<-c("gene.id","pos.start","pos.end","intron")
>>
>> # write output
>> write.table(introns,"Introns.txt", sep="\t", quote=FALSE,
>> row.names=FALSE)
>>
>> The problem is the unvalid arguments in the substr....
>
> Well I can't tell for sure as you haven't told us what errors you are
> getting, but the names of your variables suggest that you are using
> substr with a number in its 2nd arg that is larger than the number in
> its 3rd arg.
>
> I still think your life will be easier if you use GenomicFeatures to
> download your gene annotations and then play with them that way. You
> will have to invest sometime to learn a bit more about the
> functionality provided by the IRanges, GRanges, GenomicFeatures
> packages, but it will be time well spent since you will likely finding
> yourself reinventing the functionality it provides anyway.
>
> For instance:
>
> R> library(GenomicFeatures)
> R> bee.txdb <- makeTranscriptDbFromUCSC(genome="apiMel2",
> tablename="ensGene")
> R> introns <- intronsByTranscript(bee.txdb, use.names=TRUE)
> R> head(introns)
> GRangesList of length 6:
> $ENSAPMT00000020635
> GRanges with 1 range and 0 elementMetadata values:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] Group1 [745, 821] +
>
> $ENSAPMT00000003962
> GRanges with 2 ranges and 0 elementMetadata values:
> seqnames ranges strand
> [1] Group1 [2946, 3009] -
> [2] Group1 [3155, 3331] -
>
> And to get the intronic sequence from "ENSAPMT00000020635" you can do:
>
> R> library(BSgenome.Amellifera.UCSC.apiMel2)
> R> getSeq(Amellifera, introns[[1]])
>
> You can even get the sequences of all introns at once:
>
> R> all.introns <- unlist(introns)
> R> all.seqs <- getSeq(Amellifera, all.introns)
>
> HTH,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
More information about the Bioconductor
mailing list