[BioC] extract introns

Yating Cheng yating.cheng at charite.de
Wed Nov 16 18:26:14 CET 2011


Hi, It is lucky that I finally I got the sequences. But it is strange that
I got two different results with two scripts. And another problem is that
I can only get the sequences but how can I match them to the geneID.
Because I am not sure whether these sequences are from the same gene. I
have to get the intron sequences from a whole gene.

The first one I tried

R> bee.txdb <- makeTranscriptDbFromUCSC(genome="apiMel2",
 tablename="ensGene")
R> introns <- intronsByTranscript(bee.txdb, use.names=TRUE)
 R> library(BSgenome.Amellifera.UCSC.apiMel2)
 R> all.introns <- unlist(introns)
R> all.seqs <- getSeq(Amellifera, all.introns)

With this script I got 148077 sequeces

The second one was
library(GenomicFeatures)


txdb <- makeTranscriptDbFromUCSC("apiMel2", "genscan")
introns<-intronsByTranscript(txdb)
introns <- unlist(introns)

intron.seqs <-(getSeq(Amellifera,introns,as.character=FALSE))

intron_seqs<as.character(intron.seqs)

With this script I got 80053 sequences.

I think the first one makes sense, They should be the seperated introns,
but I need the attached gene Names, so that I can do further calculation.

Regarding the second one, I think I may did something wrong, but I could
not figure it out.

Thanks

Yating


> Hi,
>
> On Tue, Nov 15, 2011 at 5:57 PM, Yating Cheng <yating.cheng at charite.de>
> wrote:
>> Hi Steve,
>>
>> I am working with Apis mellifera.
>>
>> Actually I am also trying to use the alignment function to get the
>> intron
>> sequence.which is much easier for me to understand.  But there are some
>> problem with that also.
>
> I'm not sure why you want to align sequences to find them on the
> chromosome, when you already know where they are on the chromosome?
>
>> rm(list=ls(all=TRUE))
>>
>> # set working directory
>> setwd("F:/Yating/introns")
>>
>> # load Biostrings library
>> library(Biostrings)
>>
>> # load data
>> genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE)
>> exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE)
>>
>> # create object for storage of extracted introns
>> introns<-data.frame()
>>
>> # loop over all genes
>> for(i in 1:length(genebody[,1])){
>>
>>        tmp.id<-as.vector(genebody[i,1])                                
>>                                                        # get gene id
>>        tmp.subject<-as.vector(genebody[i,2])                            
>>                                               # get gene sequence
>>        tmp.exons<-exons[which(exons[,1]==tmp.id),]                      
>>                                               # get exons of the
>> selected genes
>>
>>        tmp.pattern<-as.vector(tmp.exons[,3])                            
>>                                               # define exons as patterns
>> for alignment
>>        tmp.align<-pairwiseAlignment(pattern=tmp.pattern,
>> subject=tmp.subject,type="local")                       # align all
>> exons pairwise to gene
>> sequence
>>        tmp.start<-start(subject(tmp.align))                            
>>                                                # vector of all alignment
>> starts
>>        tmp.end<-end(subject(tmp.align))                                
>>                                                        # vector of all
>> alignment ends
>>
>>
>>
>>        for(ex in 1:(length(tmp.end)-1)){                                
>>                                                       # extract introns
>>
>>              
>>  tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1)
>>              
>>  introns<-rbind(introns,cbind(tmp.id,tmp.end[ex]+1,tmp.start[ex+1]-1,tmp.intron))
>>
>>                }
>>
>>                }
>>
>>        }
>>
>> # define useful names for columns
>> colnames(introns)<-c("gene.id","pos.start","pos.end","intron")
>>
>> # write output
>> write.table(introns,"Introns.txt", sep="\t", quote=FALSE,
>> row.names=FALSE)
>>
>> The problem is the unvalid arguments in the substr....
>
> Well I can't tell for sure as you haven't told us what errors you are
> getting, but the names of your variables suggest that you are using
> substr with a number in its 2nd arg that is larger than the number in
> its 3rd arg.
>
> I still think your life will be easier if you use GenomicFeatures to
> download your gene annotations and then play with them that way. You
> will have to invest sometime to learn a bit more about the
> functionality provided by the IRanges, GRanges, GenomicFeatures
> packages, but it will be time well spent since you will likely finding
> yourself reinventing the functionality it provides anyway.
>
> For instance:
>
> R> library(GenomicFeatures)
> R> bee.txdb <- makeTranscriptDbFromUCSC(genome="apiMel2",
> tablename="ensGene")
> R> introns <- intronsByTranscript(bee.txdb, use.names=TRUE)
> R> head(introns)
> GRangesList of length 6:
> $ENSAPMT00000020635
> GRanges with 1 range and 0 elementMetadata values:
>       seqnames     ranges strand
>          <Rle>  <IRanges>  <Rle>
>   [1]   Group1 [745, 821]      +
>
> $ENSAPMT00000003962
> GRanges with 2 ranges and 0 elementMetadata values:
>       seqnames       ranges strand
>   [1]   Group1 [2946, 3009]      -
>   [2]   Group1 [3155, 3331]      -
>
> And to get the intronic sequence from "ENSAPMT00000020635" you can do:
>
> R> library(BSgenome.Amellifera.UCSC.apiMel2)
> R> getSeq(Amellifera, introns[[1]])
>
> You can even get the sequences of all introns at once:
>
> R> all.introns <- unlist(introns)
> R> all.seqs <- getSeq(Amellifera, all.introns)
>
> HTH,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>



More information about the Bioconductor mailing list