[BioC] extract introns

Wed Nov 16 04:42:15 CET 2011

Hi,

On Tue, Nov 15, 2011 at 5:57 PM, Yating Cheng <yating.cheng at charite.de> wrote:
> Hi Steve,
>
> I am working with Apis mellifera.
>
> Actually I am also trying to use the alignment function to get the intron
> sequence.which is much easier for me to understand.  But there are some
> problem with that also.

I'm not sure why you want to align sequences to find them on the
chromosome, when you already know where they are on the chromosome?

> rm(list=ls(all=TRUE))
>
> # set working directory
> setwd("F:/Yating/introns")
>
> # load Biostrings library
> library(Biostrings)
>
> # load data
> genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE)
> exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE)
>
> # create object for storage of extracted introns
> introns<-data.frame()
>
> # loop over all genes
> for(i in 1:length(genebody[,1])){
>
>        tmp.id<-as.vector(genebody[i,1])                                                                                        # get gene id
>        tmp.subject<-as.vector(genebody[i,2])                                                                           # get gene sequence
>        tmp.exons<-exons[which(exons[,1]==tmp.id),]                                                                     # get exons of the
> selected genes
>
>        tmp.pattern<-as.vector(tmp.exons[,3])                                                                           # define exons as patterns
> for alignment
>        tmp.align<-pairwiseAlignment(pattern=tmp.pattern,
> subject=tmp.subject,type="local")                       # align all exons pairwise to gene
> sequence
>        tmp.start<-start(subject(tmp.align))                                                                            # vector of all alignment
> starts
>        tmp.end<-end(subject(tmp.align))                                                                                        # vector of all alignment ends
>
>
>
>        for(ex in 1:(length(tmp.end)-1)){                                                                                       # extract introns
>
>                tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1)
>                introns<-rbind(introns,cbind(tmp.id,tmp.end[ex]+1,tmp.start[ex+1]-1,tmp.intron))
>
>                }
>
>                }
>
>        }
>
> # define useful names for columns
> colnames(introns)<-c("gene.id","pos.start","pos.end","intron")
>
> # write output
> write.table(introns,"Introns.txt", sep="\t", quote=FALSE, row.names=FALSE)
>
> The problem is the unvalid arguments in the substr....

Well I can't tell for sure as you haven't told us what errors you are
getting, but the names of your variables suggest that you are using
substr with a number in its 2nd arg that is larger than the number in
its 3rd arg.

I still think your life will be easier if you use GenomicFeatures to
download your gene annotations and then play with them that way. You
will have to invest sometime to learn a bit more about the
functionality provided by the IRanges, GRanges, GenomicFeatures
packages, but it will be time well spent since you will likely finding
yourself reinventing the functionality it provides anyway.

For instance:

R> library(GenomicFeatures)
R> bee.txdb <- makeTranscriptDbFromUCSC(genome="apiMel2", tablename="ensGene")
R> introns <- intronsByTranscript(bee.txdb, use.names=TRUE)
R> head(introns)
GRangesList of length 6:
$ENSAPMT00000020635
GRanges with 1 range and 0 elementMetadata values:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]   Group1 [745, 821]      +

$ENSAPMT00000003962
GRanges with 2 ranges and 0 elementMetadata values:
      seqnames       ranges strand
  [1]   Group1 [2946, 3009]      -
  [2]   Group1 [3155, 3331]      -

And to get the intronic sequence from "ENSAPMT00000020635" you can do:

R> library(BSgenome.Amellifera.UCSC.apiMel2)
R> getSeq(Amellifera, introns[[1]])

You can even get the sequences of all introns at once:

R> all.introns <- unlist(introns)
R> all.seqs <- getSeq(Amellifera, all.introns)

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact