[BioC] extract introns

Tue Nov 15 23:57:31 CET 2011

Hi Steve,

I am working with Apis mellifera.

Actually I am also trying to use the alignment function to get the intron
sequence.which is much easier for me to understand.  But there are some
problem with that also.

rm(list=ls(all=TRUE))

# set working directory
setwd("F:/Yating/introns")

# load Biostrings library
library(Biostrings)

# load data
genebody<-read.delim("genebody_5.txt",sep="\t",header=FALSE)
exons<-read.delim("exon_5_seperated.txt",sep="\t",header=FALSE)

# create object for storage of extracted introns
introns<-data.frame()

# loop over all genes
for(i in 1:length(genebody[,1])){

	tmp.id<-as.vector(genebody[i,1])											# get gene id
	tmp.subject<-as.vector(genebody[i,2])										# get gene sequence
	tmp.exons<-exons[which(exons[,1]==tmp.id),]									# get exons of the
selected genes

	tmp.pattern<-as.vector(tmp.exons[,3])										# define exons as patterns
for alignment
	tmp.align<-pairwiseAlignment(pattern=tmp.pattern,
subject=tmp.subject,type="local")			# align all exons pairwise to gene
sequence
	tmp.start<-start(subject(tmp.align))										# vector of all alignment
starts
	tmp.end<-end(subject(tmp.align))											# vector of all alignment ends

	for(ex in 1:(length(tmp.end)-1)){											# extract introns

		tmp.intron<-substr(tmp.subject,tmp.end[ex]+1,tmp.start[ex+1]-1)
		introns<-rbind(introns,cbind(tmp.id,tmp.end[ex]+1,tmp.start[ex+1]-1,tmp.intron))

		}

		}

	}

# define useful names for columns
colnames(introns)<-c("gene.id","pos.start","pos.end","intron")

# write output
write.table(introns,"Introns.txt", sep="\t", quote=FALSE, row.names=FALSE)

The problem is the unvalid arguments in the substr....

Best Wishes

Yating

> HI Yating,
>
> On Tue, Nov 15, 2011 at 4:48 PM, Yating Cheng <yating.cheng at charite.de>
> wrote:
>> Hi, steve
>>
>> Thanks for your answer. About converting the names of chromosome, it is
>> kind of difficult.
>>
>> I need the intron sequences for every gene, but the chromosome names of
>> some genes I got from biomart are the same.
>
> Sorry, I'm confused.
>
> What organism are you working with?
>
> Can you construct a TranscriptDb for it using the GenomicFeatures
> package (you'll have to read through the GenomicFeatures vignette).
>
>> I am not sure about how can I get specific genes from BS genome, and
>> match
>> them to gene-IDs.
>
> BSgenome packages do not have gene specific information, they only
> contain sequence information for different organisms.
>
> The fact that you can use a BSgenome.* package for your organism of
> interest suggests that you should be able to build a TranscriptDb for
> it, so that's good news ...
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>