[BioC] extracting sequence from a genome
Iain Gallagher
iaingallagher at btopenworld.com
Thu Mar 15 19:16:39 CET 2012
Hi Gentlemen
Thanks for taking the time. Here's where I am just now:
library(BSgenome.Rnorvegicus.UCSC.rn4)# get genome
library(GenomicRanges)
fl <- "ftp://mirbase.org/pub/mirbase/CURRENT/genomes/rno.gff" # get miR coords
gff <- read.table(fl) # as dataframe
names <- gff[,10]
nms <- gsub(";", "", gsub("\"", "", gsub("ID=\"", "", names))) # a set of nested gsub with regex to leave only the text in the double quotes
gr <- GRanges(seqnames = Rle(paste('chr', gff[,1], sep='')), ranges = IRanges(gff[,4], end = gff[,5], names = nms), strand = Rle(gff[,7]))
seqs <- getSeq(Rnorvegicus, flank(gr, 200))
names(seqs) <- nms
It's much lighter on it's feet than a loop and a nice intro to the GenomicRanges package for me.
As a follow-up question I'm going to write out the seqs object as fasta and use it in clover for TFBS analysis.
I was wondering whether the strand is taken into account when I get the flanking sequence i.e. is the sequence returned from the + or - strand as defined in the GRanges object?
I only ask this because presumably that will affect my TFBS analysis and if so I might want to reverse / complement all those sequences that are upstream of miRs transcribed from the - strand.
Best
iain
________________________________
From: "Tim Triche, Jr." <tim.triche at gmail.com>
To: Steve Lianoglou <mailinglist.honeypot at gmail.com>
Cc: Iain Gallagher <iaingallagher at btopenworld.com>; bioconductor <bioconductor at stat.math.ethz.ch>
Sent: Thursday, 15 March 2012, 17:44
Subject: Re: [BioC] extracting sequence from a genome
Yeah I just realized that when I pasted it in to execute it. Trying to find where exactly I did this in the past (because it worked back then)
On Thu, Mar 15, 2012 at 10:42 AM, Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:
Howdy,
>
>
>On Thu, Mar 15, 2012 at 1:10 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>> what you want to do is create a Ranges object and then get a View of the
>> sequences.
>>
>>
>> library(Biostrings)
>> library(GenomicRanges)
>> mirs <- GRanges( paste('chr', mirInfo[i,1], sep=''),
>> IRanges(start=mirInfo[i,4], width=1),
>> strand=whereverYouKeepTheStrand )
>> upstream <- flank(mirs, 200)
>> upseqs <- Views(Rnorvegicus, mirs)
>> show(upseqs)
>>
>> Untested because I am lazy as hell, but it should work, and be a LOT faster.
>
>Maybe in R 2.16 ...
>
>R> Views(Hsapiens, GRanges(c('chr1', 'chr2'), IRanges(c(6e6, 6e6),
>width=10), '+'))
>Error in function (classes, fdef, mtable) :
> unable to find an inherited method for function "Views", for
>signature "BSgenome"
>
>;-)
>
>-steve
>
>--
>Steve Lianoglou
>Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
>Contact Info: http://cbio.mskcc.org/~lianos/contact
>
--
A model is a lie that helps you see the truth.
Howard Skipper
More information about the Bioconductor
mailing list