[BioC] how to use rtracklayer to retrieve sequence of list of coordinates
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Tue Mar 2 15:27:54 CET 2010
One thing you want to be careful with when you do stuff like this, is
that the annotation matches up _exactly_ with the genome you are
using. This really bit me in S. cerevisiae. Essentially even a small
indel will make everything out of phase. And you really need to watch
out for different versions of the genome. Sometimes extra stuff gets
resolved at the end of the chromosomes.
In this case it seems you are using UCSC annotation and BSgenome so it
should work. But I urge you to double check. This is something that
can be very hard to track down.
Kasper
On Tue, Mar 2, 2010 at 9:21 AM, sabrina s <sabrina.shao at gmail.com> wrote:
> Hi, Michael:
>
> See bellow:
>
>
> Error in IRanges:::.normargSEW(start, "start") :
> 'start' must be a vector of integers
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.16 BSgenome_1.14.2
> [3] Biostrings_2.14.5 rtracklayer_1.6.0
> [5] RCurl_1.3-1 bitops_1.0-4.1
> [7] IRanges_1.4.10
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.0 XML_2.6-0
>
>
> Thanks!
>
> Sabrina
>
>
> On Tue, Mar 2, 2010 at 8:49 AM, Michael Lawrence
> <lawrence.michael at gene.com>wrote:
>
>> Could you paste your sessionInfo() output please?
>>
>>
>> On Mon, Mar 1, 2010 at 7:37 PM, sabrina s <sabrina.shao at gmail.com> wrote:
>>
>>> Hi, Michael:
>>>
>>> I tried as you suggested, but I had the following error:
>>>
>>> Error in IRanges:::.normargSEW(start, "start") :
>>> 'start' must be a vector of integers
>>>
>>> my testTrack looks like this:
>>>
>>> RangedData with 76 rows and...
>>> space ranges | strand
>>> <character> <IRanges> | <factor>
>>> 1 chr1 [137159091, 137159290] | -
>>> 2 chr1 [155302742, 155302941] | -
>>> etc
>>>
>>> Any suggestions? Thanks!
>>>
>>> Sabrina
>>>
>>>
>>> On Mon, Mar 1, 2010 at 6:02 PM, Michael Lawrence <
>>> lawrence.michael at gene.com> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Mar 1, 2010 at 1:02 PM, sabrina s <sabrina.shao at gmail.com>wrote:
>>>>
>>>>> Hi, Steve:
>>>>> I am trying to get it worked without separating the chromosomes. I used
>>>>> the
>>>>> following code:
>>>>> seqQuery<-data.frame(
>>>>> chr=currCoorList$chr,
>>>>> start=currCoorList$starts,
>>>>> end=currCoorList$ends,
>>>>> strand=currCoorList$strand)
>>>>> mySeq<- getSeq(Mmusculus, seqQuery$chr,
>>>>> start=seqQuery$start, end=seqQuery$end,
>>>>> strand=seqQuery$strand )
>>>>>
>>>>>
>>>>> But then I got an error:
>>>>> Error in reverseComplement(DNAStringSet(seqs[ii])) :
>>>>> could not find function "copy"
>>>>>
>>>>
>>>> Did I miss some depending package?
>>>>>
>>>>>
>>>> Don't know what's up with that, but you don't need to use a data.frame
>>>> there. You already had a RangedData above, just pass that to getSeq, i.e.
>>>>
>>>> getSeq(Mmusculus, testTrack)
>>>>
>>>> Sabrina
>>>>>
>>>>> On Mon, Mar 1, 2010 at 2:31 PM, Steve Lianoglou <
>>>>> mailinglist.honeypot at gmail.com> wrote:
>>>>>
>>>>> > Hi Sabrina,
>>>>> >
>>>>> > On Mon, Mar 1, 2010 at 2:04 PM, sabrina s <sabrina.shao at gmail.com>
>>>>> wrote:
>>>>> > > Hi,all:
>>>>> > > I have a long list of coordinates, which was created by the
>>>>> following
>>>>> > code:
>>>>> > > testIranges<- IRanges(start=currCoorList $starts,
>>>>> > > end=currCoorLis$ends)
>>>>> > >
>>>>> > > testTrack<-GenomicData(testIranges,chrom=currCoorList$chr,
>>>>> > > strand=currCoorList$strand,genome="mm9")
>>>>> > >
>>>>> > > I could then export the testTrack into .bed file and upload in the
>>>>> UCSC
>>>>> > > genome browser and get the sequences of the listed coordinate
>>>>> > > But I was thinking if there is a function from rtracklayer to
>>>>> retrieve
>>>>> > > sequence directly.
>>>>> >
>>>>> > If you're trying to get coordinates from some reference genome, you
>>>>> > can use the BSgenome.*.* packages and skip rtracklayer + internet
>>>>> > queries entirely.
>>>>> >
>>>>> > For instance, say I have an IRanges object `r1` which is a set of
>>>>> > intervals on chromosome 1 from hg18 that I want sequence information
>>>>> > for, I could do it like this:
>>>>> >
>>>>> > R> r1
>>>>> > IRanges of length 589
>>>>> > start end width
>>>>> > [1] 18016124 18016155 32
>>>>> > [2] 18020749 18020780 32
>>>>> > [3] 18024599 18024630 32
>>>>> > [4] 18024830 18024861 32
>>>>> > [5] 18051985 18052016 32
>>>>> > [6] 18064760 18064791 32
>>>>> > [7] 18088145 18088176 32
>>>>> > [8] 18088200 18088231 32
>>>>> > [9] 18110085 18110116 32
>>>>> > ...
>>>>> >
>>>>> > R> library(BSgenome.Hsapiens.UCSC.hg18)
>>>>> > R> chr1 <- unmasked(HSapiens$chr1)
>>>>> > R> myseqs <- Views(chr1, start=start(r1), end=end(r1))
>>>>> > R> myseqs
>>>>> > Views on a 247249719-letter DNAString subject
>>>>> > subject:
>>>>> >
>>>>> TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>>>>> > views:
>>>>> > start end width
>>>>> > [1] 18016124 18016155 32 [GTTTTTTGTTTTGTTTTGTTTTTTGTTTTTTA]
>>>>> > [2] 18020749 18020780 32 [CGCCTAGTCTGGAGTGCAGTGGCACAATCTTG]
>>>>> > [3] 18024599 18024630 32 [TCTCACCTCTTCACCGAAGTGGTCTTCTGAAC]
>>>>> > [4] 18024830 18024861 32 [GAATTCTCCCTCGTTGCAGATCCTGTTTGAGT]
>>>>> > [5] 18051985 18052016 32 [TTATTTTTTTTGAGGAAGAGTCTTGCTCTGTC]
>>>>> > [6] 18064760 18064791 32 [GTGGGAGGATCGCTTGAGCCCTGGAGGTTGGG]
>>>>> > [7] 18088145 18088176 32 [ATCATGGTGGGAGATGAAAGGCACTTCTTACA]
>>>>> > [8] 18088200 18088231 32 [GAAGAAGCAAAAGCGGAAACCCCTGCTAAACC]
>>>>> > [9] 18110085 18110116 32 [GGATCATGAGTTCAAGAGTTCGAGACCAGCCT]
>>>>> > [10] 18118454 18118485 32 [TGTTGCCTAGGCTGGTCTCAAACTCCTGCCCT]
>>>>> > ...
>>>>> >
>>>>> > calling "as.character" on the myseqs object will give you the sequence
>>>>> > as a character vector.
>>>>> >
>>>>> > Does that get you what you're after?
>>>>> > -steve
>>>>> >
>>>>> > --
>>>>> > Steve Lianoglou
>>>>> > Graduate Student: Computational Systems Biology
>>>>> > | Memorial Sloan-Kettering Cancer Center
>>>>> > | Weill Medical College of Cornell University
>>>>> > Contact Info: http://cbio.mskcc.org/~lianos/contact<http://cbio.mskcc.org/%7Elianos/contact>
>>>>> <http://cbio.mskcc.org/%7Elianos/contact>
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sabrina
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Sabrina
>>>
>>
>>
>
>
> --
> Sabrina
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list