[Bioc-devel] custom genome
Martin Morgan
mtmorgan at fredhutch.org
Sat May 9 19:43:31 CEST 2015
On 05/09/2015 08:36 AM, Vincent Carey wrote:
> did you put this request on the forum? did you follow the steps with
> BSforge and find
> an instruction that could not be followed or led to an error? if you did,
> please put an
> account of that event on the support.bioconductor.org and you will get
> assistance. read
> the posting guide and provide sessionInfo() so that your errors can be
> reproduced.
Vince is right that the question should probably be asked on the support forum,
but for what it's worth I downloaded the file(?) to my current working directory
url = "ftp://ftp.plantgdb.org/pub/Genomes/PpGDB/Ppatens_152.fa.gz"
fl = basename(url)
download.file(url, fl, mode="wb")
It's small enough that it can be read in to memory.
library(Biostrings)
dna = readDNAStringSet(fl)
but it would be convenient to have quick random access to to the sequences, so I
exported it to "2bit' format
library(BSgenome); library(rtracklayer)
twobit = TwoBitFile(sub(".gz", ".2bit", fl)))
export(dna, twobit)
and then I have ready access to the sequences, e.g.,
> seqinfo(twobit)
Seqinfo object with 2106 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
scaffold_1 5387533 <NA> <NA>
scaffold_2 4973254 <NA> <NA>
scaffold_3 4236991 <NA> <NA>
scaffold_4 4046325 <NA> <NA>
scaffold_5 3779828 <NA> <NA>
... ... ... ...
scaffold_11864 1003 <NA> <NA>
scaffold_11874 1002 <NA> <NA>
scaffold_11881 1000 <NA> <NA>
scaffold_11890 1000 <NA> <NA>
scaffold_11893 1000 <NA> <NA>
> query = GRanges("scaffold_109", IRanges(seq(1, 1000, by=100), width=50))
> getSeq(twobit, query)
A DNAStringSet instance of length 10
width seq
[1] 50 GTTACAGATTGGTAAGATCTTGTAGAGGCTCTTTTACTCTGGGCATATAT
[2] 50 AAAAAAATTGATAGTATAAGAGAAGGAAATTCTTTCAAGAGACAAATTTG
[3] 50 TTTTTACAATATTGGAAAAAAGAAAAAAAAAAAAAAAAAATGAGAAAAGA
[4] 50 CATATTTAAAGGGAGTAATTGAAAAACATGTAGTAAATGCTAAGATAGAA
[5] 50 AATATTTGATATATAATTCACTATATAATATAGTGAGAGTTATTTTATTA
[6] 50 GTTTAAAATGGCCATTGTTGAAGTTGATTTTAGGTTGATAATAGTTTTTA
[7] 50 TTATTTCTTTTATTTGATTTTTAGAACATATATTATATAAATCTTATAAA
[8] 50 ATATGGATGAATGGAGATGGATGATGAAGTTCTCCATTGCTATTTCCAAA
[9] 50 TATGATCACTCTTTCAGGGCATTTAAACTTAAGAGACTAAGATCAAGAAG
[10] 50 CTTTATAAAAAAAAATATAGGCATGCTATTGATGCAAAGATATATCAGGA
> query$seq = getSeq(twobit, query)
Hope that's helpful; I might have made different decisions with a larger genome
or on Windows, so if you re-post your question on the support forum (which I
hope you do) it would be helpful to include that information as well as your
intended use.
Martin
>
> On Fri, May 8, 2015 at 7:34 PM, Eric P <pedereri at gmail.com> wrote:
>
>> Hi
>>
>> I have 2-3 genomes that are not listed that I would like to use with
>> bioconductor and I do not really understand how to prepare one properly. I
>> know you have to use BSforge, but after that I need some help. One of the
>> genomes is *Physcomitrella patens*, which is actually being updated at the
>> moment, so I am waiting for version 3.1 to come out, but if someone has
>> time they could add version 1.6 or 3.0.
>> However, the other two genomes I have to do myself as they will not be
>> published for a while so I could just as well try a third. Any help would
>> be appreciated, thanks. Also, if this was already posted I apologize but I
>> could not find enough information on this subject when I searched the
>> forums.
>>
>> --
>> Cheers,
>> Eric
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list