[BioC] Elephant shark genome
Martin Morgan
mtmorgan at fhcrc.org
Wed Apr 23 14:22:22 CEST 2014
On 04/23/2014 03:06 AM, Miguel [guest] wrote:
>
> Will the genome of the elephant shark (Callorhinchus milii), a model cartilaginous fish, be made available as Biostrings objects, like other model genomes?
>
Hi Miguel -- not sure what your use case is; can you provide some context?
If you're just looking for fast access to the sequence data using Bioconductor
tools, you could:
1. Download the FASTA files (are these the ones you're looking for?)
wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_other/Callorhinchus_milii/Callorhinchus_milii-6.1.3/Primary_Assembly/unplaced_scaffolds/FASTA/unplaced.scaf.fa.gz
2. In R, re-compress and index the file
library(Rsamtools)
fa = razip("unplaced.scaf.fa.gz")
indexFa(fa)
3. Use, e.g.,
> fafile = FaFile("unplaced.scaf.fa.rz")
> seqinfo(fafile)
Seqinfo of length 21203
seqnames                        seqlengths isCircular genome
gi|564982704|gb|KI635855.1|       18507834       <NA>   <NA>
gi|564982701|gb|KI635856.1|       17031706       <NA>   <NA>
gi|564982698|gb|KI635857.1|       16461339       <NA>   <NA>
gi|564982691|gb|KI635858.1|       16433419       <NA>   <NA>
gi|564982688|gb|KI635859.1|       15003573       <NA>   <NA>
...                                    ...        ...    ...
gi|564405817|gb|AAVX02067416.1|        247       <NA>   <NA>
gi|564405816|gb|AAVX02067417.1|        234       <NA>   <NA>
gi|564405815|gb|AAVX02067418.1|        218       <NA>   <NA>
gi|564405814|gb|AAVX02067419.1|        173       <NA>   <NA>
gi|564405813|gb|AAVX02067420.1|         66       <NA>   <NA>
> idx = c("gi|564982701|gb|KI635856.1|", "gi|564982688|gb|KI635859.1|")
> which = as(seqinfo(fafile)[idx], "GRanges")
> getSeq(fafile, which)
  A DNAStringSet instance of length 2
       width seq                                            names
[1] 17031706 TATAACTGGAGTGTATGTATAC...TGTACCGCCCGGGGTGGTGCG gi|564982701|gb|K...
[2] 15003573 AGAGAGAGATAGAGAGAGACAG...TTGATCATGTCAACCCCCCCA gi|564982688|gb|K...
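
Building on step 3, here is a minimal sketch (not part of the original reply) of two things that tend to come up next: shortening the long NCBI-style names to the bare accession, and retrieving only part of a scaffold instead of the whole sequence. The helper name accession() and the 1-1000 region are hypothetical, chosen only for illustration. The Rsamtools part is guarded so that it runs only when the package and the indexed file from step 2 are present.

```r
# Hypothetical helper: pull the accession (4th '|'-separated field)
# out of names such as "gi|564982701|gb|KI635856.1|". Base R only.
accession <- function(x) {
    parts <- strsplit(x, "|", fixed = TRUE)
    vapply(parts, `[`, character(1), 4)
}

# Run the retrieval only if Rsamtools is installed and the re-compressed,
# indexed file from step 2 exists in the working directory.
if (requireNamespace("Rsamtools", quietly = TRUE) &&
    file.exists("unplaced.scaf.fa.rz")) {
    library(Rsamtools)   # also attaches GenomicRanges and Biostrings

    fafile <- FaFile("unplaced.scaf.fa.rz")

    # Illustrative region: the first 1000 bases of one scaffold
    roi <- GRanges("gi|564982701|gb|KI635856.1|",
                   IRanges(start = 1, end = 1000))
    dna <- getSeq(fafile, roi)

    # Replace the long names with the short accession
    names(dna) <- accession(names(dna))
    print(dna)
}
```

Passing a GRanges with explicit start and end means only the requested region is read from the indexed file. If razip() is not available in your Rsamtools version, bgzip() is, as far as I know, the corresponding function; the compressed file would then have a different name from the one used here.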
There is also the BSgenomeForge vignette on creating a BSgenome data package:
http://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf
Martin
> Thank you
>
> -- output of sessionInfo():
>
> R version 2.14.1 (2011-12-22)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
> [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
> [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793