[BioC] Elephant shark genome

Martin Morgan mtmorgan at fhcrc.org
Wed Apr 23 14:22:22 CEST 2014


On 04/23/2014 03:06 AM, Miguel [guest] wrote:
>
> Will the genome of the elephant shark (Callorhinchus milii), a model cartilaginous fish, be made available as Biostrings objects, like other model genomes?
>

Hi Miguel -- I'm not sure what your use case is; can you provide some context?

If you're just looking for fast access to the sequence data using Bioconductor 
tools you could

   1. Download the FASTA files, perhaps these (are they what you're looking for?):

        wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_other/Callorhinchus_milii/Callorhinchus_milii-6.1.3/Primary_Assembly/unplaced_scaffolds/FASTA/unplaced.scaf.fa.gz

   2. In R, re-compress and index the file:

        library(Rsamtools)
        fa = razip("unplaced.scaf.fa.gz")  # writes unplaced.scaf.fa.rz, returns its path
        indexFa(fa)                        # creates the .fai index next to it

   3. Use the indexed file, e.g.,

 > fafile = FaFile("unplaced.scaf.fa.rz")
 > seqinfo(fafile)
Seqinfo of length 21203
seqnames                        seqlengths isCircular genome
gi|564982704|gb|KI635855.1|       18507834       <NA>   <NA>
gi|564982701|gb|KI635856.1|       17031706       <NA>   <NA>
gi|564982698|gb|KI635857.1|       16461339       <NA>   <NA>
gi|564982691|gb|KI635858.1|       16433419       <NA>   <NA>
gi|564982688|gb|KI635859.1|       15003573       <NA>   <NA>
...                                    ...        ...    ...
gi|564405817|gb|AAVX02067416.1|        247       <NA>   <NA>
gi|564405816|gb|AAVX02067417.1|        234       <NA>   <NA>
gi|564405815|gb|AAVX02067418.1|        218       <NA>   <NA>
gi|564405814|gb|AAVX02067419.1|        173       <NA>   <NA>
gi|564405813|gb|AAVX02067420.1|         66       <NA>   <NA>
 > idx = c("gi|564982701|gb|KI635856.1|", "gi|564982688|gb|KI635859.1|")
 > which = as(seqinfo(fafile)[idx], "GRanges")
 > getSeq(fafile, which)
   A DNAStringSet instance of length 2
        width seq                                            names
[1] 17031706 TATAACTGGAGTGTATGTATAC...TGTACCGCCCGGGGTGGTGCG gi|564982701|gb|K...
[2] 15003573 AGAGAGAGATAGAGAGAGACAG...TTGATCATGTCAACCCCCCCA gi|564982688|gb|K...
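
If typing the full scaffold names by hand gets tedious, something along these lines might help. It is only a base-R sketch: 'lens' is a toy stand-in for what seqlengths(seqinfo(fafile)) would return (names and lengths copied from the output above), the 1 Mb cutoff is arbitrary, and accession() is a hypothetical helper, not part of Rsamtools.

```r
# Toy stand-in for seqlengths(seqinfo(fafile)): a named integer vector.
# Names and lengths are copied from the seqinfo() output shown above.
lens <- c(
    "gi|564982704|gb|KI635855.1|"     = 18507834L,
    "gi|564982701|gb|KI635856.1|"     = 17031706L,
    "gi|564982688|gb|KI635859.1|"     = 15003573L,
    "gi|564405814|gb|AAVX02067419.1|" = 173L,
    "gi|564405813|gb|AAVX02067420.1|" = 66L
)

# Keep scaffolds of at least 'min_len' bases (arbitrary cutoff, an assumption).
min_len <- 1e6
idx <- names(lens)[lens >= min_len]

# Hypothetical helper: return the accession, i.e. the 4th '|'-delimited
# field of identifiers such as "gi|564982701|gb|KI635856.1|".
accession <- function(ids) {
    fields <- strsplit(ids, "|", fixed = TRUE)
    vapply(fields, function(x) x[4], character(1))
}

# Map short accessions back to the full names used in the FASTA index.
lookup <- setNames(idx, accession(idx))
lookup[["KI635859.1"]]
```

With the real file, 'idx' (or an element of 'lookup') could be used in place of the hand-typed 'idx' in step 3.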

There is also the vignette on creating a BSgenome package:

http://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf

Martin

> Thank you
>
>   -- output of sessionInfo():
>
> R version 2.14.1 (2011-12-22)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
>   [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8
>   [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> --
> Sent via the guest posting facility at bioconductor.org.


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


