[BioC] forge BSgenome data package

Hervé Pagès hpages at fhcrc.org
Wed Jan 19 07:24:54 CET 2011


Hi Steve,

There are several things that look wrong with your seed file.

1. The # must be the first character of a line for that line to be
    treated as a comment (and ignored).
    For example, these 2 lines will certainly not be interpreted
    as you might expect:

      mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="")

      source_url: NA # http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

2. If you don't have multiple sequences, specify:

      mseqnames: character(0)

3. As explained in the BSgenomeForge vignette, the default for
    seqfiles_suffix is .fa so this won't work:

      seqnames: "clof_v3.fa"

    Unless your file is named clof_v3.fa.fa?
    If your file is named clof_v3.fa, then you should specify:

      seqnames: "clof_v3"

4. This doesn't look like a valid file of assembly gaps:

      AGAPSfiles_name: clof_v3.fa.masked

    Please refer to the BSgenomeForge vignette for what kind of masks
    and what file formats are supported.

5. Having this

      PkgExamples: Hsapiens
             seqlengths(Hsapiens)
             Hsapiens$chr1  # same as Hsapiens[["chr1"]]

    in a seed file for clof obviously doesn't make sense and your
    package won't pass 'R CMD check' because it will contain broken
    examples.
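
To summarize points 1, 2, 3 and 5, the affected fields could look
something like the fragment below. This is only a sketch: it assumes
your sequence file really is named clof_v3.fa, that you have no
multiple sequences, and that the object is called Clof (as in your
BSgenomeObjname field). The mask-related fields are left out on
purpose because they depend on which mask files you actually have.

```
# Comments go on their own line, with the # as the first character.
# The hg19 URL is not the source of this genome, so it is not repeated here.
source_url: NA
seqnames: "clof_v3"
mseqnames: character(0)
PkgExamples: Clof
	seqlengths(Clof)
```

Note that the continuation line under PkgExamples starts with
whitespace, so it is read as part of that field.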

There might be other problems with your seed file. All you need to do
is carefully read and follow the instructions in the BSgenomeForge
vignette. Let me know if things are not clear in the vignette and
I'll try to improve it. Thanks!
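
If it helps, here is a small check for the problem described in
point 1. It is a rough sketch in plain R, not something provided by
BSgenome: find_inline_hash() is a hypothetical helper name, and the
seed lines below are just sample data. It flags lines where a #
appears somewhere other than the first character, so you can review
them by hand.

```r
# Hypothetical helper (not part of BSgenome): return the indices of
# seed-file lines that contain a '#' but do not start with one.
find_inline_hash <- function(seed_lines) {
  is_comment <- grepl("^#", seed_lines)            # real comment lines
  has_hash <- grepl("#", seed_lines, fixed = TRUE) # any '#' in the line
  which(!is_comment & has_hash)
}

# Sample data for illustration only.
seed_lines <- c(
  "# a real comment line",
  "mseqnames: NA #paste(\"upstream\", c(\"1000\", \"2000\", \"5000\"), sep=\"\")",
  "seqnames: \"clof_v3\""
)

flagged <- find_inline_hash(seed_lines)
print(seed_lines[flagged])
```

On a real seed file you would pass readLines("Clof_seed.R") instead
of the sample vector. A flagged line is not necessarily wrong (R code
under PkgExamples may legitimately contain a #), so treat the result
as a list of lines to review.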

H.


On 01/18/2011 05:32 PM, Steve Shen wrote:
> Hi list,
>
> This wasn't sent out with a none registered ids. I have a problem with
> forging a new BSgenome data package. The sequence data file is clof.v3.fa
> and the mask file is clof.v3.fa.masked. Below are seed file, command, error
> and sessioninfo. Your help will be much appreciated.
>
> Thanks a lot,
> Steve
>
> Package: BSgenome.Clof.yu.v3
> Title: Clof (insects) full genome (version 3)
> Description: Clof (insects) full genome as provided by YU (V3.3, Jan. 2011)
> # and will store in Biostrings objects.
> Version: 1.0.0
> organism: clof
> species: insect
> provider: YU
> provider_version: Assembly V3.3
> release_date: Jan, 2011
> release_name: insects Genome Reference Consortium
> source_url: NA # http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
> organism_biocview: Clof
> BSgenomeObjname: Clof
> seqnames: "clof_v3.fa"
> mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="")
> nmask_per_seq: 2
> #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip, upstream2000.zip,
> upstream5000.zip
>      #from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
> #SrcDataFiles2: AGAPS masks:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz
>      #RM masks:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz
>      #TRF masks:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz
> PkgExamples: Hsapiens
>      seqlengths(Hsapiens)
>      Hsapiens$chr1  # same as Hsapiens[["chr1"]]
> seqs_srcdir: /home/steve/Data/Genomes
> masks_srcdir: /home/steve/Data/Genomes
> AGAPSfiles_name: clof_v3.fa.masked
>
> The command, error and sessionInfo are below
>
>> forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".", masks_srcdir=".",
> destdir=".", verbose=TRUE)
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
>    Vignettes contain introductory material. To view, type
>    'openVignette()'. To cite Bioconductor, see
>    'citation("Biobase")' and for packages 'citation(pkgname)'.
>
>
> Attaching package: 'Biobase'
>
> The following object(s) are masked from 'package:IRanges':
>
>      updateObject
>
> Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, masks_srcdir =
> masks_srcdir,  :
>    values for symbols NMASKPERSEQ are not single strings
> In addition: Warning message:
> In storage.mode(x$nmask_per_seq)<- "integer" : NAs introduced by coercion
>> sessionInfo()
> R version 2.12.1 (2010-12-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biobase_2.6.1       BSgenome_1.18.2     Biostrings_2.18.0
> [4] GenomicRanges_1.2.1 IRanges_1.8.8
>
> loaded via a namespace (and not attached):
> [1] tools_2.12.1
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


