[BioC] forge BSgenome data package

Thu Jan 20 00:43:23 CET 2011

Hi Steve,

On 01/19/2011 01:04 PM, Steve Shen wrote:
> Hi Herve,
>
> Thank you so much for your quick reply. Your vignette is pretty clear. I
> think the problem is on my side. I just lack experience in dealing
> with this issue, and maybe the file format doesn't fit. I sort of gave up
> on the seed file and tried the low-level commands instead. I still get
> errors. I really have no idea what is going on. Your help will be much
> appreciated.
>
> Best,
> Steve
>
> 1. with seed file, I got
>  > forgeBSgenomeDataPkg("./cflo_seed.R")
> Error in as.list(.readSeedFile(x, verbose = verbose)) :
>    error in evaluating the argument 'x' in selecting a method for
> function 'as.list'

I agree that the error message is kind of obscure but the problem
is probably that the specified path to your seed file is invalid.
I'm about to commit a change to the BSgenomeForge code that will
produce this error message instead:

 > forgeBSgenomeDataPkg("aaaaa")
Error in .readSeedFile(x, verbose = verbose) :
   seed file 'aaaaa' not found

An easy way to make sure the path is valid is to press <TAB> when
you are in the middle of typing the path: this will trigger the
auto-completion feature.
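
The same check can also be done from R before calling
forgeBSgenomeDataPkg(). Below is a minimal sketch that uses base R only.
The helper name check_seed_path() is made up for illustration and is not
part of BSgenome:

```r
# Hypothetical helper (not part of BSgenome): stop early with a clear
# message if the given seed file path does not exist.
check_seed_path <- function(path) {
    if (!file.exists(path))
        stop("seed file '", path, "' not found (working directory: ",
             getwd(), ")")
    # Return the absolute path so later calls no longer depend on getwd()
    normalizePath(path)
}

# Usage (the forge call is commented out because it needs BSgenome):
# seed <- check_seed_path("./cflo_seed.R")
# forgeBSgenomeDataPkg(seed)
```

Printing the working directory in the error message helps when a relative
path such as "./cflo_seed.R" is resolved from a different directory than
the one you expect.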

>
> 2. with command forgeSeqlengthsFile,
>  > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="",
> suffix=".fa",seqs_srcdir=".", seqs_destdir=".", verbose=TRUE)
> Saving 'seqlengths' object to compressed data file './seqlengths.rda'...
> DONE
> Warning messages:
> 1: In FUN("cflo_v3.3.fold"[[1L]], ...) :
>    In file './cflo_v3.3.fold.fa': 24026 sequences found, using first
> sequence only
> 2: In FUN("cflo_v3.3.fold"[[1L]], ...) :
>    In file './cflo_v3.3.fold.fa': sequence description
> "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28scaffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36scaffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37scaffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48scaffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54scaffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53scaffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70scaffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaffold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaffold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaffold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaffold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106scaffold56scaffold98scaffold109scaffold68sc
> [... truncated]
>
> 3. with forgeSeqFiles,
>  > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="",
> suffix=".fa",seqs_srcdir=".", seqs_destdir="./BSgenome", verbose=TRUE)
> Loading FASTA file './cflo_v3.3.fold.fa' in 'cflo_v3.3.fold' object... DONE
> Saving 'cflo_v3.3.fold' object to compressed data file
> './BSgenome/cflo_v3.3.fold.rda'... DONE
> Warning message:
> In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir,  :
>    file contains 24026 sequences, using the first sequence only

Sorry but I can only try to help if you stick to the vignette and
use a seed file. Using low-level functions like forgeSeqlengthsFile()
or forgeSeqFiles() is undocumented/unsupported. The reason for this
is that this route is *much* more complicated and error-prone
than using a seed file! This is exactly why seed files were
invented: to make the whole process much easier for you, and to
make troubleshooting much easier for us.
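
For what it's worth, the warnings from both low-level calls say the same
thing: the FASTA file holds many records and only the first one was used.
The number of records in a FASTA file can be confirmed with base R alone.
This is just a sketch; count_fasta_records() is a made-up name, not a
BSgenome or Biostrings function:

```r
# Hypothetical helper: count the records in a FASTA file by counting
# header lines, i.e. lines whose first character is ">".
count_fasta_records <- function(path) {
    lines <- readLines(path)
    sum(grepl("^>", lines))
}

# Usage:
# count_fasta_records("./cflo_v3.3.fold.fa")
```

Note that readLines() loads the whole file into memory, which is fine for
a quick check but not something to build a pipeline on.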

Cheers,
H.

>
>
>
> 2011/1/19 Hervé Pagès <hpages at fhcrc.org>
>
>     Hi Steve,
>
>     There are several things that look wrong with your seed file.
>
>     1. The # must be the first character in a line to make it a
>        line of comment (ignored).
>        For example, those 2 lines will certainly not be interpreted
>        as you might expect:
>
>
>          mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="")
>
>          source_url: NA # http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
>
>     2. If you don't have multiple sequences, specify:
>
>          mseqnames: character(0)
>
>     3. As explained in the BSgenomeForge vignette, the default for
>        seqfiles_suffix is .fa so this won't work:
>
>          seqnames: "clof_v3.fa"
>
>        Unless your file is named clof_v3.fa.fa?
>        If your file is named clof_v3.fa, then you should specify:
>
>          seqnames: "clof_v3"
>
>     4. This doesn't look like a valid file of assembly gaps:
>
>
>          AGAPSfiles_name: clof_v3.fa.masked
>
>        Please refer to the BSgenomeForge vignette for what kind of masks
>        and what file formats are supported.
>
>     5. Having this
>
>
>          PkgExamples: Hsapiens
>                 seqlengths(Hsapiens)
>                 Hsapiens$chr1  # same as Hsapiens[["chr1"]]
>
>        in a seed file for clof obviously doesn't make sense and your
>        package won't pass 'R CMD check' because it will contain broken
>        examples.
>
>     There might be other problems with your seed file. All you need to do
>     is read and follow carefully the instructions described in the
>     BSgenomeForge vignette. Let me know if things are not clear in the
>     vignette and I'll try to improve it. Thanks!
>
>     H.
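
To see how entries like those in points 2 and 3 come out, the relevant
lines can be written to a temporary file and read back with base R's
read.dcf(). This is only a sketch: it assumes, as the vignette describes,
that seqnames and mseqnames hold R expressions, and it does not reproduce
what BSgenomeForge does internally. The field values are the ones
suggested in the points above.

```r
# Write a three-line seed fragment to a temporary file
seed_file <- tempfile()
writeLines(c(
    "Package: BSgenome.Clof.yu.v3",
    "seqnames: \"clof_v3\"",
    "mseqnames: character(0)"
), seed_file)

# read.dcf() returns a character matrix with one column per field
seed <- read.dcf(seed_file)

# Evaluate the two fields as R expressions (an assumption, see above)
seqnames  <- eval(parse(text = seed[1, "seqnames"]))
mseqnames <- eval(parse(text = seed[1, "mseqnames"]))
```

With the default .fa suffix, a seqnames value of "clof_v3" corresponds to
a file named clof_v3.fa, while mseqnames evaluates to an empty character
vector, i.e. no multiple-sequence objects.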
>
>
>
>     On 01/18/2011 05:32 PM, Steve Shen wrote:
>
>         Hi list,
>
>         This wasn't sent out earlier from a non-registered ID. I have a
>         problem with
>         forging a new BSgenome data package. The sequence data file is
>         clof.v3.fa
>         and the mask file is clof.v3.fa.masked. Below are seed file,
>         command, error
>         and sessioninfo. Your help will be much appreciated.
>
>         Thanks a lot,
>         Steve
>
>         Package: BSgenome.Clof.yu.v3
>         Title: Clof (insects) full genome (version 3)
>         Description: Clof (insects) full genome as provided by YU (V3.3, Jan. 2011)
>         # and will store in Biostrings objects.
>         Version: 1.0.0
>         organism: clof
>         species: insect
>         provider: YU
>         provider_version: Assembly V3.3
>         release_date: Jan, 2011
>         release_name: insects Genome Reference Consortium
>         source_url: NA # http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
>         organism_biocview: Clof
>         BSgenomeObjname: Clof
>         seqnames: "clof_v3.fa"
>         mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="")
>         nmask_per_seq: 2
>         #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip, upstream2000.zip, upstream5000.zip
>              #from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
>         #SrcDataFiles2: AGAPS masks: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz
>              #RM masks: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz
>              #TRF masks: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz
>         PkgExamples: Hsapiens
>              seqlengths(Hsapiens)
>              Hsapiens$chr1  # same as Hsapiens[["chr1"]]
>         seqs_srcdir: /home/steve/Data/Genomes
>         masks_srcdir: /home/steve/Data/Genomes
>         AGAPSfiles_name: clof_v3.fa.masked
>
>         The command, error and sessionInfo are below
>
>             forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".", masks_srcdir=".",
>                                  destdir=".", verbose=TRUE)
>         Loading required package: Biobase
>
>         Welcome to Bioconductor
>
>            Vignettes contain introductory material. To view, type
>         'openVignette()'. To cite Bioconductor, see
>         'citation("Biobase")' and for packages 'citation(pkgname)'.
>
>
>         Attaching package: 'Biobase'
>
>         The following object(s) are masked from 'package:IRanges':
>
>              updateObject
>
>         Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, masks_srcdir = masks_srcdir,  :
>            values for symbols NMASKPERSEQ are not single strings
>         In addition: Warning message:
>         In storage.mode(x$nmask_per_seq)<- "integer" : NAs introduced by coercion
>
>             sessionInfo()
>
>         R version 2.12.1 (2010-12-16)
>         Platform: x86_64-pc-linux-gnu (64-bit)
>
>         locale:
>           [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>           [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>           [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>           [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>           [9] LC_ADDRESS=C               LC_TELEPHONE=C
>         [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>         attached base packages:
>         [1] stats     graphics  grDevices utils     datasets  methods   base
>
>         other attached packages:
>         [1] Biobase_2.6.1       BSgenome_1.18.2     Biostrings_2.18.0
>         [4] GenomicRanges_1.2.1 IRanges_1.8.8
>
>         loaded via a namespace (and not attached):
>         [1] tools_2.12.1
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M2-B876
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319