[BioC] forge BSgenome data package
Hervé Pagès
hpages at fhcrc.org
Thu Jan 20 00:43:23 CET 2011
Hi Steve,
On 01/19/2011 01:04 PM, Steve Shen wrote:
> Hi Herve,
>
> Thank you so much for your quick reply. Your vignette is pretty clear. I
> think the problem is on my side. I just lack experience in dealing
> with this issue and maybe the file format doesn't fit. I sort of gave up
> on the seed file and am just using low-level commands. I still get
> errors. I really have no idea what is going on. Your help will be much
> appreciated.
>
> Best,
> Steve
>
> 1. with seed file, I got
> > forgeBSgenomeDataPkg("./cflo_seed.R")
> Error in as.list(.readSeedFile(x, verbose = verbose)) :
> error in evaluating the argument 'x' in selecting a method for
> function 'as.list'
I agree that the error message is kind of obscure but the problem
is probably that the specified path to your seed file is invalid.
I'm about to commit a change to the BSgenomeForge code that will
produce this error message instead:
> forgeBSgenomeDataPkg("aaaaa")
Error in .readSeedFile(x, verbose = verbose) :
seed file 'aaaaa' not found
An easy way to make sure the path is valid is to press <TAB> when
you are in the middle of typing the path: this will trigger the
auto-completion feature.
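If you prefer to check the path from within R before calling
forgeBSgenomeDataPkg(), something along these lines should do. This is
only a minimal sketch using base R; "./cflo_seed.R" is just the path
from your message, so adjust it as needed.

```r
seed_path <- "./cflo_seed.R"  # path taken from the message above; adjust as needed

# A relative path is resolved against the current working directory,
# which getwd() reports.
getwd()

# file.exists() is TRUE only if the path resolves to an existing file.
if (!file.exists(seed_path)) {
    message("seed file '", seed_path, "' not found; ",
            "files in the current directory:")
    print(list.files("."))
}
```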
>
> 2. with command forgeSeqlengthsFile,
> > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="",
> suffix=".fa",seqs_srcdir=".", seqs_destdir=".", verbose=TRUE)
> Saving 'seqlengths' object to compressed data file './seqlengths.rda'...
> DONE
> Warning messages:
> 1: In FUN("cflo_v3.3.fold"[[1L]], ...) :
> In file './cflo_v3.3.fold.fa': 24026 sequences found, using first
> sequence only
> 2: In FUN("cflo_v3.3.fold"[[1L]], ...) :
> In file './cflo_v3.3.fold.fa': sequence description
> "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28scaffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36scaffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37scaffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48scaffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54scaffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53scaffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70scaffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaffold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaffold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaffold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaffold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106scaffold56scaffold98scaffold109scaffold68sc
> [... truncated]
>
> 3. with forgeSeqfiles,
> > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="",
> suffix=".fa",seqs_srcdir=".", seqs_destdir="./BSgenome", verbose=TRUE)
> Loading FASTA file './cflo_v3.3.fold.fa' in 'cflo_v3.3.fold' object... DONE
> Saving 'cflo_v3.3.fold' object to compressed data file
> './BSgenome/cflo_v3.3.fold.rda'... DONE
> Warning message:
> In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir, :
> file contains 24026 sequences, using the first sequence only
Sorry but I can only try to help if you stick to the vignette and
use a seed file. Using low-level functions like forgeSeqlengthsFile()
or forgeSeqFiles() is undocumented/unsupported. The reason for this
is that this route is *much* more complicated and error-prone
than using a seed file! This is exactly why seed files were
invented: to make the whole process much easier for you, and to
make troubleshooting much easier for us.
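To illustrate the kind of check a seed file makes possible, here is a
rough sketch in base R. It assumes the seed file can be parsed as a DCF
file once whole-line comments have been dropped, and that each FASTA
file name is the sequence name plus the suffix (as the './cflo_v3.3.fold.fa'
path in your output suggests). The check_seed() helper and the temporary
files are made up for the example; they are not part of BSgenome.

```r
# Hypothetical helper (not part of BSgenome): list the FASTA files that a
# seed file appears to point at, and whether each one exists.
check_seed <- function(seed_path, seqs_srcdir = ".", suffix = ".fa") {
    lines <- readLines(seed_path)
    # Only lines that START with '#' are treated as comments here.
    lines <- lines[!grepl("^#", lines)]
    tmp <- tempfile()
    on.exit(unlink(tmp))
    writeLines(lines, tmp)
    seed <- read.dcf(tmp)
    # 'seqnames' holds an R expression, e.g. "clof_v3" or c("chr1", "chr2").
    seqnames <- eval(parse(text = seed[1, "seqnames"]))
    files <- file.path(seqs_srcdir, paste(seqnames, suffix, sep = ""))
    data.frame(seqname = seqnames, file = files,
               exists = file.exists(files), stringsAsFactors = FALSE)
}

# Demo on a throwaway seed file and FASTA file in a temporary directory.
demo_dir <- tempfile("seed_demo")
dir.create(demo_dir)
writeLines(c(">scaffold1", "ACGT"), file.path(demo_dir, "clof_v3.fa"))
seed_file <- file.path(demo_dir, "clof_seed.txt")
writeLines(c("Package: BSgenome.Clof.yu.v3",
             "# this whole line is a comment",
             "seqnames: \"clof_v3\"",
             "mseqnames: character(0)"), seed_file)
res <- check_seed(seed_file, seqs_srcdir = demo_dir)
print(res)
```

With seqnames: "clof_v3.fa" in the seed file, the same check would look
for clof_v3.fa.fa and report it as missing.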
Cheers,
H.
>
>
>
> 2011/1/19 Hervé Pagès <hpages at fhcrc.org>
>
> Hi Steve,
>
> There are several things that look wrong with your seed file.
>
> 1. The # must be the first character in a line to make it a
> line of comment (ignored).
> For example, those 2 lines will certainly not be interpreted
> as you might expect:
>
>
> mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="")
>
> source_url: NA #
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
>
> 2. If you don't have multiple sequences, specify:
>
> mseqnames: character(0)
>
> 3. As explained in the BSgenomeForge vignette, the default for
> seqfiles_suffix is .fa so this won't work:
>
> seqnames: "clof_v3.fa"
>
> Unless your file is named clof_v3.fa.fa?
> If your file is named clof_v3.fa, then you should specify:
>
> seqnames: "clof_v3"
>
> 4. This doesn't look like a valid file of assembly gaps:
>
>
> AGAPSfiles_name: clof_v3.fa.masked
>
> Please refer to the BSgenomeForge vignette for what kind of masks
> and what file formats are supported.
>
> 5. Having this
>
>
> PkgExamples: Hsapiens
> seqlengths(Hsapiens)
> Hsapiens$chr1 # same as Hsapiens[["chr1"]]
>
> in a seed file for clof obviously doesn't make sense and your
> package won't pass 'R CMD check' because it will contain broken
> examples.
>
> There might be other problems with your seed file. All you need to do
> is read and follow carefully the instructions described in the
> BSgenomeForge vignette. Let me know if things are not clear in the
> vignette and I'll try to improve it. Thanks!
>
> H.
>
>
>
> On 01/18/2011 05:32 PM, Steve Shen wrote:
>
> Hi list,
>
> This wasn't sent out with a non-registered id. I have a
> problem with
> forging a new BSgenome data package. The sequence data file is
> clof.v3.fa
> and the mask file is clof.v3.fa.masked. Below are seed file,
> command, error
> and sessioninfo. Your help will be much appreciated.
>
> Thanks a lot,
> Steve
>
> Package: BSgenome.Clof.yu.v3
> Title: Clof (insects) full genome (version 3)
> Description: Clof (insects) full genome as provided by YU (V3.3,
> Jan. 2011)
> # and will store in Biostrings objects.
> Version: 1.0.0
> organism: clof
> species: insect
> provider: YU
> provider_version: Assembly V3.3
> release_date: Jan, 2011
> release_name: insects Genome Reference Consortium
> source_url: NA #
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
> organism_biocview: Clof
> BSgenomeObjname: Clof
> seqnames: "clof_v3.fa"
> mseqnames: NA #paste("upstream", c("1000", "2000", "5000"), sep="")
> nmask_per_seq: 2
> #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip,
> upstream2000.zip,
> upstream5000.zip
> #from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
> #SrcDataFiles2: AGAPS masks:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz
> #RM masks:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz
> #TRF masks:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz
> PkgExamples: Hsapiens
> seqlengths(Hsapiens)
> Hsapiens$chr1 # same as Hsapiens[["chr1"]]
> seqs_srcdir: /home/steve/Data/Genomes
> masks_srcdir: /home/steve/Data/Genomes
> AGAPSfiles_name: clof_v3.fa.masked
>
> The command, error and sessionInfo are below
>
> forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".",
> masks_srcdir=".",
>
> destdir=".", verbose=TRUE)
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
> Vignettes contain introductory material. To view, type
> 'openVignette()'. To cite Bioconductor, see
> 'citation("Biobase")' and for packages 'citation(pkgname)'.
>
>
> Attaching package: 'Biobase'
>
> The following object(s) are masked from 'package:IRanges':
>
> updateObject
>
> Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir,
> masks_srcdir =
> masks_srcdir, :
> values for symbols NMASKPERSEQ are not single strings
> In addition: Warning message:
> In storage.mode(x$nmask_per_seq)<- "integer" : NAs introduced by
> coercion
>
> sessionInfo()
>
> R version 2.12.1 (2010-12-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biobase_2.6.1 BSgenome_1.18.2 Biostrings_2.18.0
> [4] GenomicRanges_1.2.1 IRanges_1.8.8
>
> loaded via a namespace (and not attached):
> [1] tools_2.12.1
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319