[BioC] Question about MEDIPS extend parameter : RE: Success: RE: Seed file error: RE: BioStrings for current R: RE: BSgenomeForge seed file - seqnames field
Matthias Lienhard
lienhard at molgen.mpg.de
Tue Dec 17 00:17:21 CET 2013
Hi Kelly,
yes, the MEDIPS extend parameter is supposed to be the size of the
fragments that have been IP'd.
Best, Matthias
Am 16.12.2013 15:03, schrieb Hervé Pagès:
> Hi Kelly,
>
> I'm afraid you're asking the wrong person, sorry. The best place to ask
> this kind of question is on the mailing list (cc'ed) where there is a
> good chance that someone will be able to help you.
>
> Cheers,
> H.
>
>
> On 12/16/2013 02:55 PM, Vining, Kelly wrote:
>> Hi Herve,
>> I have a quick question about the 'extend' vs. 'shift' parameters in
>> the MEDIPS package. From the documentation:
>>
>> All reads will be extended to a length of 300nt according to the
>> given strandinformation:
>>> extend=300
>> As an alternative to the extend parameter, the
>> shift parameter can be specifed. Here, the reads are not extended but
>> shifted by the specifed number
>> of nucleotides with respect to the given strand infomation. One of
>> the two parameters extend or
>> shift has to be 0.
>>
>> Is this parameter supposed to represent the IP'd fragment size?
>>
>> Thanks,
>> --Kelly
>>
>> -----Original Message-----
>> From: Vining, Kelly
>> Sent: Thursday, December 12, 2013 6:51 AM
>> To: Hervé Pagès
>> Subject: Success: RE: Seed file error: RE: BioStrings for current R:
>> RE: [BioC] BSgenomeForge seed file - seqnames field
>>
>> Hi Herve,
>> I have now succeeded in forging the poplar BSgenome object. I will
>> have to use it in order to confirm that it's exactly as I expect, but
>> just wanted to let you know I got this far and thank you again for
>> your help.
>>
>> Cheers,
>> --Kelly
>> ________________________________________
>> From: Hervé Pagès [hpages at fhcrc.org]
>> Sent: Tuesday, December 10, 2013 2:38 PM
>> To: Vining, Kelly
>> Cc: bioconductor at r-project.org
>> Subject: Re: Seed file error: RE: BioStrings for current R: RE:
>> [BioC] BSgenomeForge seed file - seqnames field
>>
>> Hi Kelly,
>>
>> Hard to tell without knowing exactly what you've downloaded but the
>> 1st thing that strikes me when I look at your seed file below is
>> that, depending on where I look exactly, the Populus trichocarpa
>> genome seems to have either 12 chromosomes (based on your seqnames
>> field), or 13 chromosomes (based on your SrcDataFiles1 field), or 19
>> chromosomes (based on the error reported by forgeBSgenomeDataPkg()).
>> You need to figure out how many chromosomes there are and stick to that.
>>
>> A few other things:
>>
>> - No ".fa" extensions in the seqnames.
>>
>> - Make sure you have one .fa file per chromosome. Each file should be
>> named Chr01.fa, Chr02.fa, Chr03.fa, etc... Note that if you've
>> downloaded Ptrichocarpa_210.fa.gz from
>>
>> ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/assembly/
>>
>> this file is a single FASTA file that contains all the chromosome
>> sequences so you need to split it first in order to have one FASTA
>> file per chromosome. See the splitbigfasta.R script in
>> BSgenome/inst/extdata/GentlemanLab/BSgenome.Btaurus.UCSC.bosTau6-tools
>> for how to do this (you will need to adapt the script to your
>> situation). I realize this splitting step is a burden and I have on
>> my list to improve forgeBSgenomeDataPkg() so it won't be needed
>> anymore.
>>
>> - Finally make sure the Chr01.fa, Chr02.fa, Chr03.fa, etc... files
>> are in the folder specified in the seqs_srcdir field
>> (/raid1/home/pi/viningk/Poplar/genomedb in your case).
>>
>> Let me know if that still doesn't work for you.
>>
>> Cheers,
>> H.
>>
>>
>> On 12/10/2013 12:56 PM, Vining, Kelly wrote:
>>> Hi Herve,
>>> I'm almost there with my BSgenome package, but am still getting an
>>> error. It's clearly due to my seed file format, but I am not sure
>>> what to change at this point. It appears that BSgenome is not seeing
>>> all of my fasta files as separate. I tried not having the .fa
>>> suffixes in the seqnames field, but got a similar "files not found"
>>> error.
>>>
>>> Thanks again for your help.
>>>
>>> Error:
>>>> forgeBSgenomeDataPkg("BSgenome.Ptrichocarpa.Phytozome.v3-seed")
>>> Creating package in ./BSgenome.Ptrichocarpa.Phytozome.v3
>>> Error in getSeqSrcpaths(seqnames, prefix = prefix, suffix = suffix,
>>> seqs_srcdir = seqs_srcdir) :
>>> /raid1/home/pi/viningk/Poplar/genomedb/Chr01.fa Chr02.fa Chr03.fa
>>> Chr04.fa Chr05.fa Chr06.fa Chr07.fa Chr08.fa Chr09.fa Chr10.fa
>>> Chr11.fa Chr12.fa Chr13.fa Chr14.fa Chr15.fa Chr16.fa Chr17.fa
>>> Chr18.fa Chr19.fa.fa: file(s) not found
>>>
>>> My seed file:
>>>
>>> Package: BSgenome.Ptrichocarpa.Phytozome.v3
>>> Title: Populus trichocarpa full genome (Phytozome version 3)
>>> Description: Populus trichocarpa (Black cottonwood) genome as provided
>>> by Phytozome (v3, 2013) stored in Bioconductor
>>> Version: 3.0
>>> organism: Populus trichocarpa
>>> species: Black cottonwood
>>> provider: Phytozome (JGI)
>>> provider_version: 3.0
>>> release_date: January 2010
>>> release_name: Populus trichocarpa v3.0
>>> source_url:
>>> ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/
>>> organism_biocview: Populus_trichocarpa
>>> BSgenomeObjname: Ptrichocarpa
>>> seqnames: paste("Chr01.fa", "Chr02.fa", "Chr03.fa", "Chr04.fa",
>>> "Chr05.fa", "Chr06.fa", "Chr07.fa", "Chr08.fa", "Chr09.fa",
>>> "Chr10.fa", "Chr11.fa", "Chr12.fa",$
>>> mseqnames: names(scaf_mseq)
>>> SrcDataFiles1: sequences: Chr01.fa, Chr02.fa, Chr03.fa, Chr04.fa,
>>> Chr05.fa, Chr06.fa, Chr07.fa, Chr08.fa, Chr09.fa, Chr10.fa, Chr11.fa,
>>> Chr12.fa, Chr13.fa, Chr$
>>> PkgExamples: genome$Chr01 # same as genome[["Chr01"]]
>>> seqs_srcdir: /raid1/home/pi/viningk/Poplar/genomedb
>>>
>>>
>>> ________________________________________
>>> From: Hervé Pagès [hpages at fhcrc.org]
>>> Sent: Monday, December 09, 2013 2:09 PM
>>> To: Vining, Kelly
>>> Cc: bioconductor at r-project.org
>>> Subject: Re: BioStrings for current R: RE: [BioC] BSgenomeForge seed
>>> file - seqnames field
>>>
>>> Hi Kelly,
>>>
>>> On 12/06/2013 12:23 PM, Vining, Kelly wrote:
>>>> Hi Herve,
>>>> It has been a while since I have worked with Biostrings and the
>>>> MEDIPS package. When I went to install Biostrings just now, got an
>>>> error message that Biostrings isn't available:
>>>>
>>>>> install.packages("Biostrings")
>>>> Installing package into â/raid1/home/pi/viningk/Râ (as âlibâ is
>>>> unspecified)
>>>> --- Please select a CRAN mirror for use in this session ---
>>>
>>> Here you are asked to choose a CRAN mirror but Biostrings is not a
>>> CRAN package. This suggests that you might be on the wrong path.
>>>
>>>
>>>> Warning message:
>>>> package âBiostringsâ is not available (for R version 3.0.0)
>>>>
>>>> Revisiting the online documentation, it does look like the package
>>>> is still available.
>>>> Is there an available update, or a different alternative I should
>>>> be using?
>>>
>>> The only proper way to install a Bioconductor package is to use
>>> biocLite(), as documented here:
>>>
>>> http://bioconductor.org/install/
>>>
>>> Unlike install.packages(), biocLite() "knows" where to find
>>> Bioconductor packages. Also make sure you're using the latest BioC
>>> release (BioC 2.13).
>>>
>>> Cheers,
>>> H.
>>>
>>>>
>>>> Thanks,
>>>> --Kelly Vining
>>>> ________________________________________
>>>> From: Hervé Pagès [hpages at fhcrc.org]
>>>> Sent: Thursday, May 02, 2013 10:49 AM
>>>> To: Vining, Kelly
>>>> Cc: Kelly V [guest]; bioconductor at r-project.org; MEDIPS Maintainer
>>>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field
>>>>
>>>> Hi Kelly,
>>>>
>>>> On 05/02/2013 06:44 AM, Vining, Kelly wrote:
>>>>> Hi Herve,
>>>>> Thanks for your helpful response, and for pointing me to the
>>>>> proper scripts. Given this, it appears that I should treat my gene
>>>>> feature annotation files (promoters, genes) as mseqnames objects,
>>>>> similarly to how I will treat the extrachromosomal scaffolds. Is
>>>>> there any way to include additional features that I may find of
>>>>> interest after I forge this initial package (say, using gff
>>>>> files), or am I limited such that I would have to re-forge a new
>>>>> reference?
>>>>
>>>> A BSgenome data package is for storing the sequences of a given
>>>> genome/assembly. No annotations should go there.
>>>>
>>>> In Bioconductor we try to maintain a clear separation between
>>>> packages that contain the sequences of a genome/assembly (BSgenome
>>>> data packages), and packages that contain annotations for that
>>>> genome/assembly (TxDb, FDb, OrganismDb, SNPlocs, etc... packages).
>>>>
>>>> There are at least 2 reasons for this:
>>>>
>>>> 1. For a given assembly there are many kinds of annotations
>>>> available
>>>> in many places on the internet, and some of them tend to
>>>> be updated
>>>> frequently. By contrast, the sequences of a given assembly
>>>> never change. So putting only the reference sequences in a
>>>> BSgenome package make it very stable. It almost never
>>>> needs to
>>>> be updated, except for updating a man page or when something
>>>> changes in the way sequences are stored in the package.
>>>> This is
>>>> good with such big packages (800 Mb for the biggest ones).
>>>>
>>>> 2. Storing the DNA sequences of the genes, promoters, or
>>>> other genomic
>>>> feature in a BSgenome package would duplicate a lot of DNA
>>>> sequences that are already in the reference sequences (those
>>>> small sequences are substrings of the reference sequences).
>>>> This would make the BSgenome package unnecessarily bigger.
>>>> It's better to store only the locations of those genomic
>>>> features,
>>>> not in the BSgenome package, but in a separate package,
>>>> and then
>>>> to use those locations to extract the corresponding
>>>> sequences from
>>>> the BSgenome object. This extraction can be done with
>>>> getSeq()
>>>> (defined in the BSgenome software package) and is relatively
>>>> cheap. This approach gives you a lot more flexibility than
>>>> having
>>>> those sequences pre-extracted and bundled with the
>>>> BSgenome data
>>>> package.
>>>>
>>>> If your annotations are in GFF files, you don't really
>>>> need to
>>>> put them in a package: you can load them directly in GRanges
>>>> objects with import() (defined in the rtracklayer package),
>>>> or if you are particularly interested in the exon/transcript
>>>> structure of your genes, you could use
>>>> makeTranscriptDbFromGFF()
>>>> to load a GFF file containing genes, mRNAs, exons, and
>>>> CDSs into
>>>> a TranscriptDb. See ?makeTranscriptDbFromGFF in the
>>>> GenomicFeatures package for more information. Then it's
>>>> easy to
>>>> extract the locations of transcripts, exons, or CDSs as a
>>>> GRanges
>>>> or GRangesList object from this TranscriptDb object. See
>>>> ?transcripts and ?exonsBy. Finally once you have the
>>>> locations
>>>> in a GRanges object, you can use getSeq() to extract the
>>>> corresponding sequences from the BSgenome object.
>>>>
>>>> Other more specialized functions are promoters() and
>>>> getPromoterSeq() for extracting the promoter locations and
>>>> their sequences, respectively. Also
>>>> extractTranscriptsFromGenome()
>>>> for extracting the full transcriptome from a BSgenome
>>>> package.
>>>>
>>>> HTH,
>>>> H.
>>>>
>>>> PS: The upstream sequences that you see in some of our BSgenome data
>>>> packages are a relic of the past and will be removed soon.
>>>>
>>>>>
>>>>> Thanks much,
>>>>> --Kelly V.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>>>>> Sent: Tuesday, April 30, 2013 11:12 PM
>>>>> To: Kelly V [guest]
>>>>> Cc: bioconductor at r-project.org; Vining, Kelly; MEDIPS Maintainer
>>>>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field
>>>>>
>>>>> Hi Kelly,
>>>>>
>>>>> On 04/30/2013 03:52 PM, Kelly V [guest] wrote:
>>>>>>
>>>>>> I'm preparing a custom reference genome for use with the MEDIPS
>>>>>> package. I see that one field of the seed file, which is
>>>>>> apparently not optional, is the 'seqnames' field. The example
>>>>>> given in the documentation is this:
>>>>>>
>>>>>> paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"),
>>>>>> "_random", sep="")), sep="")
>>>>>>
>>>>>> I have two simple questions about this.
>>>>>>
>>>>>> 1. Does R match this information with the source sequence file?
>>>>>> For example, if I have a single fasta file with fasta headers
>>>>>> chr_01...chr_20, must the seqnames entries exactly match those
>>>>>> headers?
>>>>>
>>>>> No. You need to provide 1 FASTA file per single sequence, that is,
>>>>> 1 file per name you put in the 'seqnames' field. That means that
>>>>> each file is expected to contain only 1 sequence. What's in the
>>>>> FASTA header of each file is not important. What's important is
>>>>> that the name of each file be of the form
>>>>> <prefix><seqname><suffix>, where <seqname> is the name of the
>>>>> sequence as it appears in the 'seqnames' field, and <prefix> and
>>>>> <suffix> are a prefix and a suffix (eventually empty) that are the
>>>>> same for all the files.
>>>>>
>>>>> If what you have is a big FASTA file containing all the chromosome
>>>>> sequences, then you first need to split it into 1 file per
>>>>> chromosome.
>>>>> This is easy to do in R. For example, here is the script I used to
>>>>> split the big bosTau6.fa file provided by UCSC for the bosTau6
>>>>> genome:
>>>>>
>>>>> library(Biostrings)
>>>>> bosTau6 <- readDNAStringSet("bosTau6.fa")
>>>>>
>>>>> ### Partitioning:
>>>>> is_chrUn <- grepl("^chrUn", names(bosTau6))
>>>>> is_chrom <- !is_chrUn
>>>>>
>>>>> ### Send each chromosome to a FASTA file.
>>>>> seqnames <- paste("chr", c(1:29, "X", "M"), sep="")
>>>>> stopifnot(setequal(seqnames, names(bosTau6)[is_chrom]))
>>>>>
>>>>> for (seqname in seqnames) {
>>>>> seq <- bosTau6[match(seqname, names(bosTau6))]
>>>>> filename <- paste(seqname, ".fa", sep="")
>>>>> cat("writing ", filename, "\n", sep="")
>>>>> writeXStringSet(seq, file=filename, width=50L)
>>>>> }
>>>>>
>>>>> ### Send the 3286 chrUn_* sequences to 1 FASTA file.
>>>>> chrUn_mseq <- bosTau6[is_chrUn]
>>>>> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)
>>>>>
>>>>> The input is the bosTau6.fa file containing 3317 FASTA records:
>>>>> 31 records for the chromosomes, and 3286 for the chrUn_* sequences.
>>>>> The script produces 32 FASTA files: 1 per chromosome (chr1.fa,
>>>>> chr2.fa, chr3.fa, ..., chr29.fa, chrX.fa, chrM.fa), and the
>>>>> chrUn.fa file (containing the 3286 chrUn_* sequences).
>>>>>
>>>>> Note that you can find this script in the BSgenome package, and
>>>>> display its source with:
>>>>>
>>>>> > splitbigfasta_R <- system.file("extdata",
>>>>> "GentlemanLab",
>>>>> "BSgenome.Btaurus.UCSC.bosTau6-tools",
>>>>> "splitbigfasta.R",
>>>>> package="BSgenome")
>>>>>
>>>>> > cat(readLines(splitbigfasta_R), sep="\n")
>>>>>
>>>>> It should not be too hard to adapt this script to your own needs.
>>>>>
>>>>>>
>>>>>> 2. Revealing the reason for my first question:In my genome fasta
>>>>>> file, I have 1427 extrachromosomal scaffolds, but they are not
>>>>>> all sequentially numbered, so that I have
>>>>>> scaffold_1..scaffold_3681. Do I need to use a regular expression
>>>>>> in my seqnames field to tell R to look for scaffold_ followed by
>>>>>> 1-4 digits?
>>>>>
>>>>> No, not in the 'seqnames' field, because those 1427
>>>>> extrachromosomal scaffolds should not go there. They would need to
>>>>> go in the 'mseqnames'
>>>>> field.
>>>>>
>>>>> Unlike the 'seqnames' field where you enumerate objects that can
>>>>> only contain 1 sequence, in the 'mseqnames' field you can enumerate
>>>>> objects that contain more than 1 sequence. More precisely, in the
>>>>> final BSgenome data package, each entry in the 'seqnames' field will
>>>>> correspond to a DNAString object (the DNAString container can hold
>>>>> 1 sequence only), and each entry in the 'mseqnames' field will
>>>>> correspond to a DNAStringSet object (the DNAStringSet container
>>>>> can hold multiple sequences).
>>>>>
>>>>> So typically, all the extrachromosomal scaffolds would go in one
>>>>> DNAStringSet object in the final BSgenome package (this is for
>>>>> example what is done in the BSgenome.Drerio.UCSC.danRer7 package).
>>>>> To achieve this, you need to put 1 entry in the 'mseqnames' field
>>>>> (e.g. "scaffolds"), and to put the 1427 extrachromosomal scaffolds
>>>>> in one FASTA file named accordingly to that entry (e.g.
>>>>> scaffolds.fa).
>>>>>
>>>>> In the above script, replace:
>>>>>
>>>>> is_chrUn <- grepl("^chrUn", names(bosTau6))
>>>>>
>>>>> with:
>>>>>
>>>>> is_scaffold <- grepl("^scaffold", names(<your_genome>))
>>>>>
>>>>> then every occurrence of 'is_chrUn' with 'is_scaffold', and
>>>>> finally those 3 lines:
>>>>>
>>>>> ### Send the 3286 chrUn_* sequences to 1 FASTA file.
>>>>> chrUn_mseq <- bosTau6[is_chrUn]
>>>>> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)
>>>>>
>>>>> with:
>>>>>
>>>>> ### Send the 1427 scaffold_* sequences to 1 FASTA file.
>>>>> scaffolds_mseq <- <your_genome>[is_scaffold]
>>>>> writeXStringSet(scaffolds_mseq, file="scaffolds.fa",
>>>>> width=50L)
>>>>>
>>>>> and that should take care of sending all the scaffold sequences to
>>>>> the scaffolds.fa file.
>>>>>
>>>>> Let me know if you need further assistance with this.
>>>>>
>>>>> Cheers,
>>>>> H.
>>>>>
>>>>>>
>>>>>> Thanks for any help,
>>>>>> --Kelly V.
>>>>>>
>>>>>> -- output of sessionInfo():
>>>>>>
>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_COLLATE=English_United States.1252 [2]
>>>>>> LC_CTYPE=English_United
>>>>>> States.1252 [3] LC_MONETARY=English_United States.1252 [4]
>>>>>> LC_NUMERIC=C [5] LC_TIME=English_United States.1252
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> --
>>>>>> Sent via the guest posting facility at bioconductor.org.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone: (206) 667-5791
>>>>> Fax: (206) 667-1319
>>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone: (206) 667-5791
>>>> Fax: (206) 667-1319
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>
More information about the Bioconductor
mailing list