[BioC] Question about MEDIPS extend parameter : RE: Success: RE: Seed file error: RE: BioStrings for current R: RE: BSgenomeForge seed file - seqnames field

Tue Dec 17 00:17:21 CET 2013

Hi Kelly,

yes, the MEDIPS extend parameter is supposed to be the size of the 
fragments that have been IP'd.

Best, Matthias

Am 16.12.2013 15:03, schrieb Hervé Pagès:
> Hi Kelly,
>
> I'm afraid you're asking the wrong person, sorry. The best place to ask
> this kind of question is on the mailing list (cc'ed) where there is a
> good chance that someone will be able to help you.
>
> Cheers,
> H.
>
>
> On 12/16/2013 02:55 PM, Vining, Kelly wrote:
>> Hi Herve,
>> I have a quick question about the 'extend' vs. 'shift' parameters in 
>> the MEDIPS package. From the documentation:
>>
>> All reads will be extended to a length of 300nt according to the 
>> given strandinformation:
>>> extend=300
>> As an alternative to the extend parameter, the
>> shift parameter can be specifed. Here, the reads are not extended but 
>> shifted by the specifed number
>> of nucleotides with respect to the given strand infomation. One of 
>> the two parameters extend or
>> shift has to be 0.
>>
>> Is this parameter supposed to represent the IP'd fragment size?
>>
>> Thanks,
>> --Kelly
>>
>> -----Original Message-----
>> From: Vining, Kelly
>> Sent: Thursday, December 12, 2013 6:51 AM
>> To: Hervé Pagès
>> Subject: Success: RE: Seed file error: RE: BioStrings for current R: 
>> RE: [BioC] BSgenomeForge seed file - seqnames field
>>
>> Hi Herve,
>> I have now succeeded in forging the poplar BSgenome object. I will 
>> have to use it in order to confirm that it's exactly as I expect, but 
>> just wanted to let you know I got this far and thank you again for 
>> your help.
>>
>> Cheers,
>> --Kelly
>> ________________________________________
>> From: Hervé Pagès [hpages at fhcrc.org]
>> Sent: Tuesday, December 10, 2013 2:38 PM
>> To: Vining, Kelly
>> Cc: bioconductor at r-project.org
>> Subject: Re: Seed file error: RE: BioStrings for current R: RE: 
>> [BioC] BSgenomeForge seed file - seqnames field
>>
>> Hi Kelly,
>>
>> Hard to tell without knowing exactly what you've downloaded but the 
>> 1st thing that strikes me when I look at your seed file below is 
>> that, depending on where I look exactly, the Populus trichocarpa 
>> genome seems to have either 12 chromosomes (based on your seqnames 
>> field), or 13 chromosomes (based on your SrcDataFiles1 field), or 19 
>> chromosomes (based on the error reported by forgeBSgenomeDataPkg()). 
>> You need to figure out how many chromosomes there are and stick to that.
>>
>> A few other things:
>>
>> - No ".fa" extensions in the seqnames.
>>
>> - Make sure you have one .fa file per chromosome. Each file should be
>>     named Chr01.fa, Chr02.fa, Chr03.fa, etc... Note that if you've
>>     downloaded Ptrichocarpa_210.fa.gz from
>>
>> ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/assembly/
>>
>>     this file is a single FASTA file that contains all the chromosome
>>     sequences so you need to split it first in order to have one FASTA
>>     file per chromosome. See the splitbigfasta.R script in
>> BSgenome/inst/extdata/GentlemanLab/BSgenome.Btaurus.UCSC.bosTau6-tools
>>     for how to do this (you will need to adapt the script to your
>>     situation). I realize this splitting step is a burden and I have on
>>     my list to improve forgeBSgenomeDataPkg() so it won't be needed
>>     anymore.
>>
>> - Finally make sure the Chr01.fa, Chr02.fa, Chr03.fa, etc... files
>>     are in the folder specified in the seqs_srcdir field
>>     (/raid1/home/pi/viningk/Poplar/genomedb in your case).
>>
>> Let me know if that still doesn't work for you.
>>
>> Cheers,
>> H.
>>
>>
>> On 12/10/2013 12:56 PM, Vining, Kelly wrote:
>>> Hi Herve,
>>> I'm almost there with my BSgenome package, but am still getting an 
>>> error. It's clearly due to my seed file format, but I am not sure 
>>> what to change at this point. It appears that BSgenome is not seeing 
>>> all of my fasta files as separate. I tried not having the .fa 
>>> suffixes in the seqnames field, but got a similar "files not found" 
>>> error.
>>>
>>> Thanks again for your help.
>>>
>>> Error:
>>>> forgeBSgenomeDataPkg("BSgenome.Ptrichocarpa.Phytozome.v3-seed")
>>> Creating package in ./BSgenome.Ptrichocarpa.Phytozome.v3
>>> Error in getSeqSrcpaths(seqnames, prefix = prefix, suffix = suffix, 
>>> seqs_srcdir = seqs_srcdir) :
>>>     /raid1/home/pi/viningk/Poplar/genomedb/Chr01.fa Chr02.fa Chr03.fa
>>> Chr04.fa Chr05.fa Chr06.fa Chr07.fa Chr08.fa Chr09.fa Chr10.fa
>>> Chr11.fa Chr12.fa Chr13.fa Chr14.fa Chr15.fa Chr16.fa Chr17.fa
>>> Chr18.fa Chr19.fa.fa: file(s) not found
>>>
>>> My seed file:
>>>
>>> Package: BSgenome.Ptrichocarpa.Phytozome.v3
>>> Title: Populus trichocarpa full genome (Phytozome version 3)
>>> Description: Populus trichocarpa (Black cottonwood) genome as provided
>>> by Phytozome (v3, 2013) stored in Bioconductor
>>> Version: 3.0
>>> organism: Populus trichocarpa
>>> species: Black cottonwood
>>> provider: Phytozome (JGI)
>>> provider_version: 3.0
>>> release_date: January 2010
>>> release_name: Populus trichocarpa v3.0
>>> source_url:
>>> ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/
>>> organism_biocview: Populus_trichocarpa
>>> BSgenomeObjname: Ptrichocarpa
>>> seqnames: paste("Chr01.fa", "Chr02.fa", "Chr03.fa", "Chr04.fa",
>>> "Chr05.fa", "Chr06.fa", "Chr07.fa", "Chr08.fa", "Chr09.fa",
>>> "Chr10.fa", "Chr11.fa", "Chr12.fa",$
>>> mseqnames: names(scaf_mseq)
>>> SrcDataFiles1: sequences: Chr01.fa, Chr02.fa, Chr03.fa, Chr04.fa,
>>> Chr05.fa, Chr06.fa, Chr07.fa, Chr08.fa, Chr09.fa, Chr10.fa, Chr11.fa,
>>> Chr12.fa, Chr13.fa, Chr$
>>> PkgExamples: genome$Chr01 # same as genome[["Chr01"]]
>>> seqs_srcdir: /raid1/home/pi/viningk/Poplar/genomedb
>>>
>>>
>>> ________________________________________
>>> From: Hervé Pagès [hpages at fhcrc.org]
>>> Sent: Monday, December 09, 2013 2:09 PM
>>> To: Vining, Kelly
>>> Cc: bioconductor at r-project.org
>>> Subject: Re: BioStrings for current R: RE: [BioC] BSgenomeForge seed
>>> file - seqnames field
>>>
>>> Hi Kelly,
>>>
>>> On 12/06/2013 12:23 PM, Vining, Kelly wrote:
>>>> Hi Herve,
>>>> It has been a while since I have worked with Biostrings and the 
>>>> MEDIPS package. When I went to install Biostrings just now, got an 
>>>> error message that Biostrings isn't available:
>>>>
>>>>> install.packages("Biostrings")
>>>> Installing package into â/raid1/home/pi/viningk/Râ (as âlibâ is
>>>> unspecified)
>>>> --- Please select a CRAN mirror for use in this session ---
>>>
>>> Here you are asked to choose a CRAN mirror but Biostrings is not a
>>> CRAN package. This suggests that you might be on the wrong path.
>>>
>>>
>>>> Warning message:
>>>> package âBiostringsâ is not available (for R version 3.0.0)
>>>>
>>>> Revisiting the online documentation, it does look like the package 
>>>> is still available.
>>>> Is there an available update, or a different alternative I should 
>>>> be using?
>>>
>>> The only proper way to install a Bioconductor package is to use
>>> biocLite(), as documented here:
>>>
>>>      http://bioconductor.org/install/
>>>
>>> Unlike install.packages(), biocLite() "knows" where to find
>>> Bioconductor packages. Also make sure you're using the latest BioC
>>> release (BioC 2.13).
>>>
>>> Cheers,
>>> H.
>>>
>>>>
>>>> Thanks,
>>>> --Kelly Vining
>>>> ________________________________________
>>>> From: Hervé Pagès [hpages at fhcrc.org]
>>>> Sent: Thursday, May 02, 2013 10:49 AM
>>>> To: Vining, Kelly
>>>> Cc: Kelly V [guest]; bioconductor at r-project.org; MEDIPS Maintainer
>>>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field
>>>>
>>>> Hi Kelly,
>>>>
>>>> On 05/02/2013 06:44 AM, Vining, Kelly wrote:
>>>>> Hi Herve,
>>>>> Thanks for your helpful response, and for pointing me to the 
>>>>> proper scripts. Given this, it appears that I should treat my gene 
>>>>> feature annotation files (promoters, genes) as mseqnames objects, 
>>>>> similarly to how I will treat the extrachromosomal scaffolds. Is 
>>>>> there any way to include additional features that I may find of 
>>>>> interest after I forge this initial package (say, using gff 
>>>>> files), or am I limited such that I would have to re-forge a new 
>>>>> reference?
>>>>
>>>> A BSgenome data package is for storing the sequences of a given
>>>> genome/assembly. No annotations should go there.
>>>>
>>>> In Bioconductor we try to maintain a clear separation between
>>>> packages that contain the sequences of a genome/assembly (BSgenome
>>>> data packages), and packages that contain annotations for that
>>>> genome/assembly (TxDb, FDb, OrganismDb, SNPlocs, etc... packages).
>>>>
>>>> There are at least 2 reasons for this:
>>>>
>>>>       1. For a given assembly there are many kinds of annotations 
>>>> available
>>>>          in many places on the internet, and some of them tend to 
>>>> be updated
>>>>          frequently. By contrast, the sequences of a given assembly
>>>>          never change. So putting only the reference sequences in a
>>>>          BSgenome package make it very stable. It almost never 
>>>> needs to
>>>>          be updated, except for updating a man page or when something
>>>>          changes in the way sequences are stored in the package. 
>>>> This is
>>>>          good with such big packages (800 Mb for the biggest ones).
>>>>
>>>>       2. Storing the DNA sequences of the genes, promoters, or 
>>>> other genomic
>>>>          feature in a BSgenome package would duplicate a lot of DNA
>>>>          sequences that are already in the reference sequences (those
>>>>          small sequences are substrings of the reference sequences).
>>>>          This would make the BSgenome package unnecessarily bigger.
>>>>          It's better to store only the locations of those genomic 
>>>> features,
>>>>          not in the BSgenome package, but in a separate package, 
>>>> and then
>>>>          to use those locations to extract the corresponding 
>>>> sequences from
>>>>          the BSgenome object. This extraction can be done with 
>>>> getSeq()
>>>>          (defined in the BSgenome software package) and is relatively
>>>>          cheap. This approach gives you a lot more flexibility than 
>>>> having
>>>>          those sequences pre-extracted and bundled with the 
>>>> BSgenome data
>>>>          package.
>>>>
>>>>          If your annotations are in GFF files, you don't really 
>>>> need to
>>>>          put them in a package: you can load them directly in GRanges
>>>>          objects with import() (defined in the rtracklayer package),
>>>>          or if you are particularly interested in the exon/transcript
>>>>          structure of your genes, you could use 
>>>> makeTranscriptDbFromGFF()
>>>>          to load a GFF file containing genes, mRNAs, exons, and 
>>>> CDSs into
>>>>          a TranscriptDb. See ?makeTranscriptDbFromGFF in the
>>>>          GenomicFeatures package for more information. Then it's 
>>>> easy to
>>>>          extract the locations of transcripts, exons, or CDSs as a 
>>>> GRanges
>>>>          or GRangesList object from this TranscriptDb object. See
>>>>          ?transcripts and ?exonsBy. Finally once you have the 
>>>> locations
>>>>          in a GRanges object, you can use getSeq() to extract the
>>>>          corresponding sequences from the BSgenome object.
>>>>
>>>>          Other more specialized functions are promoters() and
>>>>          getPromoterSeq() for extracting the promoter locations and
>>>>          their sequences, respectively. Also 
>>>> extractTranscriptsFromGenome()
>>>>          for extracting the full transcriptome from a BSgenome 
>>>> package.
>>>>
>>>> HTH,
>>>> H.
>>>>
>>>> PS: The upstream sequences that you see in some of our BSgenome data
>>>> packages are a relic of the past and will be removed soon.
>>>>
>>>>>
>>>>> Thanks much,
>>>>> --Kelly V.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>>>>> Sent: Tuesday, April 30, 2013 11:12 PM
>>>>> To: Kelly V [guest]
>>>>> Cc: bioconductor at r-project.org; Vining, Kelly; MEDIPS Maintainer
>>>>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field
>>>>>
>>>>> Hi Kelly,
>>>>>
>>>>> On 04/30/2013 03:52 PM, Kelly V [guest] wrote:
>>>>>>
>>>>>> I'm preparing a custom reference genome for use with the MEDIPS 
>>>>>> package. I see that one field of the seed file, which is 
>>>>>> apparently not optional, is the 'seqnames' field. The example 
>>>>>> given in the documentation is this:
>>>>>>
>>>>>> paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"),
>>>>>> "_random", sep="")), sep="")
>>>>>>
>>>>>> I have two simple questions about this.
>>>>>>
>>>>>> 1. Does R match this information with the source sequence file? 
>>>>>> For example, if I have a single fasta file with fasta headers 
>>>>>> chr_01...chr_20, must the seqnames entries exactly match those 
>>>>>> headers?
>>>>>
>>>>> No. You need to provide 1 FASTA file per single sequence, that is, 
>>>>> 1 file per name you put in the 'seqnames' field. That means that 
>>>>> each file is expected to contain only 1 sequence. What's in the 
>>>>> FASTA header of each file is not important. What's important is 
>>>>> that the name of each file be of the form 
>>>>> <prefix><seqname><suffix>, where <seqname> is the name of the 
>>>>> sequence as it appears in the 'seqnames' field, and <prefix> and 
>>>>> <suffix> are a prefix and a suffix (eventually empty) that are the 
>>>>> same for all the files.
>>>>>
>>>>> If what you have is a big FASTA file containing all the chromosome 
>>>>> sequences, then you first need to split it into 1 file per 
>>>>> chromosome.
>>>>> This is easy to do in R. For example, here is the script I used to 
>>>>> split the big bosTau6.fa file provided by UCSC for the bosTau6 
>>>>> genome:
>>>>>
>>>>>        library(Biostrings)
>>>>>        bosTau6 <- readDNAStringSet("bosTau6.fa")
>>>>>
>>>>>        ### Partitioning:
>>>>>        is_chrUn <- grepl("^chrUn", names(bosTau6))
>>>>>        is_chrom <- !is_chrUn
>>>>>
>>>>>        ### Send each chromosome to a FASTA file.
>>>>>        seqnames <- paste("chr", c(1:29, "X", "M"), sep="")
>>>>>        stopifnot(setequal(seqnames, names(bosTau6)[is_chrom]))
>>>>>
>>>>>        for (seqname in seqnames) {
>>>>>          seq <- bosTau6[match(seqname, names(bosTau6))]
>>>>>          filename <- paste(seqname, ".fa", sep="")
>>>>>          cat("writing ", filename, "\n", sep="")
>>>>>          writeXStringSet(seq, file=filename, width=50L)
>>>>>        }
>>>>>
>>>>>        ### Send the 3286 chrUn_* sequences to 1 FASTA file.
>>>>>        chrUn_mseq <- bosTau6[is_chrUn]
>>>>>        writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)
>>>>>
>>>>> The input is the bosTau6.fa file containing 3317 FASTA records:
>>>>> 31 records for the chromosomes, and 3286 for the chrUn_* sequences.
>>>>> The script produces 32 FASTA files: 1 per chromosome (chr1.fa, 
>>>>> chr2.fa, chr3.fa, ..., chr29.fa, chrX.fa, chrM.fa), and the 
>>>>> chrUn.fa file (containing the 3286 chrUn_* sequences).
>>>>>
>>>>> Note that you can find this script in the BSgenome package, and 
>>>>> display its source with:
>>>>>
>>>>>        > splitbigfasta_R <- system.file("extdata",
>>>>>                                         "GentlemanLab",
>>>>> "BSgenome.Btaurus.UCSC.bosTau6-tools",
>>>>>                                         "splitbigfasta.R",
>>>>> package="BSgenome")
>>>>>
>>>>>        > cat(readLines(splitbigfasta_R), sep="\n")
>>>>>
>>>>> It should not be too hard to adapt this script to your own needs.
>>>>>
>>>>>>
>>>>>> 2. Revealing the reason for my first question:In my genome fasta 
>>>>>> file, I have 1427 extrachromosomal scaffolds, but they are not 
>>>>>> all sequentially numbered, so that I have 
>>>>>> scaffold_1..scaffold_3681. Do I need to use a regular expression 
>>>>>> in my seqnames field to tell R to look for scaffold_ followed by 
>>>>>> 1-4 digits?
>>>>>
>>>>> No, not in the 'seqnames' field, because those 1427 
>>>>> extrachromosomal scaffolds should not go there. They would need to 
>>>>> go in the 'mseqnames'
>>>>> field.
>>>>>
>>>>> Unlike the 'seqnames' field where you enumerate objects that can
>>>>> only contain 1 sequence, in the 'mseqnames' field you can enumerate
>>>>> objects that contain more than 1 sequence. More precisely, in the
>>>>> final BSgenome data package, each entry in the 'seqnames' field will
>>>>> correspond to a DNAString object (the DNAString container can hold
>>>>> 1 sequence only), and each entry in the 'mseqnames' field will 
>>>>> correspond to a DNAStringSet object (the DNAStringSet container 
>>>>> can hold multiple sequences).
>>>>>
>>>>> So typically, all the extrachromosomal scaffolds would go in one 
>>>>> DNAStringSet object in the final BSgenome package (this is for 
>>>>> example what is done in the BSgenome.Drerio.UCSC.danRer7 package).
>>>>> To achieve this, you need to put 1 entry in the 'mseqnames' field 
>>>>> (e.g. "scaffolds"), and to put the 1427 extrachromosomal scaffolds 
>>>>> in one FASTA file named accordingly to that entry (e.g. 
>>>>> scaffolds.fa).
>>>>>
>>>>> In the above script, replace:
>>>>>
>>>>>        is_chrUn <- grepl("^chrUn", names(bosTau6))
>>>>>
>>>>> with:
>>>>>
>>>>>        is_scaffold <- grepl("^scaffold", names(<your_genome>))
>>>>>
>>>>> then every occurrence of 'is_chrUn' with 'is_scaffold', and 
>>>>> finally those 3 lines:
>>>>>
>>>>>        ### Send the 3286 chrUn_* sequences to 1 FASTA file.
>>>>>        chrUn_mseq <- bosTau6[is_chrUn]
>>>>>        writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)
>>>>>
>>>>> with:
>>>>>
>>>>>        ### Send the 1427 scaffold_* sequences to 1 FASTA file.
>>>>>        scaffolds_mseq <- <your_genome>[is_scaffold]
>>>>>        writeXStringSet(scaffolds_mseq, file="scaffolds.fa",
>>>>> width=50L)
>>>>>
>>>>> and that should take care of sending all the scaffold sequences to 
>>>>> the scaffolds.fa file.
>>>>>
>>>>> Let me know if you need further assistance with this.
>>>>>
>>>>> Cheers,
>>>>> H.
>>>>>
>>>>>>
>>>>>> Thanks for any help,
>>>>>> --Kelly V.
>>>>>>
>>>>>>       -- output of sessionInfo():
>>>>>>
>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_COLLATE=English_United States.1252 [2]
>>>>>> LC_CTYPE=English_United
>>>>>> States.1252 [3] LC_MONETARY=English_United States.1252 [4]
>>>>>> LC_NUMERIC=C [5] LC_TIME=English_United States.1252
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets methods   base
>>>>>>
>>>>>> -- 
>>>>>> Sent via the guest posting facility at bioconductor.org.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>> -- 
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>>>
>>>>
>>>> -- 
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>>>
>>>
>>> -- 
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>
>> -- 
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>