[BioC] BioStrings for current R: RE: BSgenomeForge seed file - seqnames field
Hervé Pagès
hpages at fhcrc.org
Mon Dec 9 23:09:09 CET 2013
Hi Kelly,
On 12/06/2013 12:23 PM, Vining, Kelly wrote:
> Hi Herve,
> It has been a while since I have worked with Biostrings and the MEDIPS package. When I went to install Biostrings just now, got an error message that Biostrings isn't available:
>
>> install.packages("Biostrings")
> Installing package into â/raid1/home/pi/viningk/Râ
> (as âlibâ is unspecified)
> --- Please select a CRAN mirror for use in this session ---
Here you are asked to choose a CRAN mirror but Biostrings is not a CRAN
package. This suggests that you might be on the wrong path.
> Warning message:
> package âBiostringsâ is not available (for R version 3.0.0)
>
> Revisiting the online documentation, it does look like the package is still available.
> Is there an available update, or a different alternative I should be using?
The only proper way to install a Bioconductor package is to use
biocLite(), as documented here:
http://bioconductor.org/install/
Unlike install.packages(), biocLite() "knows" where to find Bioconductor
packages. Also make sure you're using the latest BioC release (BioC
2.13).
Cheers,
H.
>
> Thanks,
> --Kelly Vining
> ________________________________________
> From: Hervé Pagès [hpages at fhcrc.org]
> Sent: Thursday, May 02, 2013 10:49 AM
> To: Vining, Kelly
> Cc: Kelly V [guest]; bioconductor at r-project.org; MEDIPS Maintainer
> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field
>
> Hi Kelly,
>
> On 05/02/2013 06:44 AM, Vining, Kelly wrote:
>> Hi Herve,
>> Thanks for your helpful response, and for pointing me to the proper scripts. Given this, it appears that I should treat my gene feature annotation files (promoters, genes) as mseqnames objects, similarly to how I will treat the extrachromosomal scaffolds. Is there any way to include additional features that I may find of interest after I forge this initial package (say, using gff files), or am I limited such that I would have to re-forge a new reference?
>
> A BSgenome data package is for storing the sequences of a given
> genome/assembly. No annotations should go there.
>
> In Bioconductor we try to maintain a clear separation between
> packages that contain the sequences of a genome/assembly (BSgenome
> data packages), and packages that contain annotations for that
> genome/assembly (TxDb, FDb, OrganismDb, SNPlocs, etc... packages).
>
> There are at least 2 reasons for this:
>
> 1. For a given assembly there are many kinds of annotations available
> in many places on the internet, and some of them tend to be updated
> frequently. By contrast, the sequences of a given assembly
> never change. So putting only the reference sequences in a
> BSgenome package make it very stable. It almost never needs to
> be updated, except for updating a man page or when something
> changes in the way sequences are stored in the package. This is
> good with such big packages (800 Mb for the biggest ones).
>
> 2. Storing the DNA sequences of the genes, promoters, or other genomic
> feature in a BSgenome package would duplicate a lot of DNA
> sequences that are already in the reference sequences (those
> small sequences are substrings of the reference sequences).
> This would make the BSgenome package unnecessarily bigger.
> It's better to store only the locations of those genomic features,
> not in the BSgenome package, but in a separate package, and then
> to use those locations to extract the corresponding sequences from
> the BSgenome object. This extraction can be done with getSeq()
> (defined in the BSgenome software package) and is relatively
> cheap. This approach gives you a lot more flexibility than having
> those sequences pre-extracted and bundled with the BSgenome data
> package.
>
> If your annotations are in GFF files, you don't really need to
> put them in a package: you can load them directly in GRanges
> objects with import() (defined in the rtracklayer package),
> or if you are particularly interested in the exon/transcript
> structure of your genes, you could use makeTranscriptDbFromGFF()
> to load a GFF file containing genes, mRNAs, exons, and CDSs into
> a TranscriptDb. See ?makeTranscriptDbFromGFF in the
> GenomicFeatures package for more information. Then it's easy to
> extract the locations of transcripts, exons, or CDSs as a GRanges
> or GRangesList object from this TranscriptDb object. See
> ?transcripts and ?exonsBy. Finally once you have the locations
> in a GRanges object, you can use getSeq() to extract the
> corresponding sequences from the BSgenome object.
>
> Other more specialized functions are promoters() and
> getPromoterSeq() for extracting the promoter locations and
> their sequences, respectively. Also extractTranscriptsFromGenome()
> for extracting the full transcriptome from a BSgenome package.
>
> HTH,
> H.
>
> PS: The upstream sequences that you see in some of our BSgenome data
> packages are a relic of the past and will be removed soon.
>
>>
>> Thanks much,
>> --Kelly V.
>>
>> -----Original Message-----
>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>> Sent: Tuesday, April 30, 2013 11:12 PM
>> To: Kelly V [guest]
>> Cc: bioconductor at r-project.org; Vining, Kelly; MEDIPS Maintainer
>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field
>>
>> Hi Kelly,
>>
>> On 04/30/2013 03:52 PM, Kelly V [guest] wrote:
>>>
>>> I'm preparing a custom reference genome for use with the MEDIPS package. I see that one field of the seed file, which is apparently not optional, is the 'seqnames' field. The example given in the documentation is this:
>>>
>>> paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"),
>>> "_random", sep="")), sep="")
>>>
>>> I have two simple questions about this.
>>>
>>> 1. Does R match this information with the source sequence file? For example, if I have a single fasta file with fasta headers chr_01...chr_20, must the seqnames entries exactly match those headers?
>>
>> No. You need to provide 1 FASTA file per single sequence, that is, 1 file per name you put in the 'seqnames' field. That means that each file is expected to contain only 1 sequence. What's in the FASTA header of each file is not important. What's important is that the name of each file be of the form <prefix><seqname><suffix>, where <seqname> is the name of the sequence as it appears in the 'seqnames' field, and <prefix> and <suffix> are a prefix and a suffix (eventually empty) that are the same for all the files.
>>
>> If what you have is a big FASTA file containing all the chromosome sequences, then you first need to split it into 1 file per chromosome.
>> This is easy to do in R. For example, here is the script I used to split the big bosTau6.fa file provided by UCSC for the bosTau6 genome:
>>
>> library(Biostrings)
>> bosTau6 <- readDNAStringSet("bosTau6.fa")
>>
>> ### Partitioning:
>> is_chrUn <- grepl("^chrUn", names(bosTau6))
>> is_chrom <- !is_chrUn
>>
>> ### Send each chromosome to a FASTA file.
>> seqnames <- paste("chr", c(1:29, "X", "M"), sep="")
>> stopifnot(setequal(seqnames, names(bosTau6)[is_chrom]))
>>
>> for (seqname in seqnames) {
>> seq <- bosTau6[match(seqname, names(bosTau6))]
>> filename <- paste(seqname, ".fa", sep="")
>> cat("writing ", filename, "\n", sep="")
>> writeXStringSet(seq, file=filename, width=50L)
>> }
>>
>> ### Send the 3286 chrUn_* sequences to 1 FASTA file.
>> chrUn_mseq <- bosTau6[is_chrUn]
>> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)
>>
>> The input is the bosTau6.fa file containing 3317 FASTA records:
>> 31 records for the chromosomes, and 3286 for the chrUn_* sequences.
>> The script produces 32 FASTA files: 1 per chromosome (chr1.fa, chr2.fa, chr3.fa, ..., chr29.fa, chrX.fa, chrM.fa), and the chrUn.fa file (containing the 3286 chrUn_* sequences).
>>
>> Note that you can find this script in the BSgenome package, and display its source with:
>>
>> > splitbigfasta_R <- system.file("extdata",
>> "GentlemanLab",
>> "BSgenome.Btaurus.UCSC.bosTau6-tools",
>> "splitbigfasta.R",
>> package="BSgenome")
>>
>> > cat(readLines(splitbigfasta_R), sep="\n")
>>
>> It should not be too hard to adapt this script to your own needs.
>>
>>>
>>> 2. Revealing the reason for my first question:In my genome fasta file, I have 1427 extrachromosomal scaffolds, but they are not all sequentially numbered, so that I have scaffold_1..scaffold_3681. Do I need to use a regular expression in my seqnames field to tell R to look for scaffold_ followed by 1-4 digits?
>>
>> No, not in the 'seqnames' field, because those 1427 extrachromosomal scaffolds should not go there. They would need to go in the 'mseqnames'
>> field.
>>
>> Unlike the 'seqnames' field where you enumerate objects that can only contain 1 sequence, in the 'mseqnames' field you can enumerate objects that contain more than 1 sequence. More precisely, in the final BSgenome data package, each entry in the 'seqnames' field will correspond to a DNAString object (the DNAString container can hold
>> 1 sequence only), and each entry in the 'mseqnames' field will correspond to a DNAStringSet object (the DNAStringSet container can hold multiple sequences).
>>
>> So typically, all the extrachromosomal scaffolds would go in one DNAStringSet object in the final BSgenome package (this is for example what is done in the BSgenome.Drerio.UCSC.danRer7 package).
>> To achieve this, you need to put 1 entry in the 'mseqnames' field (e.g. "scaffolds"), and to put the 1427 extrachromosomal scaffolds in one FASTA file named accordingly to that entry (e.g. scaffolds.fa).
>>
>> In the above script, replace:
>>
>> is_chrUn <- grepl("^chrUn", names(bosTau6))
>>
>> with:
>>
>> is_scaffold <- grepl("^scaffold", names(<your_genome>))
>>
>> then every occurrence of 'is_chrUn' with 'is_scaffold', and finally those 3 lines:
>>
>> ### Send the 3286 chrUn_* sequences to 1 FASTA file.
>> chrUn_mseq <- bosTau6[is_chrUn]
>> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)
>>
>> with:
>>
>> ### Send the 1427 scaffold_* sequences to 1 FASTA file.
>> scaffolds_mseq <- <your_genome>[is_scaffold]
>> writeXStringSet(scaffolds_mseq, file="scaffolds.fa", width=50L)
>>
>> and that should take care of sending all the scaffold sequences to the scaffolds.fa file.
>>
>> Let me know if you need further assistance with this.
>>
>> Cheers,
>> H.
>>
>>>
>>> Thanks for any help,
>>> --Kelly V.
>>>
>>> -- output of sessionInfo():
>>>
>>> R version 3.0.0 (2013-04-03)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United
>>> States.1252 [3] LC_MONETARY=English_United States.1252 [4]
>>> LC_NUMERIC=C [5] LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list