[BioC] Working with ensembl 73 & BSGenome

Thu Oct 31 09:33:07 CET 2013

On 10/30/2013 11:54 PM, Hervé Pagès wrote:
> Hi Tim,
>
> On 10/30/2013 09:39 PM, Timothy Johnstone [guest] wrote:
>> Hi,
>> I'm working with the latest annotation set from Ensembl (ens73) which is based
>> on the patched GRCh37.p12 assembly. I have retrieved the transcript set from
>> Ensembl biomart using GenomicFeatures:makeTranscriptDbFromBiomart().
>>
>> One of the things I'd like to do is create a DNAStringSet of sequences for all
>> the transcripts in my transcriptDB using the
>> GenomicFeatures:extractTranscriptsFromGenome() function. This takes a TDB and
>> a BSGenomes object as input. However, the latest BSGenomes available for the
>> human is UCSC.hg19, which is unpatched. When I run the command, I get the error:
>> Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i],  :
>>    sequence ^1$ not found
>>
>> I'm pretty sure this is because the transcriptDB contains sequences
>> (patches/scaffolds) that are present in the patched assembly but not the base
>> GRCh37 assembly. Additionally the nomenclature is different between UCSC and
>> Ensembl (e.g. chr1 ; 1).
>
> Based on the GenBank ID (or RefSeq ID) of the 24 chromosomes listed
> here:
>
>    http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.1/
>    http://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.13/
>
> the GRCh37 primary assembly (i.e. the exact sequence of the 24
> chromosomes) has not changed between GRCh37 and GRCh37.p12.
>
> Also according to the note here:
>
>    http://genome.ucsc.edu/cgi-bin/hgGateway?clade=mammal&org=Human&db=hg19
>
> chrM is not the same in GRCh37 as in the UCSC hg19 assembly.
>
> So if you want to use Ensembl 73 (based on GRCh37.p12) to extract the
> transcript sequences from a BSgenome package, and if restricting this
> extraction to the 24 chromosomes is OK with you, then you can use the
> BSgenome.Hsapiens.UCSC.hg19 package. But since Ensembl and UCSC use
> different chromosome naming conventions, you will have to adjust the
> chromosome names. Here is how to proceed:
>
>    library(GenomicFeatures)
>    txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
> dataset="hsapiens_gene_ensembl")
>    ex_by_tx <- exonsBy(txdb, by="tx")
>    seqlevels(ex_by_tx, force=TRUE) <- c(1:22, "X", "Y")  # restrict
>    seqlevels(ex_by_tx) <- paste0("chr", seqlevels(ex_by_tx))  # rename
>
>    library(BSgenome.Hsapiens.UCSC.hg19)
>    tx_seqs <- extractTranscriptsFromGenome(Hsapiens, ex_by_tx)  # extract

Herve -- is there a way to use the fasta file from Ensembl? Download, compress 
and index it using Rsamtools::razip and Rsamtools::indexFa. The data source is

   http://www.ensembl.org/info/data/ftp/index.html

Then use FaFile to reference the file and getSeq,FaFile-method to get sequences.

   library(Rsamtools)
   fa = FaFile("/path/to/GRCh37.72.dna.toplevel.fa.rz")

and then I think

   library(GenomicFeatures)
   dnaList = lapply(seqlevels(ex_by_tx), function(lvl) {
        ## getSeq on the FaFile
        ## extractTranscripts with seq and ex_by_tx
   }
   ## munge to dnaList to DNAStringSet

>
> If you want to extract all the transcripts reported in Ensembl 73 (i.e.
> not only the 195381 transcripts located on the 24 chromosomes but also
> the 18913 transcripts located on the alt loci and unplaced contigs),
> then you'll need to forge a BSGenome data package for GRCh37.p12. Look
> at the BSgenomeForge vignette in the BSgenome software package for how
> to do this. It looks like one extra difficulty with GRCh37.p12 (and
> previous GRCh37 patches) is that there doesn't seem to be an easy way
> to download the complete assembly from NCBI as FASTA file(s). Last
> time I checked (was around GRCh37.p7), it seemed that only the diff
> information was available in some kind of specialized file format so
> my impression was that one had to download the diff and then use it
> to "patch" the GRCh37 sequences in order to produce the GRCh37.p12
> FASTA sequences. All this didn't seem completely straightforward to me
> but I didn't really spend enough time to know.
>
> Ideally, we should improve the BSgenome infrastructure so that
> we can also follow that diff model i.e. it would be great if we were

Or use AnnotationHub to retrieve the fasta and gtf files and smooth out the 
approach above? AnnotationHub is lagging a little as we try to improve the 
behind-the-scenes infrastructure, but the indexed FaFile (and gtf) for release 
72 is available with

   library(AnnotationHub)
   hub = AnnotationHub()
   fa = 
hub$ensembl.release.72.fasta.homo_sapiens.dna.Homo_sapiens.GRCh37.72.dna.toplevel.fa.rz
   gtf =
hub$ensembl.release.72.gtf.homo_sapiens.Homo_sapiens.GRCh37.72.gtf_0.0.1.RData

Martin

> able to forge light-weight BSgenome data packages that only contain
> the diff against another BSgenome data package. For example
> BSgenome.Hsapiens.GRCh37.p12 could be forged as a light-weight
> package that only contains the diff against (and depends on)
> BSgenome.Hsapiens.GRCh37, which would be a heavy-weight package
> (i.e. it contains the full sequences). From a user point of view
> there would be no difference between light-weight and heavy-weight,
> that is, both would contain a BSgenome object that looks like it
> contains a full assembly. But from a data management point of view
> that would make a big difference as we would then be able to keep
> up with the pace of GRCh37 patches by providing one BSgenome data
> package for each of them at a low cost.
>
> Cheers,
> H.
>
>
>>
>> I see a few options here. One obvious one would be to stick with UCSC hg19 and
>> use the UCSC ensGene table, but others in my working group are using ens73 so
>> this is a suboptimal solution. Is there an updated BSGenome available for
>> GRCh37.p12, or an easy way to forge one? Have others encountered this issue?
>>
>> Thanks!
>> Tim Johnstone
>>
>>   -- output of sessionInfo():
>>
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>>   [1] grDevices datasets  splines   tcltk     utils     parallel  stats
>> graphics  methods
>> [10] base
>>
>> other attached packages:
>>   [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19  BiocInstaller_1.12.0
>>   [3] data.table_1.8.10                   Hmisc_3.12-2
>>   [5] Formula_1.1-1                       survival_2.37-4
>>   [7] plyr_1.8                            gdata_2.13.2
>>   [9] ShortRead_1.20.0                    lattice_0.20-24
>> [11] rtracklayer_1.22.0                  Rsamtools_1.14.1
>> [13] BSgenome.Drerio.UCSC.danRer7_1.3.17 BSgenome_1.30.0
>> [15] Biostrings_2.30.0                   lessR_2.9.7
>> [17] GenomicFeatures_1.14.0              AnnotationDbi_1.24.0
>> [19] Biobase_2.22.0                      GenomicRanges_1.14.3
>> [21] XVector_0.2.0                       IRanges_1.20.4
>> [23] BiocGenerics_0.8.0
>>
>> loaded via a namespace (and not attached):
>>   [1] biomaRt_2.18.0      bitops_1.0-6        car_2.0-19          cluster_1.14.4
>>   [5] DBI_0.2-7           foreign_0.8-57      grid_3.0.2          gtools_3.1.0
>>   [9] hwriter_1.3         latticeExtra_0.6-26 leaps_2.9           MASS_7.3-29
>> [13] MBESS_3.3.3         nnet_7.3-7          RColorBrewer_1.0-5  RCurl_1.95-4.1
>> [17] rpart_4.1-3         RSQLite_0.11.4      stats4_3.0.2        tools_3.0.2
>> [21] XML_3.95-0.2        zlibbioc_1.8.0
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793