[Bioc-devel] making txdb, and propagating metadata from AnnotationHub to GenomicFeatures

Hervé Pagès hpages at fredhutch.org
Wed Jun 17 21:43:31 CEST 2015


Hi Michael,

On 06/17/2015 12:35 AM, Michael Love wrote:
> Background:
>
> With previous approaches that I would recommend to users for building
> txdb along the way of making count tables, it was desirable that the
> GTF release information would *automatically* be passed into the
> metadata of the rowRanges of the SummarizedExperiment.
>
> for example, in parathyroidSE, using makeTxDbFromBiomart:
>
> ...
> BioMart database version : chr "ENSEMBL GENES 72 (SANGER UK)"
> BioMart dataset : chr "hsapiens_gene_ensembl"
> BioMart dataset description : chr "Homo sapiens genes (GRCh37.p11)"
>
> or, alternatively, using makeTxDbFromGFF, it would be present in the
> name of the GTF file:
>
> ...
> Data source: chr "Homo_sapiens.GRCh37.75.gtf"
>
>
> Question:
>
> I'm now interested in switching over to using GTF files available
> through AnnotationHub, and wondering how we can maintain this
> automatic propagation of metadata.
>
> here's some example code:
>
> library(GenomicFeatures)
> library(AnnotationHub)
> ah <- AnnotationHub()
> z <- query(ah, c("Ensembl","gtf","Caenorhabditis elegans","release-80"))
> stopifnot(length(z) == 1)
> z$title
>
>   [1] "Caenorhabditis_elegans.WBcel235.80.gtf"
>
> gtf <- ah[[names(z)]]
> metadata(gtf)
>
>   $AnnotationHubName
>   [1] "AH47045"

A first (and hopefully easy) improvement would be that the GRanges and
other objects on AnnotationHub had more information in their metadata().

In the mean time, here is a workaround (in 2 steps):

1) Add some useful metadata to 'gtf' (grabbed from the AnnotationHub
    object):

   addMetadataFromAnnotationHub <- function(x, ah)
   {
     stopifnot(length(ah) == 1L)
     metadata0 <- list(
         `Data source`=ah$sourceurl,
         `Provider`=ah$dataprovider,
         `Organism`=ah$species,
         `Taxonomy ID`=ah$taxonomyid
     )
     metadata(x) <- c(metadata0, metadata(x))
     x
   }

   gtf <- addMetadataFromAnnotationHub(gtf, z)

   metadata(gtf)
   # $`Data source`
   # [1] 
"ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz"
   #
   # $Provider
   # [1] "Ensembl"
   #
   # $Organism
   # [1] "Caenorhabditis elegans"
   #
   # $`Taxonomy ID`
   # [1] 6239
   #
   # $AnnotationHubName
   # [1] "AH47045"

2) Pass the metadata to makeTxDbFromGRanges():

   txdb <- makeTxDbFromGRanges(gtf,
               metadata=data.frame(
                   name=names(metadata(gtf)),
                   value=as.character(metadata(gtf))))
   txdb
   # TxDb object:
   # Db type: TxDb
   # Supporting package: GenomicFeatures
   # Data source: 
ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz
   # Provider: Ensembl
   # Organism: Caenorhabditis elegans
   # Taxonomy ID: 6239
   # AnnotationHubName: AH47045
   # Genome: WBcel235
   # transcript_nrow: 57834
   # exon_nrow: 173506
   # cds_nrow: 131562
   # Db created by: GenomicFeatures package from Bioconductor
   # Creation time: 2015-06-17 12:24:51 -0700 (Wed, 17 Jun 2015)
   # GenomicFeatures version at creation time: 1.21.13
   # RSQLite version at creation time: 1.0.0
   # DBSCHEMAVERSION: 1.1

   organism(txdb)
   # [1] "Caenorhabditis elegans"
   taxonomyId(txdb)
   # [1] 6239

Step 2) should be made easier because the metadata is already in 'gtf'
so there is no reason why the user would need to pass it again thru the
'metadata' argument. I'll made that change to makeTxDbFromGRanges().

H.

>
> txdb <- makeTxDbFromGRanges(gtf)
> metadata(txdb)
>
>                                          name
>   1                                   Db type
>   2                        Supporting package
>   3                                    Genome
>   4                           transcript_nrow
>   5                                 exon_nrow
>   6                                  cds_nrow
>   7                             Db created by
>   8                             Creation time
>   9  GenomicFeatures version at creation time
>   10         RSQLite version at creation time
>   11                          DBSCHEMAVERSION
>                                             value
>   1                                          TxDb
>   2                               GenomicFeatures
>   3                                      WBcel235
>   4                                         57834
>   5                                        173506
>   6                                        131562
>   7     GenomicFeatures package from Bioconductor
>   8  2015-06-17 09:31:10 +0200 (Wed, 17 Jun 2015)
>   9                                       1.21.13
>   10                                        1.0.0
>   11                                          1.1
>
>
> sessionInfo()
>
>   R Under development (unstable) (2015-04-29 r68278)
>   Platform: x86_64-apple-darwin12.5.0 (64-bit)
>   Running under: OS X 10.10.3 (Yosemite)
>
>   locale:
>   [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
>   attached base packages:
>   [1] stats4    parallel  stats     graphics  grDevices datasets  utils
>   [8] methods   base
>
>   other attached packages:
>    [1] AnnotationHub_2.1.26    GenomicFeatures_1.21.13 AnnotationDbi_1.31.16
>    [4] Biobase_2.29.1          GenomicRanges_1.21.15   GenomeInfoDb_1.5.7
>    [7] IRanges_2.3.11          S4Vectors_0.7.5         BiocGenerics_0.15.2
>   [10] testthat_0.10.0         knitr_1.10              BiocInstaller_1.19.6
>
>   loaded via a namespace (and not attached):
>    [1] Rcpp_0.11.6                  compiler_3.3.0
>    [3] futile.logger_1.4.1          XVector_0.9.1
>    [5] bitops_1.0-6                 futile.options_1.0.0
>    [7] tools_3.3.0                  zlibbioc_1.15.0
>    [9] biomaRt_2.25.1               digest_0.6.8
>   [11] RSQLite_1.0.0                memoise_0.2.1
>   [13] shiny_0.12.1                 DBI_0.3.1
>   [15] rtracklayer_1.29.10          httr_0.6.1
>   [17] stringr_1.0.0                Biostrings_2.37.2
>   [19] R6_2.0.1                     XML_3.98-1.2
>   [21] BiocParallel_1.3.26          lambda.r_1.1.7
>   [23] magrittr_1.5                 Rsamtools_1.21.8
>   [25] htmltools_0.2.6              GenomicAlignments_1.5.9
>   [27] SummarizedExperiment_0.1.5   xtable_1.7-4
>   [29] mime_0.3                     interactiveDisplayBase_1.7.0
>   [31] httpuv_1.3.2                 stringi_0.4-1
>   [33] RCurl_1.95-4.6               crayon_1.3.0
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list