[Bioc-devel] makeTranscriptDbFrom... AnnotationHub

Tue Jul 8 21:11:09 CEST 2014

The recent TranscriptDb thread reminded me of a question: is there a
function I am missing, or are there plans for one, to easily build a
TranscriptDb from an AnnotationHub object such as the Ensembl GTF shown
below? It would be great to have a preprocessed Ensembl txdb like the
ones we have for UCSC.

> library(AnnotationHub)
> ah <- AnnotationHub()
> gr <- ah$ensembl.release.73.gtf.homo_sapiens.Homo_sapiens.GRCh37.73.gtf_0.0.1.RData
> gr
GRanges with 2268089 ranges and 12 metadata columns:
            seqnames         ranges strand   |                 source
               <Rle>      <IRanges>  <Rle>   |               <factor>
        [1]        1 [11869, 12227]      +   |   processed_transcript
        [2]        1 [12613, 12721]      +   |   processed_transcript
        [3]        1 [13221, 14409]      +   |   processed_transcript
        [4]        1 [11872, 12227]      +   | unprocessed_pseudogene
        [5]        1 [12613, 12721]      +   | unprocessed_pseudogene
        ...      ...            ...    ... ...                    ...
  [2268085]       MT [14747, 15887]      +   |         protein_coding
  [2268086]       MT [14747, 15887]      +   |         protein_coding
  [2268087]       MT [14747, 14749]      +   |         protein_coding
  [2268088]       MT [15888, 15953]      +   |                Mt_tRNA
  [2268089]       MT [15956, 16023]      -   |                Mt_tRNA
                   type     score     phase         gene_id   transcript_id
               <factor> <numeric> <integer>     <character>     <character>
        [1]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
        [2]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
        [3]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
        [4]        exon      <NA>      <NA> ENSG00000223972 ENST00000515242
        [5]        exon      <NA>      <NA> ENSG00000223972 ENST00000515242
        ...         ...       ...       ...             ...             ...
  [2268085]        exon      <NA>      <NA> ENSG00000198727 ENST00000361789
  [2268086]         CDS      <NA>         0 ENSG00000198727 ENST00000361789
  [2268087] start_codon      <NA>         0 ENSG00000198727 ENST00000361789
  [2268088]        exon      <NA>      <NA> ENSG00000210195 ENST00000387460
  [2268089]        exon      <NA>      <NA> ENSG00000210196 ENST00000387461
            exon_number   gene_name   gene_biotype transcript_name
              <numeric> <character>    <character>     <character>
        [1]           1     DDX11L1     pseudogene     DDX11L1-002
        [2]           2     DDX11L1     pseudogene     DDX11L1-002
        [3]           3     DDX11L1     pseudogene     DDX11L1-002
        [4]           1     DDX11L1     pseudogene     DDX11L1-201
        [5]           2     DDX11L1     pseudogene     DDX11L1-201
        ...         ...         ...            ...             ...
  [2268085]           1      MT-CYB protein_coding      MT-CYB-201
  [2268086]           1      MT-CYB protein_coding      MT-CYB-201
  [2268087]           1      MT-CYB protein_coding      MT-CYB-201
  [2268088]           1       MT-TT        Mt_tRNA       MT-TT-201
  [2268089]           1       MT-TP        Mt_tRNA       MT-TP-201
                    exon_id      protein_id
                <character>     <character>
        [1] ENSE00002234944            <NA>
        [2] ENSE00003582793            <NA>
        [3] ENSE00002312635            <NA>
        [4] ENSE00002234632            <NA>
        [5] ENSE00003608237            <NA>
        ...             ...             ...
  [2268085] ENSE00001436074            <NA>
  [2268086]            <NA> ENSP00000354554
  [2268087]            <NA>            <NA>
  [2268088] ENSE00001544475            <NA>
  [2268089] ENSE00001544473            <NA>
  ---
  seqlengths:
                     1                   2 ...                  MT
                    NA                  NA ...                  NA