[Bioc-devel] makeTranscriptDbFrom... AnnotationHub

Tue Jul 8 21:45:48 CEST 2014

Hi Michael,

On 07/08/2014 12:11 PM, Michael Love wrote:
> The recent TranscriptDb thread reminded me of a question: are there
> plans (or am I missing the function) to easily get a TranscriptDb out
> of the AnnotationHub objects? It would be great to have a preprocessed
> Ensembl txdb like we have for UCSC.

I think the first thing we should do is add a
makeTranscriptDbFromGRanges() function. It should not be too hard
because we already have the code :) Marc wrote it. But it's currently
part of the makeTranscriptDbFromGFF() function. Roughly speaking this
function does 2 things: (1) import the GFF or GTF file as a GRanges
object, then (2) turn that GRanges object into a TranscriptDb object.
So we should move the code that does (2) into a separate function,
the makeTranscriptDbFromGRanges() function, and have
makeTranscriptDbFromGFF() call it internally.
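
A minimal sketch of that split, for illustration only. The wrapper name
makeTranscriptDbFromGFF_sketch() and its importFun/fromGRangesFun
arguments are hypothetical; the two steps are passed in as functions
because makeTranscriptDbFromGRanges() does not exist yet. In practice
step (1) could be something like rtracklayer's import() and step (2)
the code that currently lives inside makeTranscriptDbFromGFF().

```r
# Hypothetical wrapper showing the proposed two-step structure.
# importFun:      function(file) returning a GRanges-like object
# fromGRangesFun: function(gr, ...) returning the TranscriptDb
makeTranscriptDbFromGFF_sketch <- function(file, importFun, fromGRangesFun, ...)
{
    gr <- importFun(file)      # (1) GFF/GTF file -> GRanges
    fromGRangesFun(gr, ...)    # (2) GRanges -> TranscriptDb
}
```

For a GRanges that is already in memory, only step (2) would be needed,
so the function passed as fromGRangesFun could be called on it directly.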

Then you could call makeTranscriptDbFromGRanges() on any of these
GFF- or GTF-based GRanges objects you get from AnnotationHub.

We'll work on this soon and announce here when it becomes available.

Cheers,
H.

>
>> ah <- AnnotationHub()
>> gr <- ah$ensembl.release.73.gtf.homo_sapiens.Homo_sapiens.GRCh37.73.gtf_0.0.1.RData
>> gr
> GRanges with 2268089 ranges and 12 metadata columns:
>              seqnames         ranges strand   |                 source
>                 <Rle>      <IRanges>  <Rle>   |               <factor>
>          [1]        1 [11869, 12227]      +   |   processed_transcript
>          [2]        1 [12613, 12721]      +   |   processed_transcript
>          [3]        1 [13221, 14409]      +   |   processed_transcript
>          [4]        1 [11872, 12227]      +   | unprocessed_pseudogene
>          [5]        1 [12613, 12721]      +   | unprocessed_pseudogene
>          ...      ...            ...    ... ...                    ...
>    [2268085]       MT [14747, 15887]      +   |         protein_coding
>    [2268086]       MT [14747, 15887]      +   |         protein_coding
>    [2268087]       MT [14747, 14749]      +   |         protein_coding
>    [2268088]       MT [15888, 15953]      +   |                Mt_tRNA
>    [2268089]       MT [15956, 16023]      -   |                Mt_tRNA
>                     type     score     phase         gene_id   transcript_id
>                 <factor> <numeric> <integer>     <character>     <character>
>          [1]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
>          [2]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
>          [3]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
>          [4]        exon      <NA>      <NA> ENSG00000223972 ENST00000515242
>          [5]        exon      <NA>      <NA> ENSG00000223972 ENST00000515242
>          ...         ...       ...       ...             ...             ...
>    [2268085]        exon      <NA>      <NA> ENSG00000198727 ENST00000361789
>    [2268086]         CDS      <NA>         0 ENSG00000198727 ENST00000361789
>    [2268087] start_codon      <NA>         0 ENSG00000198727 ENST00000361789
>    [2268088]        exon      <NA>      <NA> ENSG00000210195 ENST00000387460
>    [2268089]        exon      <NA>      <NA> ENSG00000210196 ENST00000387461
>              exon_number   gene_name   gene_biotype transcript_name
>                <numeric> <character>    <character>     <character>
>          [1]           1     DDX11L1     pseudogene     DDX11L1-002
>          [2]           2     DDX11L1     pseudogene     DDX11L1-002
>          [3]           3     DDX11L1     pseudogene     DDX11L1-002
>          [4]           1     DDX11L1     pseudogene     DDX11L1-201
>          [5]           2     DDX11L1     pseudogene     DDX11L1-201
>          ...         ...         ...            ...             ...
>    [2268085]           1      MT-CYB protein_coding      MT-CYB-201
>    [2268086]           1      MT-CYB protein_coding      MT-CYB-201
>    [2268087]           1      MT-CYB protein_coding      MT-CYB-201
>    [2268088]           1       MT-TT        Mt_tRNA       MT-TT-201
>    [2268089]           1       MT-TP        Mt_tRNA       MT-TP-201
>                      exon_id      protein_id
>                  <character>     <character>
>          [1] ENSE00002234944            <NA>
>          [2] ENSE00003582793            <NA>
>          [3] ENSE00002312635            <NA>
>          [4] ENSE00002234632            <NA>
>          [5] ENSE00003608237            <NA>
>          ...             ...             ...
>    [2268085] ENSE00001436074            <NA>
>    [2268086]            <NA> ENSP00000354554
>    [2268087]            <NA>            <NA>
>    [2268088] ENSE00001544475            <NA>
>    [2268089] ENSE00001544473            <NA>
>    ---
>    seqlengths:
>                       1                   2 ...                  MT
>                      NA                  NA ...                  NA

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319