[Bioc-devel] makeTranscriptDbFrom... AnnotationHub
Hervé Pagès
hpages at fhcrc.org
Tue Jul 8 21:45:48 CEST 2014
Hi Michael,
On 07/08/2014 12:11 PM, Michael Love wrote:
> The recent TranscriptDb thread reminded me of a question: are there
> plans (or am I missing the function) to easily get a TranscriptDb out
> of the AnnotationHub objects? It would be great to have a preprocessed
> Ensembl txdb like we have for UCSC.
I think the 1st thing we should do is have a
makeTranscriptDbFromGRanges() function. It should not be too hard
because we already have the code :) Marc wrote it. But it's currently
part of the makeTranscriptDbFromGFF() function. Roughly speaking this
function does 2 things: (1) import the GFF or GTF file as a GRanges
object, then (2) turn that GRanges object into a TranscriptDb object.
So we should move the code that does (2) into a separate function,
the makeTranscriptDbFromGRanges() function, and have
makeTranscriptDbFromGFF() call it internally.
Then you could call makeTranscriptDbFromGRanges() on any of these
GFF- or GTF-based GRanges objects you get from AnnotationHub.
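The split described above could look roughly like the following R sketch. It is an illustration only: makeTranscriptDbFromGRanges() does not exist yet, so its body here is a placeholder, and the wrapper name gffToTranscriptDb() and its arguments are hypothetical, chosen so they do not clash with the real makeTranscriptDbFromGFF(). The only real call is rtracklayer::import(), which reads a GFF or GTF file into a GRanges object.

```r
## Step (2), hypothetical: turn a GRanges object into a TranscriptDb.
## Placeholder body only; the real conversion code is the part that
## currently lives inside makeTranscriptDbFromGFF().
makeTranscriptDbFromGRanges <- function(gr, ...) {
    if (!methods::is(gr, "GRanges"))
        stop("'gr' must be a GRanges object")
    stop("placeholder: conversion code not moved here yet")
}

## Steps (1) + (2), hypothetical wrapper: import the file, then delegate.
gffToTranscriptDb <- function(file, format = c("gff3", "gtf"), ...) {
    format <- match.arg(format)
    ## Step (1): rtracklayer::import() returns a GRanges for GFF/GTF input.
    gr <- rtracklayer::import(file, format = format)
    ## Step (2): hand the GRanges to the standalone converter.
    makeTranscriptDbFromGRanges(gr, ...)
}
```

With that layout, a GRanges object that is already in memory, such as the gr from AnnotationHub in the quoted example below, could be passed straight to makeTranscriptDbFromGRanges() without going through a file.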
We'll work on this soon and announce here when it becomes available.
Cheers,
H.
>
>> ah <- AnnotationHub()
>> gr <- ah$ensembl.release.73.gtf.homo_sapiens.Homo_sapiens.GRCh37.73.gtf_0.0.1.RData
>> gr
> GRanges with 2268089 ranges and 12 metadata columns:
>             seqnames         ranges strand   |                 source
>                <Rle>      <IRanges>  <Rle>   |               <factor>
>         [1]        1 [11869, 12227]      +   |   processed_transcript
>         [2]        1 [12613, 12721]      +   |   processed_transcript
>         [3]        1 [13221, 14409]      +   |   processed_transcript
>         [4]        1 [11872, 12227]      +   | unprocessed_pseudogene
>         [5]        1 [12613, 12721]      +   | unprocessed_pseudogene
>         ...      ...            ...    ... ...                    ...
>   [2268085]       MT [14747, 15887]      +   |         protein_coding
>   [2268086]       MT [14747, 15887]      +   |         protein_coding
>   [2268087]       MT [14747, 14749]      +   |         protein_coding
>   [2268088]       MT [15888, 15953]      +   |                Mt_tRNA
>   [2268089]       MT [15956, 16023]      -   |                Mt_tRNA
>                    type     score     phase         gene_id   transcript_id
>                <factor> <numeric> <integer>     <character>     <character>
>         [1]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
>         [2]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
>         [3]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
>         [4]        exon      <NA>      <NA> ENSG00000223972 ENST00000515242
>         [5]        exon      <NA>      <NA> ENSG00000223972 ENST00000515242
>         ...         ...       ...       ...             ...             ...
>   [2268085]        exon      <NA>      <NA> ENSG00000198727 ENST00000361789
>   [2268086]         CDS      <NA>         0 ENSG00000198727 ENST00000361789
>   [2268087] start_codon      <NA>         0 ENSG00000198727 ENST00000361789
>   [2268088]        exon      <NA>      <NA> ENSG00000210195 ENST00000387460
>   [2268089]        exon      <NA>      <NA> ENSG00000210196 ENST00000387461
>             exon_number   gene_name   gene_biotype transcript_name
>               <numeric> <character>    <character>     <character>
>         [1]           1     DDX11L1     pseudogene     DDX11L1-002
>         [2]           2     DDX11L1     pseudogene     DDX11L1-002
>         [3]           3     DDX11L1     pseudogene     DDX11L1-002
>         [4]           1     DDX11L1     pseudogene     DDX11L1-201
>         [5]           2     DDX11L1     pseudogene     DDX11L1-201
>         ...         ...         ...            ...             ...
>   [2268085]           1      MT-CYB protein_coding      MT-CYB-201
>   [2268086]           1      MT-CYB protein_coding      MT-CYB-201
>   [2268087]           1      MT-CYB protein_coding      MT-CYB-201
>   [2268088]           1       MT-TT        Mt_tRNA       MT-TT-201
>   [2268089]           1       MT-TP        Mt_tRNA       MT-TP-201
>                     exon_id      protein_id
>                 <character>     <character>
>         [1] ENSE00002234944            <NA>
>         [2] ENSE00003582793            <NA>
>         [3] ENSE00002312635            <NA>
>         [4] ENSE00002234632            <NA>
>         [5] ENSE00003608237            <NA>
>         ...             ...             ...
>   [2268085] ENSE00001436074            <NA>
>   [2268086]            <NA> ENSP00000354554
>   [2268087]            <NA>            <NA>
>   [2268088] ENSE00001544475            <NA>
>   [2268089] ENSE00001544473            <NA>
>   ---
>   seqlengths:
>      1   2 ...  MT
>     NA  NA ...  NA
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319