[BioC] [Engineers for ensemblgenomes.org #251937] BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart
Cook, Malcolm
MEC at stowers.org
Tue Jun 19 16:35:18 CEST 2012
Hi,
I am chiming in as the original reporter, and cc:ing Herve Pages from the
Bioconductor project, who was instrumental in providing diagnostic feedback
and coded much of the inner workings of the 'R' part.
When I repeat the steps I originally reported against today's BioMart
(Ensembl 67), I still find transcripts with the reported anomaly.
However, for my purposes, the problem is now greatly ameliorated in
that:
- there are only 5 such transcripts
- they are all in the same alternatively spliced gene
- the Bioconductor package now more gracefully raises a warning with a
  detailed report instead of an error.
I believe that examining the detailed report, included in my transcript
below, will reveal the remaining root cause to you.
Thanks for following up! I hope this helps, and I look forward to
seeing this ticket closed!
~ Malcolm Cook
$ R
# use the package (assuming it and dependencies are installed)
library(GenomicFeatures)
# and try to build the TranscriptDb (expect error/warning here)
txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
    dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
metadata: OK
Make the TranscriptDb object ... OK
Warning message:
In .warningWithBioMartDataAnomalyReport(bm_table, idx, id_prefix, :
BioMart data anomaly: in the following transcripts,
the CDS total length inferred from the exon and UTR info
doesn't match the "cds_length" attribute from BioMart.
1. Transcript FBtr0084080:
  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    1         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA        887
2     -1    2         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA        887
3     -1    3         17202324       17202463 FBgn0002781:28-A          NA        NA          NA        NA        887
4     -1    4         17195184       17195967   FBgn0002781:39          NA        NA    17195184  17195428        887
5     -1    5         17200782       17201634 FBgn0002781:27-B          NA        NA          NA        NA        887
2. Transcript FBtr0084077:
  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    3         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       -213
2     -1    4         17202541       17202798   FBgn0002781:29    17202755  17202798          NA        NA       -213
3     -1    1         17202324       17202463 FBgn0002781:28-B          NA        NA          NA        NA       -213
4     -1    2         17177331       17177608    FBgn0002781:1          NA        NA    17177331  17177387       -213
5     -1    5         17200782       17201634 FBgn0002781:27-A          NA        NA          NA        NA       -213
3. Transcript FBtr0084082:
  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    3         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       -466
2     -1    4         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA       -466
3     -1    1         17202324       17202463 FBgn0002781:28-B          NA        NA          NA        NA       -466
4     -1    5         17200782       17201634 FBgn0002781:27-A          NA        NA          NA        NA       -466
5     -1    2         17193632       17193960   FBgn0002781:37          NA        NA    17193632  17193935       -466
4. Transcript FBtr0084079:
  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    1         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       1572
2     -1    2         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA       1572
3     -1    3         17202324       17202463 FBgn0002781:28-A          NA        NA          NA        NA       1572
4     -1    4         17200782       17201634 FBgn0002781:27-B          NA        NA          NA        NA       1572
5     -1    5         17186112       17186276   FBgn0002781:31          NA        NA    17186112  17186276       1572
6     -1    6         17186350       17187009   FBgn0002781:32          NA        NA    17186350  17186803       1572
5. Transcript FBtr0084085:
  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    1         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       1729
2     -1    2         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA       1729
3     -1    3         17202324       17202463 FBgn0002781:28-A          NA        NA          NA        NA       1729
4     -1    4         17200782       17201634 FBgn0002781:27-B          NA        NA          NA        NA       1729
5     -1    5         17187120       17187332   FBgn0002781:33          NA        NA    17187120  17187332       1729
6     -1    6         17187392       17187860   FBgn0002781:34          NA        NA    17187392  17187545       1729
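For anyone who wants to redo the check by hand, here is a minimal sketch of the arithmetic behind the warning, using the rows reported above for transcript 1 (FBtr0084080). It assumes the inferred CDS length is simply the total exon bases minus the 5' and 3' UTR bases. The helper names (interval_width, inferred_cds_length) and the utr5_*/utr3_* column names are made up for illustration; this is not the GenomicFeatures internal code.

```r
# Rows copied from the anomaly report for FBtr0084080 (minus strand).
# Column names utr5_*/utr3_* are illustrative stand-ins for the BioMart
# attributes 5_utr_start, 5_utr_end, 3_utr_start and 3_utr_end.
tx <- data.frame(
  rank             = 1:5,
  exon_chrom_start = c(17203010, 17202541, 17202324, 17195184, 17200782),
  exon_chrom_end   = c(17203121, 17202798, 17202463, 17195967, 17201634),
  ensembl_exon_id  = c("FBgn0002781:30", "FBgn0002781:29", "FBgn0002781:28-A",
                       "FBgn0002781:39", "FBgn0002781:27-B"),
  utr5_start       = c(17203010, 17202749, NA, NA, NA),
  utr5_end         = c(17203121, 17202798, NA, NA, NA),
  utr3_start       = c(NA, NA, NA, 17195184, NA),
  utr3_end         = c(NA, NA, NA, 17195428, NA),
  cds_length       = 887,
  stringsAsFactors = FALSE
)

# Width of closed intervals [start, end]; NA (no UTR on that exon) counts as 0.
interval_width <- function(start, end) {
  w <- end - start + 1
  w[is.na(w)] <- 0
  w
}

# Hypothetical helper: exon bases minus 5' UTR bases minus 3' UTR bases.
inferred_cds_length <- function(df) {
  exon_w <- interval_width(df$exon_chrom_start, df$exon_chrom_end)
  utr5_w <- interval_width(df$utr5_start, df$utr5_end)
  utr3_w <- interval_width(df$utr3_start, df$utr3_end)
  sum(exon_w - utr5_w - utr3_w)
}

inferred <- inferred_cds_length(tx)
reported <- unique(tx$cds_length)

inferred             # 1740
inferred - reported  # 853
```

For this transcript the inferred length is 1740 against a reported cds_length of 887. The difference, 853, equals the width of exon FBgn0002781:27-B, which has rank 5 even though its coordinates lie upstream (on the minus strand) of the rank-4 exon that carries the 3' UTR. That may be a useful clue, but this sketch does not establish the root cause.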
# show off the txdb's metadata
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: BioMart
| Genus and Species: Drosophila melanogaster
| Resource URL: www.biomart.org:80
| BioMart database: ensembl
| BioMart database version: ENSEMBL GENES 67 (SANGER UK)
| BioMart dataset: dmelanogaster_gene_ensembl
| BioMart dataset description: Drosophila melanogaster genes (BDGP5)
| BioMart dataset version: BDGP5
| Full dataset: yes
| miRBase build ID: NA
| transcript_nrow: 25415
| exon_nrow: 74818
| cds_nrow: 62601
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-06-19 09:13:33 -0500 (Tue, 19 Jun 2012)
| GenomicFeatures version at creation time: 1.8.1
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0
# show off details about the version of R and libraries used.
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 Biobase_2.16.0
GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0
BiocInstaller_1.4.6
loaded via a namespace (and not attached):
[1] BSgenome_1.24.0 Biostrings_2.24.1 DBI_0.2-5 RCurl_1.91-1
RSQLite_0.11.1 Rsamtools_1.8.5 XML_3.9-4
biomaRt_2.12.0 bitops_1.0-4.1 rtracklayer_1.16.1 stats4_2.15.0
tools_2.15.0 zlibbioc_1.2.0
>
On 6/19/12 8:35 AM, "kmegy at ebi.ac.uk via RT" <helpdesk at ensemblgenomes.org>
wrote:
>Which species was this again? Drosophila?
>
>I fixed something about STOP codons for Droso., but it's probably not
>what he is talking about.
>
>
>On 19 Jun 2012, at 14:32, Dan Staines wrote:
>
>> I believe that Karyn fixed this but Dan L & co are probably in a better
>>position to comment.
>>
>> On 06/19/2012 01:36 PM, Bert Overduin via RT wrote:
>>> Hi Dan,
>>>
>>> Has this been fixed in EG14?
>>>
>>> Cheers,
>>> Bert
>>>
>>> On Sun, Apr 15, 2012 at 5:56 PM, Dan Staines via RT
>>> <helpdesk at ensemblgenomes.org> wrote:
>>>> Hi Malcolm,
>>>>
>>>> I've just asked for an update on this. Fixes that we've applied
>>>>recently do not
>>>> unfortunately appear to fix the issue. However, we're continuing to
>>>>investigate
>>>> how to fix this and are aiming for a fix for EG14 in May.
>>>>
>>>> Best,
>>>>
>>>> Dan.
>>>>
>>>> .
>>>>
>>>> --
>>>> Ticket Details<URL:
>>>>https://rt.sanger.ac.uk/SelfService/Display.html?id=251937>
>>>>
>>>>
>>>> --
>>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>>> Limited, a charity registered in England with number 1021457 and a
>>>> company registered in England with number 2742969, whose registered
>>>> office is 215 Euston Road, London, NW1 2BE.
>>>
>>>
>>>
>>
>> --
>> Dan Staines, PhD Ensembl Genomes Technical Coordinator
>> EMBL-EBI Tel: +44-(0)1223-492507
>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/
>
>
>
>--
>Ticket Details <URL:
>https://rt.sanger.ac.uk/SelfService/Display.html?id=251937 >