[Bioc-devel] exonsBy dropping genes from TxDb
Hervé Pagès
hpages at fredhutch.org
Sat Oct 28 05:34:29 CEST 2017
The problem seems to originate in the BioMart service itself:
library(biomaRt)
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
df1 <- getBM(attributes1, mart=mart)
dim(df1)
# [1] 1150974 2
subset(df1, ensembl_transcript_id == "ENST00000485971")
# ensembl_transcript_id ensembl_exon_id
# 1078905 ENST00000485971 ENSE00001952391
# 1078906 ENST00000485971 ENSE00003639486
# 1078907 ENST00000485971 ENSE00001881546
Correct. ENST00000485971 has 3 exons.
But if I request more attributes:
attributes2 <- c(attributes1, "rank", "genomic_coding_start",
"cds_start", "5_utr_start")
df2 <- getBM(attributes2, mart=mart)
dim(df2)
# [1] 1051158 6
Uh, less rows?
subset(df2, ensembl_transcript_id == "ENST00000485971")
# [1] ensembl_transcript_id ensembl_exon_id rank
# [4] genomic_coding_start cds_start 5_utr_start
<0 rows> (or 0-length row.names)
ENST00000485971 disappeared from the result!
It's like someone replaced a LEFT JOIN with an INNER JOIN somewhere
in the BioMart service.
The impact of this on makeTxDbFrombiomart() is that many exons indeed
don't make it into the TxDb object so we end up with many exon-less
transcripts. I'll add a sanity check and will error when this happens.
Better fail with an informative error message than silently return
a broken TxDb.
Before I contact the BioMart people about this, I'd like to remove R
and the biomaRt package from the equation but I'm not sure how to do
that (I was not able to reproduce the problem by using their web
interface https://www.ensembl.org/biomart). Any help with this would
be greatly appreciated.
Thanks,
H.
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
Matrix products: default
BLAS: /home/hpages/R/R-3.4.2-bioc35/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.4.2-bioc35/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.32.1 BiocInstaller_1.26.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 IRanges_2.10.5 XML_3.98-1.9
[4] digest_0.6.12 bitops_1.0-6 DBI_0.7
[7] stats4_3.4.2 RSQLite_2.0 rlang_0.1.2
[10] blob_1.1.0 S4Vectors_0.14.7 tools_3.4.2
[13] bit64_0.9-7 Biobase_2.36.2 RCurl_1.95-4.8
[16] bit_1.1-12 parallel_3.4.2 compiler_3.4.2
[19] BiocGenerics_0.22.1 AnnotationDbi_1.38.2 memoise_1.1.0
[22] tibble_1.3.4
On 10/27/2017 10:05 AM, Hervé Pagès wrote:
> Hi Leonard,
>
> Sorry for missing your earlier posts about this. Will look into it.
>
> Thanks,
> H.
>
> On 10/27/2017 09:07 AM, Leonard Goldstein wrote:
>> Dear bioc-devel,
>>
>> I noticed exonsBy is dropping a lot of genes when run on a TxDb object
>> created with makeTxDbFromBiomart (see below). Please also see related
>> post
>> on the Bioconductor support site:
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__support.bioconductor.org_p_101951_-23102160&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ExVvVr6r4wUKgBmPNOQiJXSI2O1FYqFRz_llKQY_jD0&s=xUz41e2CIjMTXmO5LqkFp12ySvH9zzA0s5OMWzgWosU&e=
>>
>>
>> Thanks for your help.
>>
>> Leonard
>>
>> --
>>> tx <- makeTxDbFromBiomart()
>>> txs_by_gene <- transcriptsBy(tx, "gene")
>>> exs_by_gene <- exonsBy(tx, "gene")
>>> length(txs_by_gene)
>> [1] 63967
>>> length(exs_by_gene)
>> [1] 36751
>>> subsetByOverlaps(txs_by_gene, GRanges("8",
>>> IRanges(127735434,127741434)))
>> GRangesList object of length 1:
>> $ENSG00000136997
>> GRanges object with 9 ranges and 2 metadata columns:
>> seqnames ranges strand | tx_id tx_name
>> <Rle> <IRanges> <Rle> | <integer> <character>
>> [1] 8 [127735434, 127740477] + | 101876 ENST00000259523
>> [2] 8 [127735473, 127735817] + | 101877 ENST00000641252
>> [3] 8 [127735519, 127738772] + | 101878 ENST00000517291
>> [4] 8 [127736046, 127736612] + | 101879 ENST00000641036
>> [5] 8 [127736069, 127741434] + | 101880 ENST00000621592
>> [6] 8 [127736084, 127741434] + | 101881 ENST00000377970
>> [7] 8 [127736220, 127741372] + | 101882 ENST00000524013
>> [8] 8 [127736231, 127738475] + | 101883 ENST00000520751
>> [9] 8 [127736594, 127740958] + | 101884 ENST00000613283
>>
>> -------
>> seqinfo: 555 sequences (1 circular) from an unspecified genome
>>> subsetByOverlaps(exs_by_gene, GRanges("8",
>>> IRanges(127735434,127741434)))
>> GRangesList object of length 0:
>> <0 elements>
>>
>> -------
>> seqinfo: 555 sequences (1 circular) from an unspecified genome
>>> sessionInfo()
>> R version 3.4.1 (2017-06-30)
>> Platform: x86_64-apple-darwin15.6.0 (64-bit)
>> Running under: OS X El Capitan 10.11.6
>>
>> Matrix products: default
>> BLAS:
>> /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
>>
>> LAPACK:
>> /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
>>
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats4 parallel stats graphics grDevices utils datasets
>> [8] methods base
>>
>> other attached packages:
>> [1] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2 Biobase_2.36.2
>> [4] GenomicRanges_1.28.6 GenomeInfoDb_1.12.3 IRanges_2.10.5
>> [7] S4Vectors_0.14.7 BiocGenerics_0.22.1
>>
>> loaded via a namespace (and not attached):
>> [1] Rcpp_0.12.13 XVector_0.16.0
>> [3] GenomicAlignments_1.12.2 zlibbioc_1.22.0
>> [5] BiocParallel_1.10.1 bit_1.1-12
>> [7] lattice_0.20-35 rlang_0.1.2
>> [9] blob_1.1.0 tools_3.4.1
>> [11] grid_3.4.1 SummarizedExperiment_1.6.5
>> [13] DBI_0.7 matrixStats_0.52.2
>> [15] bit64_0.9-7 digest_0.6.12
>> [17] tibble_1.3.4 Matrix_1.2-11
>> [19] GenomeInfoDbData_0.99.0 rtracklayer_1.36.6
>> [21] bitops_1.0-6 RCurl_1.95-4.8
>> [23] biomaRt_2.32.1 memoise_1.1.0
>> [25] RSQLite_2.0 DelayedArray_0.2.7
>> [27] compiler_3.4.1 Rsamtools_1.28.0
>> [29] Biostrings_2.44.2 XML_3.98-1.9
>> [31] pkgconfig_2.0.1
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ExVvVr6r4wUKgBmPNOQiJXSI2O1FYqFRz_llKQY_jD0&s=8iOGgTunhaWEQ1FdrTFYVPCHLcoDa2DDO2pAluGQrOQ&e=
>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list