[Bioc-devel] exonsBy dropping genes from TxDb

Hervé Pagès hpages at fredhutch.org
Sat Oct 28 05:34:29 CEST 2017


The problem seems to originate in the BioMart service itself:

   library(biomaRt)
   mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")

   attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
   df1 <- getBM(attributes1, mart=mart)

   dim(df1)
   # [1] 1150974       2

   subset(df1, ensembl_transcript_id == "ENST00000485971")
   #         ensembl_transcript_id ensembl_exon_id
   # 1078905       ENST00000485971 ENSE00001952391
   # 1078906       ENST00000485971 ENSE00003639486
   # 1078907       ENST00000485971 ENSE00001881546

Correct. ENST00000485971 has 3 exons.

But if I request more attributes:

   attributes2 <- c(attributes1, "rank", "genomic_coding_start", 
"cds_start", "5_utr_start")
   df2 <- getBM(attributes2, mart=mart)

   dim(df2)
   # [1] 1051158       6

Uh, less rows?

   subset(df2, ensembl_transcript_id == "ENST00000485971")
   # [1] ensembl_transcript_id ensembl_exon_id       rank
   # [4] genomic_coding_start  cds_start             5_utr_start
<0 rows> (or 0-length row.names)

ENST00000485971 disappeared from the result!

It's like someone replaced a LEFT JOIN with an INNER JOIN somewhere
in the BioMart service.

The impact of this on makeTxDbFrombiomart() is that many exons indeed
don't make it into the TxDb object so we end up with many exon-less
transcripts. I'll add a sanity check and will error when this happens. 
Better fail with an informative error message than silently return
a broken TxDb.

Before I contact the BioMart people about this, I'd like to remove R
and the biomaRt package from the equation but I'm not sure how to do
that (I was not able to reproduce the problem by using their web
interface https://www.ensembl.org/biomart). Any help with this would
be greatly appreciated.

Thanks,
H.

 > sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/hpages/R/R-3.4.2-bioc35/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.4.2-bioc35/lib/libRlapack.so

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.32.1       BiocInstaller_1.26.1

loaded via a namespace (and not attached):
  [1] Rcpp_0.12.13         IRanges_2.10.5       XML_3.98-1.9
  [4] digest_0.6.12        bitops_1.0-6         DBI_0.7
  [7] stats4_3.4.2         RSQLite_2.0          rlang_0.1.2
[10] blob_1.1.0           S4Vectors_0.14.7     tools_3.4.2
[13] bit64_0.9-7          Biobase_2.36.2       RCurl_1.95-4.8
[16] bit_1.1-12           parallel_3.4.2       compiler_3.4.2
[19] BiocGenerics_0.22.1  AnnotationDbi_1.38.2 memoise_1.1.0
[22] tibble_1.3.4


On 10/27/2017 10:05 AM, Hervé Pagès wrote:
> Hi Leonard,
>
> Sorry for missing your earlier posts about this. Will look into it.
>
> Thanks,
> H.
>
> On 10/27/2017 09:07 AM, Leonard Goldstein wrote:
>> Dear bioc-devel,
>>
>> I noticed exonsBy is dropping a lot of genes when run on a TxDb object
>> created with makeTxDbFromBiomart (see below). Please also see related
>> post
>> on the Bioconductor support site:
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__support.bioconductor.org_p_101951_-23102160&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ExVvVr6r4wUKgBmPNOQiJXSI2O1FYqFRz_llKQY_jD0&s=xUz41e2CIjMTXmO5LqkFp12ySvH9zzA0s5OMWzgWosU&e=
>>
>>
>> Thanks for your help.
>>
>> Leonard
>>
>> --
>>> tx <- makeTxDbFromBiomart()
>>> txs_by_gene <- transcriptsBy(tx, "gene")
>>> exs_by_gene <- exonsBy(tx, "gene")
>>> length(txs_by_gene)
>> [1] 63967
>>> length(exs_by_gene)
>> [1] 36751
>>> subsetByOverlaps(txs_by_gene, GRanges("8",
>>> IRanges(127735434,127741434)))
>> GRangesList object of length 1:
>> $ENSG00000136997
>> GRanges object with 9 ranges and 2 metadata columns:
>>        seqnames                 ranges strand |     tx_id         tx_name
>>           <Rle>              <IRanges>  <Rle> | <integer>     <character>
>>    [1]        8 [127735434, 127740477]      + |    101876 ENST00000259523
>>    [2]        8 [127735473, 127735817]      + |    101877 ENST00000641252
>>    [3]        8 [127735519, 127738772]      + |    101878 ENST00000517291
>>    [4]        8 [127736046, 127736612]      + |    101879 ENST00000641036
>>    [5]        8 [127736069, 127741434]      + |    101880 ENST00000621592
>>    [6]        8 [127736084, 127741434]      + |    101881 ENST00000377970
>>    [7]        8 [127736220, 127741372]      + |    101882 ENST00000524013
>>    [8]        8 [127736231, 127738475]      + |    101883 ENST00000520751
>>    [9]        8 [127736594, 127740958]      + |    101884 ENST00000613283
>>
>> -------
>> seqinfo: 555 sequences (1 circular) from an unspecified genome
>>> subsetByOverlaps(exs_by_gene, GRanges("8",
>>> IRanges(127735434,127741434)))
>> GRangesList object of length 0:
>> <0 elements>
>>
>> -------
>> seqinfo: 555 sequences (1 circular) from an unspecified genome
>>> sessionInfo()
>> R version 3.4.1 (2017-06-30)
>> Platform: x86_64-apple-darwin15.6.0 (64-bit)
>> Running under: OS X El Capitan 10.11.6
>>
>> Matrix products: default
>> BLAS:
>> /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
>>
>> LAPACK:
>> /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
>>
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats4    parallel  stats     graphics  grDevices utils     datasets
>> [8] methods   base
>>
>> other attached packages:
>> [1] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2   Biobase_2.36.2
>> [4] GenomicRanges_1.28.6   GenomeInfoDb_1.12.3    IRanges_2.10.5
>> [7] S4Vectors_0.14.7       BiocGenerics_0.22.1
>>
>> loaded via a namespace (and not attached):
>>   [1] Rcpp_0.12.13               XVector_0.16.0
>>   [3] GenomicAlignments_1.12.2   zlibbioc_1.22.0
>>   [5] BiocParallel_1.10.1        bit_1.1-12
>>   [7] lattice_0.20-35            rlang_0.1.2
>>   [9] blob_1.1.0                 tools_3.4.1
>> [11] grid_3.4.1                 SummarizedExperiment_1.6.5
>> [13] DBI_0.7                    matrixStats_0.52.2
>> [15] bit64_0.9-7                digest_0.6.12
>> [17] tibble_1.3.4               Matrix_1.2-11
>> [19] GenomeInfoDbData_0.99.0    rtracklayer_1.36.6
>> [21] bitops_1.0-6               RCurl_1.95-4.8
>> [23] biomaRt_2.32.1             memoise_1.1.0
>> [25] RSQLite_2.0                DelayedArray_0.2.7
>> [27] compiler_3.4.1             Rsamtools_1.28.0
>> [29] Biostrings_2.44.2          XML_3.98-1.9
>> [31] pkgconfig_2.0.1
>>>
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ExVvVr6r4wUKgBmPNOQiJXSI2O1FYqFRz_llKQY_jD0&s=8iOGgTunhaWEQ1FdrTFYVPCHLcoDa2DDO2pAluGQrOQ&e=
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list