[Bioc-devel] exonsBy dropping genes from TxDb

Leonard Goldstein goldstein.leonard at gene.com
Mon Oct 30 19:58:30 CET 2017


Thanks Hervé and others for looking into this.

Leonard

On Sun, Oct 29, 2017 at 10:19 PM, Hervé Pagès <hpages at fredhutch.org> wrote:

> Thanks Dario and Mike for looking into this.
>
> In the mean time I added makeTxDbFromEnsembl() for creating a TxDb
> object by querying directly an Ensembl MySQL server. Only lightly
> tested but so far seems to faster and more reliable than
> makeTxDbFromBiomart():
>
>   > library(GenomicFeatures)
>
>   > system.time(txdb <- makeTxDbFromEnsembl("Homo sapiens", server="
> useastdb.ensembl.org"))
>   Fetch transcripts and genes from Ensembl ... OK
>   Fetch exons and CDS from Ensembl ... OK
>   Fetch chromosome names and lengths from Ensembl ...OK
>   Gather the metadata ... OK
>   Make the TxDb object ... OK
>      user  system elapsed
>    46.420   1.271  58.270
>
>   > txdb
>   TxDb object:
>   # Db type: TxDb
>   # Supporting package: GenomicFeatures
>   # Data source: Ensembl
>   # Organism: Homo sapiens
>   # Ensembl release: 90
>   # Ensembl database: homo_sapiens_core_90_38
>   # MySQL server: useastdb.ensembl.org
>   # transcript_nrow: 220144
>   # exon_nrow: 757782
>   # cds_nrow: 789614
>   # Db created by: GenomicFeatures package from Bioconductor
>   # Creation time: 2017-10-29 22:13:21 -0700 (Sun, 29 Oct 2017)
>   # GenomicFeatures version at creation time: 1.29.16
>   # RSQLite version at creation time: 2.0
>   # DBSCHEMAVERSION: 1.2
>
> It's in GenomicFeatures 1.29.16.
>
> Cheers,
> H.
>
>
> On 10/28/2017 03:32 PM, Mike Smith wrote:
>
>> My feeling is that this isn't related to RCurl, since the example
>> transcript is missing if you use httr to submit the query instead.  You
>> can
>> check this with the code at
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.gi
>> thub.com_grimbough_7e7a47b7a4f64915220ce35cc1ce8f39&d=
>> DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0W
>> YiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7F
>> WBM&s=orkmK9y7kVCLbLB6tlMJfgT5V_GKzVysZ00DMevuAm4&e=
>>
>>
>> I wonder if this is related to BioMart's instability when you submit large
>> queries.  The web interface suggests no more than 500 entries when
>> filtering on something like a gene ID, but in these examples we're
>> applying
>> no filter at all.  Problems related to this have been reported before (
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__support
>> .bioconductor.org_p_86358_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfh
>> Q&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-
>> BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=CxVV4WArQc4xcf74Az8V8vfQ
>> Mr80Q8K2um6sKTggHw0&e=) and I patched the devel version
>>
>> of biomaRt to submit queries in batches of 500 - although this currently
>> has no effect if there isn't a set of 'values' to split into batches.
>>
>> Interestingly, if I first obtain a list of all gene IDs in Ensembl, and
>> then submit those in a batched query I get more exons than your first
>> approach.
>>
>> all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart)
>> attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
>> attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start",
>> "5_utr_start")
>> df3 <- getBM(attributes1, filters = "ensembl_gene_id", values =
>> all_genes[,1], mart=mart)
>> df4 <- getBM(attributes2, filters = "ensembl_gene_id", values =
>> all_genes[,1], mart=mart)
>>
>> dim(df3)
>>>
>> [1] 1309356       6
>>
>> The number of returned results is consistent for even with the larger
>> number of attributes
>>
>> identical(df3[,1], df4[,1])
>>>
>> [1] TRUE
>>
>> And some of these transcripts are clearly not present in the first query:
>>
>> length(which( !unique(df3[,"ensembl_transcript_id"]) %in% df1[,
>>>
>> "ensembl_transcript_id"] ))
>> [1] 28128
>>
>> It could be that my batched code is doing something wrong, but I'm
>> starting
>> to suspect that even the first query is dropping data silently!
>>
>> Mike
>>
>>
>> On 28 October 2017 at 13:00, Dario Strbenac <dstr7320 at uni.sydney.edu.au>
>> wrote:
>>
>> Good day,
>>>
>>> I stepped through the code until execution reached the end of postForm in
>>> RCurl which is called by getBM and obtains the textual result from the
>>> server. If I check the contents of write$value(), the example missing
>>> transcript is not there.
>>>
>>> Browse[3]> grep("ENST00000485971", write$value())
>>> integer(0)
>>>
>>> write$value is a weird function. It's prototype is function (collapse =
>>> "", ...) but its body contains code such as
>>>
>>>      if (is.null(collapse))
>>>          return(txt)
>>>
>>> I wonder where txt is created. It's not passed as an extra variable.
>>>
>>> Browse[7]> print(list(...))
>>> list()
>>>
>>> Searching the R code reveals that txt is created as a global variable in
>>> another function named dynCurlReader by the code statement txt <<-
>>> character().
>>>
>>> RCurl also uses functions that don't begin with a dot but are
>>> undocumented.
>>>
>>> ans = encode(ans)
>>> Browse[7]> ?encode
>>> No documentation for ‘encode’ in specified packages and libraries
>>>
>>> Anyway, the transcript ID is also missing from txt.
>>>
>>> Browse[7]> grep("ENST00000485971", txt)
>>> integer(0)
>>>
>>> It's hard to know what the obfuscated code of RCurl is doing.
>>>
>>> --------------------------------------
>>> Dario Strbenac
>>> University of Sydney
>>> Camperdown NSW 2050
>>> Australia
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.et
>>> hz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt
>>> 84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=nj
>>> pXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJD
>>> MeyF5e0R7MZWxNlxOxbXiH1zLGA&e=
>>>
>>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.et
>> hz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt
>> 84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=nj
>> pXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJD
>> MeyF5e0R7MZWxNlxOxbXiH1zLGA&e=
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list