[Bioc-devel] exonsBy dropping genes from TxDb

Hervé Pagès hpages at fredhutch.org
Mon Oct 30 06:19:26 CET 2017


Thanks Dario and Mike for looking into this.

In the mean time I added makeTxDbFromEnsembl() for creating a TxDb
object by querying directly an Ensembl MySQL server. Only lightly
tested but so far seems to faster and more reliable than
makeTxDbFromBiomart():

   > library(GenomicFeatures)

   > system.time(txdb <- makeTxDbFromEnsembl("Homo sapiens", 
server="useastdb.ensembl.org"))
   Fetch transcripts and genes from Ensembl ... OK
   Fetch exons and CDS from Ensembl ... OK
   Fetch chromosome names and lengths from Ensembl ...OK
   Gather the metadata ... OK
   Make the TxDb object ... OK
      user  system elapsed
    46.420   1.271  58.270

   > txdb
   TxDb object:
   # Db type: TxDb
   # Supporting package: GenomicFeatures
   # Data source: Ensembl
   # Organism: Homo sapiens
   # Ensembl release: 90
   # Ensembl database: homo_sapiens_core_90_38
   # MySQL server: useastdb.ensembl.org
   # transcript_nrow: 220144
   # exon_nrow: 757782
   # cds_nrow: 789614
   # Db created by: GenomicFeatures package from Bioconductor
   # Creation time: 2017-10-29 22:13:21 -0700 (Sun, 29 Oct 2017)
   # GenomicFeatures version at creation time: 1.29.16
   # RSQLite version at creation time: 2.0
   # DBSCHEMAVERSION: 1.2

It's in GenomicFeatures 1.29.16.

Cheers,
H.


On 10/28/2017 03:32 PM, Mike Smith wrote:
> My feeling is that this isn't related to RCurl, since the example
> transcript is missing if you use httr to submit the query instead.  You can
> check this with the code at
> https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_grimbough_7e7a47b7a4f64915220ce35cc1ce8f39&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=orkmK9y7kVCLbLB6tlMJfgT5V_GKzVysZ00DMevuAm4&e=
>
>
> I wonder if this is related to BioMart's instability when you submit large
> queries.  The web interface suggests no more than 500 entries when
> filtering on something like a gene ID, but in these examples we're applying
> no filter at all.  Problems related to this have been reported before (
> https://urldefense.proofpoint.com/v2/url?u=https-3A__support.bioconductor.org_p_86358_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=CxVV4WArQc4xcf74Az8V8vfQMr80Q8K2um6sKTggHw0&e=) and I patched the devel version
> of biomaRt to submit queries in batches of 500 - although this currently
> has no effect if there isn't a set of 'values' to split into batches.
>
> Interestingly, if I first obtain a list of all gene IDs in Ensembl, and
> then submit those in a batched query I get more exons than your first
> approach.
>
> all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart)
> attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
> attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start",
> "5_utr_start")
> df3 <- getBM(attributes1, filters = "ensembl_gene_id", values =
> all_genes[,1], mart=mart)
> df4 <- getBM(attributes2, filters = "ensembl_gene_id", values =
> all_genes[,1], mart=mart)
>
>> dim(df3)
> [1] 1309356       6
>
> The number of returned results is consistent for even with the larger
> number of attributes
>
>> identical(df3[,1], df4[,1])
> [1] TRUE
>
> And some of these transcripts are clearly not present in the first query:
>
>> length(which( !unique(df3[,"ensembl_transcript_id"]) %in% df1[,
> "ensembl_transcript_id"] ))
> [1] 28128
>
> It could be that my batched code is doing something wrong, but I'm starting
> to suspect that even the first query is dropping data silently!
>
> Mike
>
>
> On 28 October 2017 at 13:00, Dario Strbenac <dstr7320 at uni.sydney.edu.au>
> wrote:
>
>> Good day,
>>
>> I stepped through the code until execution reached the end of postForm in
>> RCurl which is called by getBM and obtains the textual result from the
>> server. If I check the contents of write$value(), the example missing
>> transcript is not there.
>>
>> Browse[3]> grep("ENST00000485971", write$value())
>> integer(0)
>>
>> write$value is a weird function. It's prototype is function (collapse =
>> "", ...) but its body contains code such as
>>
>>      if (is.null(collapse))
>>          return(txt)
>>
>> I wonder where txt is created. It's not passed as an extra variable.
>>
>> Browse[7]> print(list(...))
>> list()
>>
>> Searching the R code reveals that txt is created as a global variable in
>> another function named dynCurlReader by the code statement txt <<-
>> character().
>>
>> RCurl also uses functions that don't begin with a dot but are undocumented.
>>
>> ans = encode(ans)
>> Browse[7]> ?encode
>> No documentation for ‘encode’ in specified packages and libraries
>>
>> Anyway, the transcript ID is also missing from txt.
>>
>> Browse[7]> grep("ENST00000485971", txt)
>> integer(0)
>>
>> It's hard to know what the obfuscated code of RCurl is doing.
>>
>> --------------------------------------
>> Dario Strbenac
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJDMeyF5e0R7MZWxNlxOxbXiH1zLGA&e=
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJDMeyF5e0R7MZWxNlxOxbXiH1zLGA&e=
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list