[Bioc-devel] exonsBy dropping genes from TxDb

Mike Smith grimbough at gmail.com
Sun Oct 29 00:32:37 CEST 2017


My feeling is that this isn't related to RCurl, since the example
transcript is missing if you use httr to submit the query instead.  You can
check this with the code at
https://gist.github.com/grimbough/7e7a47b7a4f64915220ce35cc1ce8f39


I wonder if this is related to BioMart's instability when you submit large
queries.  The web interface suggests no more than 500 entries when
filtering on something like a gene ID, but in these examples we're applying
no filter at all.  Problems related to this have been reported before (
https://support.bioconductor.org/p/86358/) and I patched the devel version
of biomaRt to submit queries in batches of 500 - although this currently
has no effect if there isn't a set of 'values' to split into batches.

Interestingly, if I first obtain a list of all gene IDs in Ensembl, and
then submit those in a batched query I get more exons than your first
approach.

all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart)
attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start",
"5_utr_start")
df3 <- getBM(attributes1, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)
df4 <- getBM(attributes2, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)

> dim(df3)
[1] 1309356       6

The number of returned results is consistent for even with the larger
number of attributes

> identical(df3[,1], df4[,1])
[1] TRUE

And some of these transcripts are clearly not present in the first query:

> length(which( !unique(df3[,"ensembl_transcript_id"]) %in% df1[,
"ensembl_transcript_id"] ))
[1] 28128

It could be that my batched code is doing something wrong, but I'm starting
to suspect that even the first query is dropping data silently!

Mike


On 28 October 2017 at 13:00, Dario Strbenac <dstr7320 at uni.sydney.edu.au>
wrote:

> Good day,
>
> I stepped through the code until execution reached the end of postForm in
> RCurl which is called by getBM and obtains the textual result from the
> server. If I check the contents of write$value(), the example missing
> transcript is not there.
>
> Browse[3]> grep("ENST00000485971", write$value())
> integer(0)
>
> write$value is a weird function. It's prototype is function (collapse =
> "", ...) but its body contains code such as
>
>     if (is.null(collapse))
>         return(txt)
>
> I wonder where txt is created. It's not passed as an extra variable.
>
> Browse[7]> print(list(...))
> list()
>
> Searching the R code reveals that txt is created as a global variable in
> another function named dynCurlReader by the code statement txt <<-
> character().
>
> RCurl also uses functions that don't begin with a dot but are undocumented.
>
> ans = encode(ans)
> Browse[7]> ?encode
> No documentation for ‘encode’ in specified packages and libraries
>
> Anyway, the transcript ID is also missing from txt.
>
> Browse[7]> grep("ENST00000485971", txt)
> integer(0)
>
> It's hard to know what the obfuscated code of RCurl is doing.
>
> --------------------------------------
> Dario Strbenac
> University of Sydney
> Camperdown NSW 2050
> Australia
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list