[Bioc-devel] AnnotationHub: cleanup

Morgan, Martin Martin.Morgan at roswellpark.org
Tue Sep 15 06:25:11 CEST 2015

Hi Kasper -- we'll try to act on these, but some comments / looking for clarification...

> -----Original Message-----
> From: Bioc-devel [mailto:bioc-devel-bounces at r-project.org] On Behalf Of
> Kasper Daniel Hansen
> Sent: Monday, September 14, 2015 10:45 PM
> To: bioc-devel at r-project.org
> Subject: [Bioc-devel] AnnotationHub: cleanup
> I currently have the `pleasure` of dealing with students who have problems
> with installing AnnotationHub and/or downloading resources.  Here are some
> comments including some possible bug reports.

I hope this is on the whole a positive experience, and we'll do what we can to make it better.

> 1) I think it is extremely dangerous that `cache(ahub)` starts by asking to
> download all resources!  May I suggest this only happens with a specific
> setting like `cache(ahub, download=TRUE)` or something similar.

> 2) `cache(ahub)` deletes all cached information, except the sqlite database.
> Could we get a way to remove everything?
> 3) While I can understand the difference between cache and hubCache, I
> would suggest that hubCache(ahub) = NULL removes all cached material
> included the sqlite database.

For each of the above, the envisioned use case was that 'hub' is a subset, e.g.,

  subhub = query(hub, c("homo", "ensembl", "81"))

and the user wants to manipulate all records in the sub-hub. cache(subhub) asks whether to 'really download' if the size of the (sub)hub is greater than hubOption("MAX_DOWNLOADS"), which by default is 10; it seems like asking is the same as requiring an argument? fileName(subhub) may be closer to what you're looking for...? It returns the path to the file, or NA if it is not in the cache.
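To make the prompting rule concrete, here is a minimal base-R sketch of the logic described above. It is not the AnnotationHub implementation: should_download(), its arguments, and the prompt text are hypothetical names chosen for illustration, and the threshold of 10 simply mirrors the default mentioned above.

```r
# Hypothetical helper, NOT part of AnnotationHub: decide whether a bulk
# download should go ahead. 'ask' is any function returning TRUE / FALSE.
should_download <- function(n_resources, max_downloads = 10L,
                            ask = function(msg) {
                                interactive() &&
                                    tolower(readline(msg)) %in% c("y", "yes")
                            })
{
    # At or below the threshold: proceed without prompting
    if (n_resources <= max_downloads)
        return(TRUE)
    # Above the threshold: proceed only if the user agrees
    isTRUE(ask(sprintf("download %d resources? [y/n]: ", n_resources)))
}

should_download(5)                                # small sub-hub, no prompt
should_download(500, ask = function(msg) FALSE)   # user declines
```

In a non-interactive session the default 'ask' declines, so a large download never starts silently.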

For cache(subhub) = NULL it wouldn't make sense to delete 5 resources AND the sqlite file for the entire hub.

The sqlite file can be discovered with dbfile(hub) / dbfile(subhub), and removed with file.remove(dbfile(subhub)). In some ways it wasn't envisioned that this manual manipulation would be a common use case (!).

> 4) It seems that AnnotationHub in the release version of Bioconductor
> defaults to using https://.  Wasn't full support for https:// introduced in R
> 3.2.2; if so, it seems to be a critical bug that it is using https://

AnnotationHub uses httr::GET and ultimately curl::curl_fetch_disk rather than native R support, so what R does is not directly relevant. From ?curl

     Drop-in replacement for base 'url' that supports https, ftps,
     gzip, deflate, etc. Default behavior is identical to 'url', but
     request can be fully configured by passing a custom 'handle'.

So I wonder what the actual problem is?

> 5) Perhaps it should be considered that the default hubCache path is
> versioned, perhaps with Bioc version, perhaps with something else.  This
> might cause problems for people running multiple versions of R.

The database is supposed to handle versioning, so if you've populated the cache with Bioc 3.2 and are now accessing the cache with Bioc 3.1, only the 3.1 resources are visible. The hope was to avoid multiple copies of these possibly large resources.

> 6) I strongly suggest that the output printed when retrieving an
> AnnotationHub resource includes the download url.

OK, something that's easy to do! Sometimes this will be cryptic (when the resource is cached in the AnnotationHub server, rather than being retrieved from the original source).

> 7) If you run AnnotationHub without having GenomicRanges / rtracklayer
> installed, it downloads the resource and then pangs out with an error.  To me
> it seems more natural to pang out with an error immediately, especially since
> when it works, it appears from message printing that loading the library
> happens prior to download.

I guess by 'run AnnotationHub' you mean retrieve a specific resource?

The import recipes generally start by require()ing the necessary libraries. I spotted a couple of recipes that didn't follow this convention (for 2bit and chain file resources from rtracklayer; none that involved GenomicRanges). Are there specific examples?
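A minimal base-R sketch of the 'fail before downloading' idea, assuming a recipe knows the packages it needs up front. check_dependencies() is a hypothetical helper for illustration, not an AnnotationHub function; the actual recipes may do this differently.

```r
# Hypothetical helper, NOT part of AnnotationHub: stop before any download
# starts if a package needed by the recipe is not installed.
check_dependencies <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    if (!all(installed))
        stop("install before retrieving this resource: ",
             paste(pkgs[!installed], collapse = ", "), call. = FALSE)
    invisible(TRUE)
}

# Base packages are always available, so this check passes
check_dependencies(c("stats", "utils"))
```

A recipe would call something like this as its first step, e.g. with c("GenomicRanges", "rtracklayer"), and only then start the download.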


> Best,
> Kasper
