[Bioc-devel] AnnotationHub: cleanup
stvjc at channing.harvard.edu
Thu Sep 17 16:15:22 CEST 2015
I followed part of this interchange with interest. I would love to see
very wide adoption and appreciation of AnnotationHub, and what I will
describe does not seem to constitute an important obstacle to this, but I
have to confess that aspects of the model and grammar are confusing to me.
I use "cache" mainly as a noun. And in computing applications, IMHO, a
cache is something to be hidden far from the active interface. In
AnnotationHub "cache" names an important function and a key datastructure
for annotation archiving.
What I understand is (2.1.40):
ah = AnnotationHub()  # creates object for file and database access; will
                      # update db if appropriate
cache(ah)  # will offer to acquire all available hub resources for local
           # caching; upon decline will provide a named vector of paths
download 40503 resources? [y/n] n
I am not sure this vector is going to get much use. Maybe a negative
response should return NULL?
The help page says
‘cache(x)’ and ‘cache(x) <- value’: Adds (downloads) all resources in
‘x’, or removes all local resources corresponding to the
records in ‘x’ from the cache.
"download" seems like a reasonable name for part of this functionality.
‘cache(x) <- value’ seems to be concerned mainly with deletion. I can certainly define private
alternate terms for these tasks
in my .Rprofile but I do think a closer correspondence of function name to
action could pay off.
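
A minimal sketch of what such private aliases might look like in an .Rprofile. The names hubDownload and hubRemoveLocal are hypothetical and not part of AnnotationHub; the sketch assumes only that cache() and its replacement form behave as the help page quoted above describes.

```r
# Hypothetical helper names; neither is part of AnnotationHub.
# Both assume the AnnotationHub package is installed and that it
# exports cache() and `cache<-` as described on its help page.

# Download (cache locally) every resource in the hub or sub-hub `x`.
# For large hubs, cache() may first ask for confirmation.
hubDownload <- function(x) {
    AnnotationHub::cache(x)
}

# Remove the locally cached files for the records in `x`.
# This is the functional spelling of `cache(x) <- NULL`.
hubRemoveLocal <- function(x) {
    invisible(AnnotationHub::`cache<-`(x, value = NULL))
}
```

With these in place, hubDownload(subhub) and hubRemoveLocal(subhub) would name the two actions separately, leaving cache() itself untouched.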
On Tue, Sep 15, 2015 at 10:34 AM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:
> On Tue, Sep 15, 2015 at 12:25 AM, Morgan, Martin <
> Martin.Morgan at roswellpark.org> wrote:
> > Hi Kasper -- we'll try to act on these, but some comments / looking for
> > clarification...
> > > -----Original Message-----
> > > From: Bioc-devel [mailto:bioc-devel-bounces at r-project.org] On Behalf Of
> > > Kasper Daniel Hansen
> > > Sent: Monday, September 14, 2015 10:45 PM
> > > To: bioc-devel at r-project.org
> > > Subject: [Bioc-devel] AnnotationHub: cleanup
> > >
> > > I currently have the `pleasure` of dealing with students who have problems
> > > with installing AnnotationHub and/or downloading resources. Here are some
> > > comments including some possible bug reports.
> > I hope this is on the whole a positive experience, and we'll do what we
> > can to make it better.
> Well, I love the package and I love it even more having prepared material
> on it. And the people who complain are of course enriched for people who
> have problems - no way to know if it just works for most people.
> And of course right now it is more troublesome since I prepared the class
> using R-3.2.1 and then 3.2.2 was released just before we started and had
> the http -> https change which is an obvious suspect when people have
> download problems :)
> > > 1) I think it is extremely dangerous that `cache(ahub)` starts by asking to
> > > download all resources! May I suggest this only happens with a
> > > setting like `cache(ahub, download=TRUE)` or something similar.
> > >
> > > 2) `cache(ahub) = NULL` deletes all cached information, except the sqlite
> > > database. Could we get a way to remove everything?
> > >
> > > 3) While I can understand the difference between cache and hubCache, I
> > > would suggest that hubCache(ahub) = NULL removes all cached material
> > > including the sqlite database.
> > For each of the above the envisioned use case was that 'hub' is a sub-hub,
> > eg.,
> > subhub = query(hub, c("homo", "ensembl", "81"))
> > and the user wanted to manipulate all records in the sub-hub.
> > cache(subhub) asks about the "really download" if the size of the sub-hub
> > is greater than hubOption("MAX_DOWNLOADS"), which by default is 10; it
> > seems like asking is the same as requiring an argument? fileName(subhub)
> > may be closer to what you're looking for...? the path to the file name,
> > NA if it is not in the cache.
> > For cache(subhub) = NULL it wouldn't make sense to delete 5 resources AND
> > the sqlite file for the entire hub.
> > The sqlite file can be discovered with dbfile(hub) / dbfile(subhub), and
> > removed with file.remove(dbfile(subhub)). In some ways it wasn't
> > envisioned that this manual manipulation would be a common use case (!).
> Ok. Let me perhaps rephrase my wish list
> 1) some easy way to reset the entire cache issue, with emphasis on easy.
> This is most likely to be used by beginners. How it's done, I don't care
> too much about. And I suggest a heading in ?AnnotationHub called something
> like "Flushing the cache" or something.
> 2) It seems natural that there is a way (for problem reporting) to report
> which resources are cached, which is (again) easy and does not involve
> download. I don't care if it is cache() or some other name.
> > > 4) It seems that AnnotationHub in the release version of Bioconductor
> > > defaults to using https://. Wasn't full support for https:// introduced in R
> > > 3.2.2; if so, it seems to be a critical bug that it is using https://
> > AnnotationHub uses httr::GET and ultimately curl::curl_fetch_disk rather
> > than native R support, so what R does is not directly relevant. From the
> > curl package documentation:
> > Drop-in replacement for base 'url' that supports https, ftps,
> > gzip, deflate, etc. Default behavior is identical to 'url', but
> > request can be fully configured by passing a custom 'handle'.
> > So I wonder what the actual problem is?
> Interesting. Well, at least one user is behind a proxy and uses the tips
> in ?download.file to set a proxy server. Perhaps that doesn't work with
> httr? I don't know. But there is more than one person with problems.
> > > 5) Perhaps it should be considered that the default hubCache path is
> > > versioned, perhaps with Bioc version, perhaps with something else. It
> > > might cause problems for people running multiple versions of R.
> > The data base is supposed to handle versioning, so if you've populated the
> > cache with Bioc 3.2 and are now accessing the cache with Bioc 3.1, only
> > 3.1 resources are visible. The hope was to avoid multiple copies of these
> > possibly large resources.
> That sounds pretty nifty.. I was thinking re-design of the database issues.
> > > 6) I strongly suggest that the output printed when retrieving an
> > > AnnotationHub resource includes the download url.
> > Ok something that's easy to do! Sometimes this will be cryptic (when the
> > resource is cached in the AnnotationHub server, rather than being retrieved
> > from the original source)
> Perhaps it should just say "loading from cache"
> > > 7) If you run AnnotationHub without having GenomicRanges / rtracklayer
> > > installed, it downloads the resource and then pangs out with an error. To me
> > > it seems more natural to pang out with an error immediately, especially since
> > > when it works, it appears from message printing that loading the package
> > > happens prior to download.
> > I guess by 'run AnnotationHub' you mean retrieve a specific resource?
> > The import recipes generally start by require()ing the necessary
> > libraries. I spotted a couple of recipes that didn't follow this pattern
> > (for 2bit and chain file resources from rtracklayer; none that involved
> > GenomicRanges). Are there specific examples?
> As a test case I got a Windows virtual machine up and running, totally clean,
> and just did biocLite("AnnotationHub"). Then I picked two random
> resources and tried to download them; one was a UCSC chain file and I don't
> know the other one. In both cases I totally got a decent error message,
> which I can fully understand. But looking at it with beginner eyes, I just
> thought it was weird that the error on missing a library happened after
> download. It's not a big deal, but if you don't know what you're doing you
> might get confused.