[Bioc-devel] AnnotationHub: cleanup

Thu Sep 17 16:15:22 CEST 2015

I followed part of this interchange with interest.  I would love to see
very wide adoption and appreciation of AnnotationHub, and what I will
describe does not seem to constitute an important obstacle to this, but I
have to confess that aspects of the model and grammar are confusing to me.

I use "cache" mainly as a noun.  And in computing applications, IMHO, a
cache is something to be hidden far from the active interface.  In
AnnotationHub "cache" names an important function and a key data structure
for annotation archiving.

What I understand is (2.1.40):

ah = AnnotationHub()  # creates object for file and database access; will
                      # update db if appropriate
cache(ah)  # will offer to acquire all available hub resources for local
           # caching; upon decline will provide a named vector of paths

> cache(ah)
download 40503 resources? [y/n] n
                             AH5086                              AH5087
 "/Users/stvjc/.AnnotationHub/5086"  "/Users/stvjc/.AnnotationHub/5087"
                            AH14108                             AH15146

I am not sure this vector is going to get much use.  Maybe a negative
response should return NULL?
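
For illustration, a minimal sketch of one thing that could be done with
such a named vector of paths: keep only the entries whose files are
actually on disk.  `cachedOnly` is a hypothetical helper, not part of
AnnotationHub, and the example paths are made up under a temporary
directory.

```r
# Hypothetical helper (not an AnnotationHub function): given a named
# character vector of local paths, keep the entries whose file exists.
cachedOnly <- function(paths) {
    stopifnot(is.character(paths))
    present <- !is.na(paths) & file.exists(paths)
    paths[present]
}

# Made-up example: two resource ids, only one of which is on disk.
dir <- tempfile("hub")
dir.create(dir)
paths <- c(AH5086 = file.path(dir, "5086"),
           AH5087 = file.path(dir, "5087"))
file.create(paths[["AH5086"]])

# Names of the resources that are present locally.
names(cachedOnly(paths))
```

Nothing here touches the network, so it could serve for problem reports
of the "what is cached?" kind.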

The help page says

‘cache(x)’ and ‘cache(x) <- value’: Adds (downloads) all resources in
          ‘x’, or removes all local resources corresponding to the
          records in ‘x’ from the cache.

"download" seems like a reasonable name for part of this functionality.
"cache<-" seems to be concerned mainly with deletion.  I can certainly
define private alternate terms for these tasks in my .Rprofile, but I do
think a closer correspondence of function name to action could pay off.
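
A minimal sketch of such private names, assuming AnnotationHub is
attached before the functions are called.  `downloadResources` and
`removeLocal` are hypothetical names, not part of the package; they only
delegate to cache() and `cache<-` as described on the help page.

```r
# Hypothetical wrappers for an .Rprofile.  They only rename the existing
# cache() and `cache<-` functions and assume library(AnnotationHub) has
# been called before they are used.

# Download (or locate) the resources for the records in 'x'.
downloadResources <- function(x) {
    cache(x)
}

# Remove the local copies of the records in 'x' from the cache.
removeLocal <- function(x) {
    cache(x) <- NULL
    invisible(x)
}
```

With a subset such as the `subhub` built by query() in the message
quoted below, downloadResources(subhub) and removeLocal(subhub) would
then say what they do.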

On Tue, Sep 15, 2015 at 10:34 AM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

> On Tue, Sep 15, 2015 at 12:25 AM, Morgan, Martin <
> Martin.Morgan at roswellpark.org> wrote:
>
> > Hi Kasper -- we'll try to act on these, but some comments / looking for
> > clarification...
> >
> > > -----Original Message-----
> > > From: Bioc-devel [mailto:bioc-devel-bounces at r-project.org] On Behalf Of
> > > Kasper Daniel Hansen
> > > Sent: Monday, September 14, 2015 10:45 PM
> > > To: bioc-devel at r-project.org
> > > Subject: [Bioc-devel] AnnotationHub: cleanup
> > >
> > > I currently have the `pleasure` of dealing with students who have
> > > problems with installing AnnotationHub and/or downloading resources.
> > > Here are some comments including some possible bug reports.
> >
> > I hope this is on the whole a positive experience, and we'll do what we
> > can to make it better.
> >
>
> Well, I love the package and I love it even more having prepared material
> on it.  And the people who complain are of course enriched for people who
> have problems - no way to know if it just works for most people.
>
> And of course right now it is more troublesome since I prepared the class
> using R-3.2.1 and then 3.2.2 was released just before we started and had
> the http -> https change which is an obvious suspect when people have
> download problems :)
>
> > > 1) I think it is extremely dangerous that `cache(ahub)` starts by
> > > asking to download all resources!  May I suggest this only happens
> > > with a specific setting like `cache(ahub, download=TRUE)` or
> > > something similar.
> > >
> > > 2) `cache(ahub)` deletes all cached information, except the sqlite
> > > database.  Could we get a way to remove everything?
> > >
> > > 3) While I can understand the difference between cache and hubCache, I
> > > would suggest that hubCache(ahub) = NULL removes all cached material
> > > including the sqlite database.
> >
> > For each of the above the envisioned use case was that 'hub' is a
> > subset, e.g.,
> >
> >   subhub = query(hub, c("homo", "ensembl", "81"))
> >
> > and the user wanted to manipulate all records in the sub-hub.
> > cache(subhub) asks about the "really download" if the size of the
> > (sub)hub is greater than hubOption("MAX_DOWNLOADS"), which by default
> > is 10; it seems like asking is the same as requiring an argument?
> > fileName(subhub) may be closer to what you're looking for...? the path
> > to the file name, or NA if it is not in the cache.
> >
> > For cache(subhub) = NULL it wouldn't make sense to delete 5 resources AND
> > the sqlite file for the entire hub.
> >
> > The sqlite file can be discovered with dbfile(hub) / dbfile(subhub), and
> > removed with file.remove(dbfile(subhub)). In some ways it wasn't
> > envisioned that this manual manipulation would be a common use case (!).
>
>
> Ok.  Let me perhaps rephrase my wish list
> 1) some easy way to reset the entire cache issue, with emphasis on easy.
> This is most likely to be used by beginners.  How it's done, I don't care
> too much about.  And I suggest a heading in ?AnnotationHub called something
> like "Flushing the cache" or something.
> 2) It seems natural that there is a way (for problem reporting) to report
> which resources are cached, which is (again) easy and does not involve
> download.  I don't care if it is cache() or some other name.
>
> > > 4) It seems that AnnotationHub in the release version of Bioconductor
> > > defaults to using https://.  Wasn't full support for https://
> > > introduced in R 3.2.2; if so, it seems to be a critical bug that it
> > > is using https://
> >
> > AnnotationHub uses httr::GET and ultimately curl::curl_fetch_disk rather
> > than native R support, so what R does is not directly relevant. From
> > ?curl
> >
> >      Drop-in replacement for base 'url' that supports https, ftps,
> >      gzip, deflate, etc. Default behavior is identical to 'url', but
> >      request can be fully configured by passing a custom 'handle'.
> >
> > So I wonder what the actual problem is?
> >
>
> Interesting.  Well, at least one user is behind a proxy and uses the tips
> in ?download.file to set a proxy server.  Perhaps that doesn't work with
> httr?  I don't know.  But there is more than one person with problems.
>
> > > 5) Perhaps it should be considered that the default hubCache path is
> > > versioned, perhaps with Bioc version, perhaps with something else.
> > > This might cause problems for people running multiple versions of R.
> >
> > The data base is supposed to handle versioning, so if you've populated
> > the cache with Bioc 3.2 and are now accessing the cache with Bioc 3.1,
> > only the 3.1 resources are visible. The hope was to avoid multiple
> > copies of these possibly large resources.
>
>
> That sounds pretty nifty.. I was thinking re-design of the database issues.
>
>
> > > 6) I strongly suggest that the output printed when retrieving an
> > > AnnotationHub resource includes the download url.
> >
> > Ok something that's easy to do! Sometimes this will be cryptic (when the
> > resource is cached in the AnnotationHub server, rather than being
> > retrieved from the original source)
>
>
> Perhaps it should just say "loading from cache"
>
>
> > > 7) If you run AnnotationHub without having GenomicRanges / rtracklayer
> > > installed, it downloads the resource and then pangs out with an error.
> > > To me it seems more natural to pang out with an error immediately,
> > > especially since when it works, it appears from message printing that
> > > loading the library happens prior to download.
> >
> > I guess by 'run AnnotationHub' you mean retrieve a specific resource?
> >
> > The import recipes generally start by require()ing the necessary
> > libraries. I spotted a couple of recipes that didn't follow this
> > convention (for 2bit and chain file resources from rtracklayer; none
> > that involved GenomicRanges). Are there specific examples?
> >
>
> As a test case I got a Windows virtual machine up and running, totally
> clean, and just did biocLite("AnnotationHub").  Then I picked two random
> resources and tried to download them; one was a UCSC chain file and I don't
> know the other one.  In both cases I totally got a decent error message,
> which I can fully understand.  But looking at it with beginner eyes, I just
> thought it was weird that the error on missing a library happened after
> download.  It's not a big deal, but if you don't know what you're doing you
> might get confused.
>
> Best,
> Kasper
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
