[Bioc-devel] Remote BigWig file access

Vincent Carey @tvjc @end|ng |rom ch@nn|ng@h@rv@rd@edu
Wed May 22 13:17:42 CEST 2024


Really glad to see this discussion moving forward.  I would say that the
core is wrangling with some even lower-level technical concerns right
now, so I can't jump in just now.  I just want to raise the question of
whether bigWig files are a technologically sound format to continue
investing in for the use case of targeted remote query resolution on
genomic coordinates.  A number of new concepts have come into play since
bigWig was designed and implemented.  I'll naively mention DuckDB and
TileDB, which seem to have very good remote performance.  Maybe these
are too generic ... are there other concepts in GA4GH that might be
relevant to leverage for recount-like projects in the future?
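
To make "targeted remote query" concrete, here is a minimal sketch (in
Python, purely for illustration) of the byte-range idea that this kind
of access relies on over HTTP.  It assumes the client has already worked
out, from some index, which byte spans of the remote file cover the
requested regions.  The offsets are made up, and the helper names
merge_byte_ranges and range_header are hypothetical; none of this is
taken from rtracklayer, Megadepth, or any other package in this thread.

```python
from typing import List, Tuple

# Inclusive (first_byte, last_byte) span within a remote file.
ByteRange = Tuple[int, int]


def merge_byte_ranges(ranges: List[ByteRange], max_gap: int = 0) -> List[ByteRange]:
    """Merge overlapping or adjacent spans so fewer requests are needed.

    Spans separated by at most `max_gap` unneeded bytes are also merged,
    trading a little extra transfer for fewer round trips.
    """
    merged: List[ByteRange] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1 + max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def range_header(ranges: List[ByteRange]) -> str:
    """Format spans as the value of an HTTP Range request header."""
    return "bytes=" + ",".join(f"{start}-{end}" for start, end in ranges)


# Hypothetical byte spans for the data blocks covering one chromosome.
blocks = [(4096, 8191), (8192, 12287), (1_000_000, 1_004_095)]
wanted = merge_byte_ranges(blocks)
header_value = range_header(wanted)
```

Not every server honors a multi-range request, so a real client might
have to send one request per merged span instead.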



On Wed, May 22, 2024 at 6:58 AM Chris Wilks (gmail) <broadsword using gmail.com>
wrote:

> Thanks for sharing, Leo. This does interest me, especially since so much is
> built on BigWig access via rtracklayer, at least in the recount2 ecosystem.
>
> As you alluded to, Megadepth currently supports remote access of BigWigs
> (and BAMs) over HTTPS on all platforms (Linux, macOS, and Windows),
> getting back just the byte ranges overlapping the set of regions requested,
> so it should work for at least recount2/recount3 and anything that uses
> HTTP(S).
>
> I'd be open to exploring updates to the Megadepth C/C++ code side to
> support Rle if that makes sense to replace rtracklayer.
> But to do that you'd need to be involved in updating all the R packages if
> you're willing (both megadepth and those that currently rely on rtracklayer
> for this functionality).
>
> Let me know if you want to chat about this over Zoom,
> Chris
>
> On Tue, May 21, 2024 at 2:41 PM Leonardo Collado Torres <
> lcolladotor using gmail.com> wrote:
>
> > Hi Bioc-devel,
> >
> > As some of you are aware, rtracklayer::import() has long supported
> > importing BigWig files. Those files can be shared on servers
> > and accessed remotely thanks to all the effort from many of you in
> > building and maintaining rtracklayer.
> >
> > From my side, derfinder::loadCoverage() relies on
> > rtracklayer::import.bw(), and recount::expressed_regions() +
> > recount::coverage_matrix() use derfinder::loadCoverage().
> > recountWorkflow showcases those recount functions on larger datasets.
> > brainflowprobes by Amanda Price, Nina Rajpurohit and others also ends
> > up relying on rtracklayer::import.bw() through these functions.
> >
> > At https://github.com/lawremi/rtracklayer/issues/83 I initially
> > reported some issues once our recount2/3 data host changed, but
> > previously Brian Schilder also reported that one could no longer read
> > remote files https://github.com/lawremi/rtracklayer/issues/73.
> > https://github.com/lawremi/rtracklayer/issues/63 and/or
> > https://github.com/lawremi/rtracklayer/issues/65 might have been
> > related.
> >
> > Yesterday I updated
> > https://github.com/lawremi/rtracklayer/issues/83#issuecomment-2121313270
> > with a comment showing some small reproducible code, and that the
> > workaround of downloading the data first, then using
> > rtracklayer::import() on the local data does work. However, this
> > workaround does involve a lot of, hmm, wasteful data transfer.
> >
> > On the recount vignette at some point I access just chrY of a bigWig
> > file that is about 1300 MB. On the recountWorkflow vignette I do
> > something similar for a 7 GB bigWig file. Previously, accessing just
> > chrY on these files was a small data transfer.
> >
> > On recountWorkflow version 1.29.2
> > https://github.com/LieberInstitute/recountWorkflow, I've included
> > pre-computed results (~2 MB) to avoid downloading tons of data, though
> > the vignette code shows how to actually fully reproduce the results if
> > you don't mind downloading those large files. I also implemented some
> > workarounds on recount, though I haven't yet gone the full route of
> > including pre-computed results. I have yet to try implementing a
> > workaround for brainflowprobes.
> >
> >
> >
> > My understanding is that the root causes of rtracklayer's issues are
> > elsewhere, and that changes in rtracklayer's dependencies have likely
> > created these problems. These problems are not always in the control of
> > rtracklayer authors to resolve, and they also create an unexpected
> > burden on them.
> >
> > If one considers alternatives to rtracklayer, I see that there's a new
> > package https://github.com/PoisonAlien/trackplot/tree/master that uses
> > bwtool (a system dependency), an older alternative
> > https://github.com/andrelmartins/bigWig that hasn't had updates in 4
> > years, and a CRAN package
> > (https://cran.r-project.org/web/packages/wig/readme/README.html) that
> > recommends using rtracklayer for larger files. I guess that I could
> > also try using megadepth https://research.libd.org/megadepth/, though
> > derfinder::loadCoverage uses rtracklayer::import(as = "RleList") for
> > efficiency
> >
> > https://github.com/lcolladotor/derfinder/blob/f9cd986e0c1b9ea6551d0d8d2077d4501216a661/R/loadCoverage.R#L401
> > and lots of functions in that package were built for that structure
> > (RleList objects). I likely missed other alternatives.
> >
> >
> > My current line of thought is to keep implementing workarounds using
> > local data (sometimes with pre-computed results) for recount,
> > recountWorkflow, and brainflowprobes (derfinder only has tests with
> > local bigWig files) without really altering the internals of those
> > packages. That is, assume that the remote BigWig file access via
> > rtracklayer will indefinitely be suspended, though it could be
> > supported again at some point and, when it is, those packages will
> > work again with remote BigWig files as if nothing ever happened. But I
> > wanted to check in if this is what others who use BigWig files are
> > thinking of doing.
> >
> > Thanks!
> >
> > Best,
> > Leo
> >
> >
> > Leonardo Collado Torres, Ph. D.
> > Investigator, LIEBER INSTITUTE for BRAIN DEVELOPMENT
> > Assistant Professor, Department of Biostatistics
> > Johns Hopkins Bloomberg School of Public Health
> > 855 N. Wolfe St., Room 382
> > Baltimore, MD 21205
> > lcolladotor.github.io
> > lcolladotor using gmail.com
> >
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
