[Bioc-devel] Remote BigWig file access

Chris Wilks (gmail) broadsword using gmail.com
Thu May 23 23:35:57 CEST 2024


Thanks Vince, understood about the Core's focus right now.

I think this is something that Leo and I can fix between ourselves for the
time being.

Looking forward, as you brought up, if we were to refresh recount or
produce a recount4 (which has been discussed), we'd certainly consider
additional coverage formats.

I'm aware of tiledb though not duckdb (I'll have to check it out), thanks
for the pointer.

There's also the D4 format from Aaron Quinlan's lab from a few years ago
which was explicitly designed to replace bigwigs:
https://www.nature.com/articles/s43588-021-00085-0

All that said, we're pretty committed to bigwigs at this point given the
~750,000 sequence runs we've encoded using them for recount3.

On Wed, May 22, 2024 at 7:17 AM Vincent Carey <stvjc using channing.harvard.edu>
wrote:

> Really glad to see this discussion moving forward.  I would say that the
> core is wrangling with some even lower-level technical concerns right now,
> so I can't jump in just now.  I just want to raise the question of whether
> bigWig files are a technologically sound format to continue investing in
> for the use case of targeted remote query resolution on genomic
> coordinates.  A number of new concepts have come into play since bigWig
> was designed and implemented.  I'll naively mention duckdb and tiledb,
> which seem to have very good remote performance.  Maybe these are too
> generic ... are there other concepts in GA4GH that might be relevant to
> leverage for recount-like projects in the future?
>
>
>
> On Wed, May 22, 2024 at 6:58 AM Chris Wilks (gmail) <broadsword using gmail.com>
> wrote:
>
>> Thanks for sharing Leo, this does interest me, especially since so much is
>> built on BigWig access via rtracklayer at least in the recount2 ecosystem.
>>
>> As you alluded to, Megadepth currently supports remote access of BigWigs
>> (and BAMs) over HTTPS on all platforms (Linux, macOS, and Windows),
>> getting back just the byte ranges overlapping the set of regions
>> requested, so it should work for at least recount2/recount3 and anything
>> else that is served over HTTP(S).
>>
>> I'd be open to exploring updates to the Megadepth C/C++ code side to
>> support Rle if that makes sense to replace rtracklayer.
>> But to do that you'd need to be involved in updating all the R packages if
>> you're willing (both megadepth and those that currently rely on
>> rtracklayer
>> for this functionality).
>>
>> Let me know if you want to chat about this over Zoom,
>> Chris
>>
>> On Tue, May 21, 2024 at 2:41 PM Leonardo Collado Torres <
>> lcolladotor using gmail.com> wrote:
>>
>> > Hi Bioc-devel,
>> >
>> > As some of you are aware, rtracklayer::import() has long provided
>> > a way to import BigWig files. Those files can be shared on servers
>> > and accessed remotely thanks to all the effort from many of you in
>> > building and maintaining rtracklayer.
>> >
>> > From my side, derfinder::loadCoverage() relies on
>> > rtracklayer::import.bw(), and recount::expressed_regions() +
>> > recount::coverage_matrix() use derfinder::loadCoverage().
>> > recountWorkflow showcases those recount functions on larger datasets.
>> > brainflowprobes by Amanda Price, Nina Rajpurohit and others also ends
>> > up relying on rtracklayer::import.bw() through these functions.
>> >
>> > At https://github.com/lawremi/rtracklayer/issues/83 I initially
>> > reported some issues once our recount2/3 data host changed, but
>> > previously Brian Schilder also reported that one could no longer read
>> > remote files https://github.com/lawremi/rtracklayer/issues/73.
>> > https://github.com/lawremi/rtracklayer/issues/63 and/or
>> > https://github.com/lawremi/rtracklayer/issues/65 might have been
>> > related.
>> >
>> > Yesterday I updated
>> > https://github.com/lawremi/rtracklayer/issues/83#issuecomment-2121313270
>> > with a comment showing some small reproducible code, and that the
>> > workaround of downloading the data first, then using
>> > rtracklayer::import() on the local data does work. However, this
>> > workaround does involve a lot of, hmm, wasteful data transfer.
>> >
>> > On the recount vignette at some point I access just chrY of a bigWig
>> > file that is about 1300 MB. On the recountWorkflow vignette I do
>> > something similar for a 7GB bigWig file. Previously accessing just
>> > chrY on these files was a small data transfer.
>> >
>> > On recountWorkflow version 1.29.2
>> > https://github.com/LieberInstitute/recountWorkflow, I've included
>> > pre-computed results (~2 MB) to avoid downloading tons of data, though
>> > the vignette code shows how to actually fully reproduce the results if
>> > you don't mind downloading those large files. I also implemented some
>> > workarounds on recount, though I haven't yet gone the full route of
>> > including pre-computed results. I have yet to try implementing a
>> > workaround for brainflowprobes.
>> >
>> >
>> >
>> > My understanding is that rtracklayer's root issues are elsewhere and
>> > that changes in rtracklayer's dependencies have likely created these
>> > problems.
>> > These problems are not always in the control of rtracklayer authors to
>> > resolve, and also create an unexpected burden on them.
>> >
>> > If one considers alternatives to rtracklayer, I see that there's a new
>> > package https://github.com/PoisonAlien/trackplot/tree/master that uses
>> > bwtool (a system dependency), an older alternative
>> > https://github.com/andrelmartins/bigWig that hasn't had updates in 4
>> > years, and a CRAN package
>> > (https://cran.r-project.org/web/packages/wig/readme/README.html) that
>> > recommends using rtracklayer for larger files. I guess that I could
>> > also try using megadepth https://research.libd.org/megadepth/, though
>> > derfinder::loadCoverage uses rtracklayer::import(as = "RleList") for
>> > efficiency
>> > https://github.com/lcolladotor/derfinder/blob/f9cd986e0c1b9ea6551d0d8d2077d4501216a661/R/loadCoverage.R#L401
>> > and lots of functions in that package were built for that structure
>> > (RleList objects). I likely missed other alternatives.
>> >
>> >
>> > My current line of thought is to keep implementing workarounds using
>> > local data (sometimes with pre-computed results) for recount,
>> > recountWorkflow, and brainflowprobes (derfinder only has tests with
>> > local bigWig files) without really altering the internals of those
>> > packages. That is, assume that remote BigWig file access via
>> > rtracklayer will be suspended indefinitely, though it could be
>> > supported again at some point, and when it is, those packages will
>> > work again with remote BigWig files as if nothing ever happened. But I
>> > wanted to check whether this is what others who use BigWig files are
>> > thinking of doing.
>> >
>> > Thanks!
>> >
>> > Best,
>> > Leo
>> >
>> >
>> > Leonardo Collado Torres, Ph. D.
>> > Investigator, LIEBER INSTITUTE for BRAIN DEVELOPMENT
>> > Assistant Professor, Department of Biostatistics
>> > Johns Hopkins Bloomberg School of Public Health
>> > 855 N. Wolfe St., Room 382
>> > Baltimore, MD 21205
>> > lcolladotor.github.io
>> > lcolladotor using gmail.com
>> >


