[Bioc-devel] Remote BigWig file access
Vincent Carey
stvjc using channing.harvard.edu
Fri May 24 00:26:53 CEST 2024
thanks
On Thu, May 23, 2024 at 5:36 PM Chris Wilks (gmail) <broadsword using gmail.com>
wrote:
> Thanks Vince, understood about the Core's focus right now.
>
> I think this is something that Leo and I can fix among ourselves for the
> time being.
>
> Looking forward, as you brought up, if we were to refresh recount or
> produce a recount4 (discussed) we'd certainly consider additional coverage
> formats.
>
> I'm aware of tiledb though not duckdb (I'll have to check it out), thanks
> for the pointer.
>
> There's also the D4 format from Aaron Quinlan's lab from a few years ago
> which was explicitly designed to replace bigwigs:
> https://www.nature.com/articles/s43588-021-00085-0
>
> All that said, we're pretty committed to bigwigs at this point given the
> ~750,000 sequence runs we've encoded using them for recount3.
>
> On Wed, May 22, 2024 at 7:17 AM Vincent Carey <stvjc using channing.harvard.edu>
> wrote:
>
>> Really glad to see this discussion moving forward. I would say that the
>> core is wrangling with some even lower-level technical concerns right now,
>> so I can't jump in just now. I just want to raise the question of whether
>> bigWig files are a technologically sound format to continue investing in
>> for the use case of targeted remote query resolution on genomic
>> coordinates. A number of new concepts have come into play since bigWig was
>> designed and implemented. I'll naively mention duckdb and tiledb, which
>> seem to have very good remote performance. Maybe these are too generic ...
>> are there other concepts in GA4GH that might be relevant to leverage for
>> recount-like projects in the future?
>>
>>
>>
>> On Wed, May 22, 2024 at 6:58 AM Chris Wilks (gmail) <broadsword using gmail.com>
>> wrote:
>>
>>> Thanks for sharing Leo, this does interest me, especially since so much is
>>> built on BigWig access via rtracklayer, at least in the recount2 ecosystem.
>>>
>>> As you alluded to, Megadepth currently supports remote access of BigWigs
>>> (and BAMs) over HTTPS on all platforms (Linux, macOS, and Windows),
>>> getting back just the byte ranges overlapping the set of regions requested,
>>> so it should work for at least recount2/recount3 and anything that uses
>>> HTTP(S).
>>>
>>> I'd be open to exploring updates to the Megadepth C/C++ code side to
>>> support Rle if that makes sense to replace rtracklayer.
>>> But to do that you'd need to be involved in updating all the R packages, if
>>> you're willing (both megadepth and those that currently rely on rtracklayer
>>> for this functionality).
>>>
>>> Let me know if you want to chat about this over Zoom,
>>> Chris
>>>
>>> On Tue, May 21, 2024 at 2:41 PM Leonardo Collado Torres <lcolladotor using gmail.com> wrote:
>>>
>>> > Hi Bioc-devel,
>>> >
>>> > As some of you are aware, rtracklayer::import() has long provided
>>> > access to import BigWig files. Those files can be shared on servers
>>> > and accessed remotely thanks to all the effort from many of you in
>>> > building and maintaining rtracklayer.
>>> >
>>> > From my side, derfinder::loadCoverage() relies on
>>> > rtracklayer::import.bw(), and recount::expressed_regions() +
>>> > recount::coverage_matrix() use derfinder::loadCoverage().
>>> > recountWorkflow showcases those recount functions on larger datasets.
>>> > brainflowprobes by Amanda Price, Nina Rajpurohit and others also ends
>>> > up relying on rtracklayer::import.bw() through these functions.
>>> >
>>> > At https://github.com/lawremi/rtracklayer/issues/83 I initially
>>> > reported some issues once our recount2/3 data host changed, but
>>> > previously Brian Schilder also reported that one could no longer read
>>> > remote files https://github.com/lawremi/rtracklayer/issues/73.
>>> > https://github.com/lawremi/rtracklayer/issues/63 and/or
>>> > https://github.com/lawremi/rtracklayer/issues/65 might have been
>>> > related.
>>> >
>>> > Yesterday I updated
>>> > https://github.com/lawremi/rtracklayer/issues/83#issuecomment-2121313270
>>> > with a comment showing some small reproducible code, and that the
>>> > workaround of downloading the data first, then using
>>> > rtracklayer::import() on the local data does work. However, this
>>> > workaround does involve a lot of, hmm, wasteful data transfer.
>>> >
>>> > In the recount vignette, at some point I access just chrY of a bigWig
>>> > file that is about 1300 MB. In the recountWorkflow vignette I do
>>> > something similar for a 7 GB bigWig file. Previously, accessing just
>>> > chrY on these files was a small data transfer.
>>> >
>>> > On recountWorkflow version 1.29.2
>>> > https://github.com/LieberInstitute/recountWorkflow, I've included
>>> > pre-computed results (~2 MB) to avoid downloading tons of data, though
>>> > the vignette code shows how to actually fully reproduce the results if
>>> > you don't mind downloading those large files. I also implemented some
>>> > workarounds on recount, though I haven't yet gone the full route of
>>> > including pre-computed results. I have yet to try implementing a
>>> > workaround for brainflowprobes.
>>> >
>>> >
>>> >
>>> > My understanding is that rtracklayer's root issues are elsewhere and
>>> > that changes in rtracklayer's dependencies have likely created these
>>> > problems. These problems are not always in the control of rtracklayer
>>> > authors to resolve, and also create an unexpected burden on them.
>>> >
>>> > If one considers alternatives to rtracklayer, I see that there's a new
>>> > package https://github.com/PoisonAlien/trackplot/tree/master that uses
>>> > bwtool (a system dependency), an older alternative
>>> > https://github.com/andrelmartins/bigWig that hasn't had updates in 4
>>> > years, and a CRAN package
>>> > (https://cran.r-project.org/web/packages/wig/readme/README.html) that
>>> > recommends using rtracklayer for larger files. I guess that I could
>>> > also try using megadepth https://research.libd.org/megadepth/, though
>>> > derfinder::loadCoverage() uses rtracklayer::import(as = "RleList") for
>>> > efficiency
>>> > https://github.com/lcolladotor/derfinder/blob/f9cd986e0c1b9ea6551d0d8d2077d4501216a661/R/loadCoverage.R#L401
>>> > and lots of functions in that package were built for that structure
>>> > (RleList objects). I likely missed other alternatives.
>>> >
>>> >
>>> > My current line of thought is to keep implementing workarounds using
>>> > local data (sometimes with pre-computed results) for recount,
>>> > recountWorkflow, and brainflowprobes (derfinder only has tests with
>>> > local bigWig files) without really altering the internals of those
>>> > packages. That is, assume that the remote BigWig file access via
>>> > rtracklayer will indefinitely be suspended, though it could be
>>> > supported again at some point and when it does, those packages will
>>> > work again with remote BigWig files as if nothing ever happened. But I
>>> > wanted to check whether this is what others who use BigWig files are
>>> > thinking of doing.
>>> >
>>> > Thanks!
>>> >
>>> > Best,
>>> > Leo
>>> >
>>> >
>>> > Leonardo Collado Torres, Ph. D.
>>> > Investigator, LIEBER INSTITUTE for BRAIN DEVELOPMENT
>>> > Assistant Professor, Department of Biostatistics
>>> > Johns Hopkins Bloomberg School of Public Health
>>> > 855 N. Wolfe St., Room 382
>>> > Baltimore, MD 21205
>>> > lcolladotor.github.io
>>> > lcolladotor using gmail.com
>>> >
>>
>> The information in this email is intended only for the person to whom it
>> is addressed. If you believe this e-mail was sent to you in error and
>> the email contains patient information, please contact the Partners
>> Compliance HelpLine at http://www.partners.org/complianceline . If the
>> email was sent to you in error but does not contain patient information,
>> please contact the sender and properly dispose of the email.
>
>