[Bioc-devel] Remote BigWig file access
Håkon Tjeldnes
h@uken_heyken @end|ng |rom hotm@||@com
Sat May 25 18:02:52 CEST 2024
We have been experimenting with other formats in our package ORFik. The fst format is also a good candidate, though the problem is that only R and Julia support it currently. My biggest problems with bigWigs are the slow full-file access time and the lack of support for multiple score columns (as far as I know).
________________________________
From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of Vincent Carey <stvjc using channing.harvard.edu>
Sent: Friday, May 24, 2024 12:26:53 AM
To: Chris Wilks (gmail) <broadsword using gmail.com>
Cc: Price, Amanda (NIH/NICHD) [E] <amanda.price using nih.gov>; Bioc-devel <bioc-devel using r-project.org>; Nina Rajpurohit <Nina.Rajpurohit using libd.org>; Jaffe, Andrew E. <andrewejaffe using gmail.com>
Subject: Re: [Bioc-devel] Remote BigWig file access
thanks
On Thu, May 23, 2024 at 5:36 PM Chris Wilks (gmail) <broadsword using gmail.com>
wrote:
> Thanks Vince, understood about the Core's focus right now.
>
> I think this is something that Leo and I can fix among ourselves for the
> time being.
>
> Looking forward, as you brought up, if we were to refresh recount or
> produce a recount4 (discussed) we'd certainly consider additional coverage
> formats.
>
> I'm aware of tiledb though not duckdb (I'll have to check it out), thanks
> for the pointer.
>
> There's also the D4 format from Aaron Quinlan's lab from a few years ago
> which was explicitly designed to replace bigwigs:
> https://www.nature.com/articles/s43588-021-00085-0
>
> All that said, we're pretty committed to bigwigs at this point given the
> ~750,000 sequence runs we've encoded using them for recount3.
>
> On Wed, May 22, 2024 at 7:17 AM Vincent Carey <stvjc using channing.harvard.edu>
> wrote:
>
>> Really glad to see this discussion moving forward. I would say that the
>> core is wrangling with some even lower-level technical concerns right now,
>> so I can't jump in just now. I just want to raise the question of whether
>> bigWig files are a technologically sound format to continue investing in
>> for the use case of targeted remote query resolution on genomic
>> coordinates. A number of new concepts have come into play since bigWig
>> was designed and implemented. I'll naively mention duckdb and tiledb,
>> which seem to have very good remote performance. Maybe these are too
>> generic ... are there other concepts in GA4GH that might be relevant to
>> leverage for recount-like projects in the future?
>>
>>
>>
>> On Wed, May 22, 2024 at 6:58 AM Chris Wilks (gmail) <broadsword using gmail.com>
>> wrote:
>>
>>> Thanks for sharing Leo, this does interest me, especially since so much
>>> is built on BigWig access via rtracklayer at least in the recount2
>>> ecosystem.
>>>
>>> As you alluded to, Megadepth currently supports remote access of BigWigs
>>> (and BAMs) over HTTPS on all platforms (Linux, MacOS, and Windows),
>>> getting back just the byte ranges overlapping the set of regions
>>> requested, so it should work for at least recount2/recount3 and anything
>>> that uses HTTP/s.
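
The byte-range idea described above can be sketched in a few lines. This is
only an illustration of the general technique, not Megadepth's actual code:
the (offset, size) index entries are hypothetical, and merge_blocks and
range_header are names made up for this example. It assumes a file index has
already told us which data blocks overlap the requested regions, and shows how
those blocks might be coalesced into an HTTP Range header value.

```python
from typing import List, Tuple

# Hypothetical index entry: (byte offset, size in bytes) of one data block.
Block = Tuple[int, int]


def merge_blocks(blocks: List[Block], max_gap: int = 0) -> List[Tuple[int, int]]:
    """Coalesce (offset, size) blocks into inclusive (start, end) byte spans.

    Blocks that overlap, touch, or sit within max_gap bytes of the previous
    span are merged so that fewer ranges have to be requested.
    """
    spans: List[Tuple[int, int]] = []
    for offset, size in sorted(blocks):
        start, end = offset, offset + size - 1
        if spans and start <= spans[-1][1] + 1 + max_gap:
            # Overlapping or adjacent: extend the previous span.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans


def range_header(spans: List[Tuple[int, int]]) -> str:
    """Format inclusive byte spans as an HTTP Range header value."""
    return "bytes=" + ",".join(f"{start}-{end}" for start, end in spans)


if __name__ == "__main__":
    # Three made-up blocks; the first two listed after sorting are adjacent.
    blocks = [(1000, 50), (100, 100), (200, 100)]
    print(range_header(merge_blocks(blocks)))
```

Not every server honours multi-range requests, so a real client might instead
send one request per merged span.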
>>>
>>> I'd be open to exploring updates to the Megadepth C/C++ code side to
>>> support Rle if that makes sense to replace rtracklayer.
>>> But to do that you'd need to be involved in updating all the R packages
>>> if you're willing (both megadepth and those that currently rely on
>>> rtracklayer for this functionality).
>>>
>>> Let me know if you want to chat about this over Zoom,
>>> Chris
>>>
>>> On Tue, May 21, 2024 at 2:41 PM Leonardo Collado Torres <
>>> lcolladotor using gmail.com> wrote:
>>>
>>> > Hi Bioc-devel,
>>> >
>>> > As some of you are aware, rtracklayer::import() has long provided
>>> > access to import BigWig files. Those files can be shared on servers
>>> > and accessed remotely thanks to all the effort from many of you in
>>> > building and maintaining rtracklayer.
>>> >
>>> > From my side, derfinder::loadCoverage() relies on
>>> > rtracklayer::import.bw(), and recount::expressed_regions() +
>>> > recount::coverage_matrix() use derfinder::loadCoverage().
>>> > recountWorkflow showcases those recount functions on larger datasets.
>>> > brainflowprobes by Amanda Price, Nina Rajpurohit and others also ends
>>> > up relying on rtracklayer::import.bw() through these functions.
>>> >
>>> > At https://github.com/lawremi/rtracklayer/issues/83 I initially
>>> > reported some issues once our recount2/3 data host changed, but
>>> > previously Brian Schilder also reported that one could no longer read
>>> > remote files https://github.com/lawremi/rtracklayer/issues/73.
>>> > https://github.com/lawremi/rtracklayer/issues/63 and/or
>>> > https://github.com/lawremi/rtracklayer/issues/65 might have been
>>> > related.
>>> >
>>> > Yesterday I updated
>>> > https://github.com/lawremi/rtracklayer/issues/83#issuecomment-2121313270
>>> > with a comment showing some small reproducible code, and that the
>>> > workaround of downloading the data first, then using
>>> > rtracklayer::import() on the local data does work. However, this
>>> > workaround does involve a lot of, hmm, wasteful data transfer.
>>> >
>>> > On the recount vignette at some point I access just chrY of a bigWig
>>> > file that is about 1300 MB. On the recountWorkflow vignette I do
>>> > something similar for a 7GB bigWig file. Previously accessing just
>>> > chrY on these files was a small data transfer.
>>> >
>>> > On recountWorkflow version 1.29.2
>>> > https://github.com/LieberInstitute/recountWorkflow, I've included
>>> > pre-computed results (~2 MB) to avoid downloading tons of data, though
>>> > the vignette code shows how to actually fully reproduce the results if
>>> > you don't mind downloading those large files. I also implemented some
>>> > workarounds on recount, though I haven't yet gone the full route of
>>> > including pre-computed results. I have yet to try implementing a
>>> > workaround for brainflowprobes.
>>> >
>>> >
>>> >
>>> > My understanding is that rtracklayer's root issues are elsewhere and
>>> > that changes in rtracklayer's dependencies have likely created these
>>> > problems. These problems are not always in the control of rtracklayer
>>> > authors to resolve, and also create an unexpected burden on them.
>>> >
>>> > If one considers alternatives to rtracklayer, I see that there's a new
>>> > package https://github.com/PoisonAlien/trackplot/tree/master that uses
>>> > bwtool (a system dependency), an older alternative
>>> > https://github.com/andrelmartins/bigWig that hasn't had updates in 4
>>> > years, and a CRAN package
>>> > (https://cran.r-project.org/web/packages/wig/readme/README.html) that
>>> > recommends using rtracklayer for larger files. I guess that I could
>>> > also try using megadepth https://research.libd.org/megadepth/, though
>>> > derfinder::loadCoverage uses rtracklayer::import(as = "RleList") for
>>> > efficiency
>>> > https://github.com/lcolladotor/derfinder/blob/f9cd986e0c1b9ea6551d0d8d2077d4501216a661/R/loadCoverage.R#L401
>>> > and lots of functions in that package were built for that structure
>>> > (RleList objects). I likely missed other alternatives.
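
For readers unfamiliar with why a run-length encoded structure is efficient
for coverage data, here is a minimal sketch of the idea. It is written in
Python for illustration only and is not how Rle/RleList objects are
implemented in Bioconductor; rle_encode and rle_decode are names made up for
this example. Per-base coverage tends to contain long runs of identical
values (often zeros), so storing (value, run length) pairs can be far more
compact than storing one number per base.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def rle_encode(coverage: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (values, lengths) describing consecutive runs in coverage."""
    values: List[int] = []
    lengths: List[int] = []
    for value, run in groupby(coverage):
        values.append(value)
        lengths.append(sum(1 for _ in run))
    return values, lengths


def rle_decode(values: Sequence[int], lengths: Sequence[int]) -> List[int]:
    """Expand (values, lengths) back into the per-base coverage vector."""
    decoded: List[int] = []
    for value, length in zip(values, lengths):
        decoded.extend([value] * length)
    return decoded


if __name__ == "__main__":
    coverage = [0, 0, 0, 5, 5, 2, 0, 0]  # made-up per-base coverage
    values, lengths = rle_encode(coverage)
    print(values, lengths)
    # Round trip recovers the original vector.
    print(rle_decode(values, lengths) == coverage)
```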
>>> >
>>> >
>>> > My current line of thought is to keep implementing workarounds using
>>> > local data (sometimes with pre-computed results) for recount,
>>> > recountWorkflow, and brainflowprobes (derfinder only has tests with
>>> > local bigWig files) without really altering the internals of those
>>> > packages. That is, assume that the remote BigWig file access via
>>> > rtracklayer will indefinitely be suspended, though it could be
>>> > supported again at some point and when it does, those packages will
>>> > work again with remote BigWig files as if nothing ever happened. But I
>>> > wanted to check in if this is what others who use BigWig files are
>>> > thinking of doing.
>>> >
>>> > Thanks!
>>> >
>>> > Best,
>>> > Leo
>>> >
>>> >
>>> > Leonardo Collado Torres, Ph. D.
>>> > Investigator, LIEBER INSTITUTE for BRAIN DEVELOPMENT
>>> > Assistant Professor, Department of Biostatistics
>>> > Johns Hopkins Bloomberg School of Public Health
>>> > 855 N. Wolfe St., Room 382
>>> > Baltimore, MD 21205
>>> > lcolladotor.github.io
>>> > lcolladotor using gmail.com
>>> >
>>>
>>> _______________________________________________
>>> Bioc-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> The information in this email is intended only for the person to whom it
>> is addressed. If you believe this e-mail was sent to you in error and
>> the email contains patient information, please contact the Partners
>> Compliance HelpLine at http://www.partners.org/complianceline . If the
>> email was sent to you in error but does not contain patient information,
>> please contact the sender and properly dispose of the email.
>
>