[Bioc-devel] Remote BigWig file access

Leonardo Collado Torres lcolladotor at gmail.com
Tue May 21 20:40:44 CEST 2024


Hi Bioc-devel,

As some of you are aware, rtracklayer::import() has long supported
importing BigWig files. Those files can be shared on servers and
accessed remotely thanks to all the effort from many of you in
building and maintaining rtracklayer.

From my side, derfinder::loadCoverage() relies on
rtracklayer::import.bw(), and recount::expressed_regions() +
recount::coverage_matrix() use derfinder::loadCoverage().
recountWorkflow showcases those recount functions on larger datasets.
brainflowprobes by Amanda Price, Nina Rajpurohit and others also ends
up relying on rtracklayer::import.bw() through these functions.

At https://github.com/lawremi/rtracklayer/issues/83 I initially
reported some issues once our recount2/3 data host changed, but
previously Brian Schilder also reported that one could no longer read
remote files https://github.com/lawremi/rtracklayer/issues/73.
https://github.com/lawremi/rtracklayer/issues/63 and/or
https://github.com/lawremi/rtracklayer/issues/65 might have been
related.

Yesterday I updated
https://github.com/lawremi/rtracklayer/issues/83#issuecomment-2121313270
with a comment showing some small reproducible code, and that the
workaround of downloading the data first, then using
rtracklayer::import() on the local data does work. However, this
workaround does involve a lot of, hmm, wasteful data transfer.
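For reference, the download-first workaround can be sketched roughly
as below. This is a minimal illustration, not the code from the GitHub
comment: cached_copy(), its cache_dir and fetch arguments, and the
example.org URL are all hypothetical names introduced here.

```r
# Download a remote file once and reuse the local copy afterwards.
# `cached_copy`, `cache_dir` and `fetch` are illustrative names, not part
# of rtracklayer, derfinder or recount.
cached_copy <- function(url, cache_dir = tempdir(),
                        fetch = utils::download.file) {
    dest <- file.path(cache_dir, basename(url))
    if (!file.exists(dest)) {
        # mode = "wb" keeps the binary BigWig intact on Windows
        fetch(url, destfile = dest, mode = "wb")
    }
    dest
}

# Usage (placeholder URL; needs rtracklayer and network access):
# local_bw <- cached_copy("https://example.org/some_sample.bw")
# cov <- rtracklayer::import(local_bw, as = "RleList")
```

For multi-GB files, R's default download timeout of 60 seconds will
probably need raising first, e.g. with options(timeout = 3600).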

In the recount vignette, at one point I access just chrY of a bigWig
file that is about 1300 MB. In the recountWorkflow vignette I do
something similar for a 7 GB bigWig file. Previously, accessing just
chrY on these files required only a small data transfer.
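To make the access pattern concrete, a chrY-only import looks roughly
like the sketch below. It runs on a toy two-chromosome BigWig written
to a temporary file, so the scores and chromosome lengths are made up
for illustration and are not recount data. With a real file, only the
last three statements would apply.

```r
library(rtracklayer)
library(GenomicRanges)

# Toy coverage on two chromosomes (made-up values, not recount data)
toy <- GRanges(
    seqnames = c("chrX", "chrY"),
    ranges = IRanges(start = c(1, 11), end = c(10, 20)),
    score = c(1, 5),
    seqlengths = c(chrX = 100, chrY = 50)
)
bw_path <- tempfile(fileext = ".bw")
export(toy, bw_path, format = "BigWig")

# Read the chrY length from the file header, then import only chrY
chr_len <- seqlengths(seqinfo(BigWigFile(bw_path)))[["chrY"]]
chrY <- GRanges("chrY", IRanges(1, chr_len))
cov_chrY <- import(bw_path, which = chrY, as = "RleList")
```

On a local file this reads only the requested region from disk; the
same call against a remote URL is what used to keep the transfer
small.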

On recountWorkflow version 1.29.2
https://github.com/LieberInstitute/recountWorkflow, I've included
pre-computed results (~2 MB) to avoid downloading tons of data, though
the vignette code shows how to actually fully reproduce the results if
you don't mind downloading those large files. I also implemented some
workarounds on recount, though I haven't yet gone the full route of
including pre-computed results. I have yet to try implementing a
workaround for brainflowprobes.



My understanding is that the root issues lie outside rtracklayer:
changes in rtracklayer's dependencies have likely created these
problems. Resolving them is not always within the control of the
rtracklayer authors, and they also create an unexpected burden on them.

If one considers alternatives to rtracklayer, I see that there's a new
package https://github.com/PoisonAlien/trackplot/tree/master that uses
bwtool (a system dependency), an older alternative
https://github.com/andrelmartins/bigWig that hasn't had updates in 4
years, and a CRAN package
(https://cran.r-project.org/web/packages/wig/readme/README.html) that
recommends using rtracklayer for larger files. I guess that I could
also try using megadepth https://research.libd.org/megadepth/, though
derfinder::loadCoverage uses rtracklayer::import(as = "RleList") for
efficiency https://github.com/lcolladotor/derfinder/blob/f9cd986e0c1b9ea6551d0d8d2077d4501216a661/R/loadCoverage.R#L401
and lots of functions in that package were built for that structure
(RleList objects). I likely missed other alternatives.


My current line of thought is to keep implementing workarounds using
local data (sometimes with pre-computed results) for recount,
recountWorkflow, and brainflowprobes (derfinder only has tests with
local bigWig files) without really altering the internals of those
packages. That is, assume that remote BigWig file access via
rtracklayer is suspended indefinitely, though it could be supported
again at some point; when it is, those packages will work again with
remote BigWig files as if nothing ever happened. But I wanted to check
whether this is what others who use BigWig files are thinking of
doing.

Thanks!

Best,
Leo


Leonardo Collado Torres, Ph. D.
Investigator, LIEBER INSTITUTE for BRAIN DEVELOPMENT
Assistant Professor, Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health
855 N. Wolfe St., Room 382
Baltimore, MD 21205
lcolladotor.github.io
lcolladotor at gmail.com


