[Bioc-devel] GenomicFiles: chunking

Mon Sep 29 14:55:22 CEST 2014

Thanks for checking it out and benchmarking. We should be more clear
in the docs that the use case for reducing by range is when the entire
dataset won't fit into memory. Also, we had some discussion and
Valerie had written up methods for packing up the ranges supplied by
the user into a better form for querying files. In your case it would
have packed many ranges together, to reduce the number of import()
calls like your naive approach. See the pack/unpack functions, which
are not in the vignette but are in the man pages. If I remember,
writing code to unpack() the result was not so simple, and development
of these functions was set aside for the moment.

Mike

On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
>
> I am testing GenomicFiles.
>
> My use case: I have 231k ranges of average width 1.9kb and total width 442
> MB.  I also have 38 BigWig files.  I want to compute the average coverage
> of the 38 BigWig files inside each range.  This is similar to wanting to
> get coverage of - say - all promoters in the human genome.
>
> My data is residing on a file server which is connected to the compute node
> through ethernet, so basically I have very slow file access.  Also. the
> BigWig files are (perhaps unusually) big: ~2GB / piece.
>
> Below I have two approaches, one using basically straightforward code and
> the other using GenomicFiles (examples are 10 / 100 ranges on 5 files).
> Basically GenomicFiles is 10-20x slower than the straightforward approach.
> This is most likely because reduceByRange/reduceByFile processes each
> (range, file) separately.
>
> It seems naturally (to me) to allow some chunking of the mapping of the
> ranges.  My naive approach is fast (I assume) because I read multiple
> ranges through one import() command.  I know I have random access to the
> BigWig files, but there must still be some overhead of seeking and perhaps
> more importantly instantiating/moving stuff back and forth to R.  So
> basically I would like to be able to write a MAP function which takes
>   ranges, file
> instead of just
>   range, file
> And then chunk over say 1,000s of ranges.  I could then have an argument to
> reduceByXX called something like rangeSize, which is kind of yieldSize.
>
> Perhaps this is what is intended for the reduceByYield on BigWig files?
>
> In a way, this is what is done in the vignette with the coverage(BAMFILE)
> example where tileGenome is essentially constructed by the user to chunk
> the coverage computation.
>
> I think the example of a set of regions I am querying on the files, will be
> an extremely common usecase going forward. The raw data for all the regions
> together is "too big" but I do a computation on each region to reduce the
> size.  In this situation, all the focus is on the MAP step.  I see the need
> for REDUCE in the case of the t-test example in the vignette, where the
> return object is a single "thing" for each base.  But in general, I think
> we will use these file structures a lot to construct things without REDUCE
> (across neither files nor ranges).
>
> Also, something completely different, it seems like it would be convenient
> for stuff like BigWigFileViews to not have to actually parse the file in
> the MAP step.  Somehow I would envision some kind of reading function,
> stored inside the object, which just returns an Rle when I ask for a
> (range, file).  Perhaps this is better left for later.
>
> Best,
> Kasper
>
> Examples
>
> approach1 <- function(ranges, files) {
>     ## By hand
>     all.Rle <- lapply(files, function(file) {
>         rle <- import(file, as = "Rle", which = ranges, format =
> "bw")[ranges]
>         rle
>     })
>     print(object.size(all.Rle), units = "auto")
>     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>         sapply(xx, mean)
>     }))
>     invisible(mat)
> }
>
> system.time({
> mat1 <- approach1(all.grs[1:10], all.files[1:5])
> })
>
> 160.9 Kb
>     user  system elapsed
>    1.109   0.001   1.113
>
> system.time({
> mat1 <- approach1(all.grs[1:100], all.files[1:5])
> }) # less than 4x slower than previous call
>
> 3 Mb
>     user  system elapsed
>    4.101   0.019   4.291
>
>
> approach2 <- function(ranges, files) {
>     gf <- GenomicFiles(rowData = ranges, files = files)
>     MAPPER <- function(range, file, ....) {
>         library(rtracklayer)
>         rle <- import(file, which = range, as = "Rle", format = "bw")[range]
>         mean(rle)
>     }
>     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE, BPPARAM =
> SerialParam())
>     sExp
> }
>
> system.time({
> mat2 <- approach2(all.grs[1:10], all.files[1:5])
> })  # 8-9x slower than approach1
>
>    user  system elapsed
>    9.349   0.280   9.581
>
> system.time({
> mat2 <- approach2(all.grs[1:100], all.files[1:5])
> })  # 9x slower than previous call, 20x slow than approach1 on same input
>
>    user  system elapsed
>   89.310   0.627  91.044
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel