[Bioc-devel] GenomicFiles: chunking
Michael Love
michaelisaiahlove at gmail.com
Mon Sep 29 16:18:49 CEST 2014
On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
> I don't fully understand "the use case for reducing by range is when the
> entire dataset won't fit into memory". The basic assumption of these
> functions (as far as I can see) is that the output data fits in memory.
> What may not fit in memory is various earlier "iterations" of the data. For
> example, in my use case, if I just read in all the data in all the ranges in
> all the samples it is basically Rle's across 450MB times 38 files, which is
> not small. What fits in memory is smaller chunks of this; that is true for
> every application.
>
I was unclear. I meant that, in approach1, you have an object,
all.Rle, which contains Rles for every range over every file. Can you
actually run this approach on the full dataset?
> Reducing by range (or file) only makes sense when the final output includes
> one entity for several ranges/files ... right? So I don't see how reduce
> would help me.
>
Yes, I think we agree. This is not a good use case for reduce by range
as now implemented.
This is a use case which would benefit from the user-facing function
internally calling pack()/unpack() to reduce the number of import()
calls, and then in the end giving back the mean coverage over the
input ranges. I want this too.
https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456
(link to the old github repo, the new github repo is named GenomicFiles)
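The bookkeeping that pack()/unpack() has to do can be illustrated with a base-R toy. This is not the real GenomicFiles API (which operates on GRanges); it only shows the reorder-then-restore round trip:

```r
## Toy illustration of the pack()/unpack() bookkeeping, base R only.
## "pack" reorders the query ranges for efficient sequential file
## access; "unpack" restores the user's original ordering afterwards.
starts <- c(5000L, 100L, 3000L)    # query range starts, in user order
ord    <- order(starts)            # pack: sort for sequential access
packed <- starts[ord]              # 100, 3000, 5000

mapped <- packed * 2L              # stand-in for MAP run in packed order

unpacked <- mapped[order(ord)]     # unpack: back to user order
stopifnot(identical(unpacked, starts * 2L))
```

The point is that MAP can run in whatever order is cheapest for the file, as long as order(ord) is kept around to put the results back.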
> As I see the pack()/unpack() paradigm, it just re-orders the query ranges
> (which is super nice and matters a lot for speed for some applications). As
> I understand the code (and my understanding is developing) we need an extra
> layer to support processing multiple ranges in one operation.
>
> I am happy to help apart from complaining.
>
> Best,
> Kasper
>
> On Mon, Sep 29, 2014 at 8:55 AM, Michael Love <michaelisaiahlove at gmail.com>
> wrote:
>>
>> Thanks for checking it out and benchmarking. We should be more clear
>> in the docs that the use case for reducing by range is when the entire
>> dataset won't fit into memory. Also, we had some discussion and
>> Valerie had written up methods for packing up the ranges supplied by
>> the user into a better form for querying files. In your case it would
>> have packed many ranges together, to reduce the number of import()
>> calls like your naive approach. See the pack/unpack functions, which
>> are not in the vignette but are in the man pages. If I remember,
>> writing code to unpack() the result was not so simple, and development
>> of these functions was set aside for the moment.
>>
>> Mike
>>
>> On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
>> <kasperdanielhansen at gmail.com> wrote:
>> >
>> > I am testing GenomicFiles.
>> >
>> > My use case: I have 231k ranges of average width 1.9kb and total width
>> > 442 MB. I also have 38 BigWig files. I want to compute the average
>> > coverage of the 38 BigWig files inside each range. This is similar to
>> > wanting to get coverage of - say - all promoters in the human genome.
>> >
>> > My data is residing on a file server which is connected to the compute
>> > node through ethernet, so basically I have very slow file access. Also,
>> > the BigWig files are (perhaps unusually) big: ~2GB / piece.
>> >
>> > Below I have two approaches, one using basically straightforward code
>> > and the other using GenomicFiles (examples are 10 / 100 ranges on 5
>> > files). Basically GenomicFiles is 10-20x slower than the
>> > straightforward approach. This is most likely because
>> > reduceByRange/reduceByFile processes each (range, file) pair
>> > separately.
>> >
>> > It seems natural (to me) to allow some chunking of the mapping of the
>> > ranges. My naive approach is fast (I assume) because I read multiple
>> > ranges through one import() command. I know I have random access to
>> > the BigWig files, but there must still be some overhead of seeking
>> > and, perhaps more importantly, of instantiating/moving stuff back and
>> > forth to R. So basically I would like to be able to write a MAP
>> > function which takes
>> >     ranges, file
>> > instead of just
>> >     range, file
>> > and then chunk over, say, 1,000s of ranges. I could then have an
>> > argument to reduceByXX called something like rangeSize, which is akin
>> > to yieldSize.
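A minimal sketch of that chunking, with plain indices standing in for GRanges and a trivial MAP; the rangeSize argument here is hypothetical, mirroring the proposal above:

```r
## Sketch of chunked mapping: split the n query ranges into blocks of
## rangeSize, hand each whole block to MAP in one call, and reassemble
## the per-range results in the original order. In real use MAP would
## do one import() per block instead of one per range.
chunkedMap <- function(n, rangeSize, MAP) {
    idx <- split(seq_len(n), ceiling(seq_len(n) / rangeSize))
    unlist(lapply(idx, MAP), use.names = FALSE)
}

## A MAP that just echoes the indices it received:
chunkedMap(5, 2, function(ranges) ranges)   # blocks {1,2}, {3,4}, {5}
```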
>> >
>> > Perhaps this is what is intended for the reduceByYield on BigWig files?
>> >
>> > In a way, this is what is done in the vignette with the
>> > coverage(BAMFILE) example, where tileGenome is essentially constructed
>> > by the user to chunk the coverage computation.
>> >
>> > I think the example of a set of regions I am querying on the files
>> > will be an extremely common use case going forward. The raw data for
>> > all the regions together is "too big", but I do a computation on each
>> > region to reduce the size. In this situation, all the focus is on the
>> > MAP step. I see the need for REDUCE in the case of the t-test example
>> > in the vignette, where the return object is a single "thing" for each
>> > base. But in general, I think we will use these file structures a lot
>> > to construct things without REDUCE (across either files or ranges).
>> >
>> > Also, something completely different: it seems like it would be
>> > convenient for stuff like BigWigFileViews to not have to actually
>> > parse the file in the MAP step. Somehow I would envision some kind of
>> > reading function, stored inside the object, which just returns an Rle
>> > when I ask for a (range, file). Perhaps this is better left for later.
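The "reading function stored inside the object" idea could look something like this closure-based toy (no Bioconductor types; the reader and its return value are purely illustrative):

```r
## Toy "views" object carrying a deferred reader: nothing is parsed
## until the reader is called with a (range, file) pair.
makeViews <- function(reader) list(reader = reader)

v <- makeViews(function(range, file)
    paste0("Rle<", file, ":", range, ">"))  # pretend this imports an Rle

v$reader("chr1:1-100", "a.bw")
```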
>> >
>> > Best,
>> > Kasper
>> >
>> > Examples
>> >
>> > approach1 <- function(ranges, files) {
>> >     ## By hand
>> >     all.Rle <- lapply(files, function(file) {
>> >         rle <- import(file, as = "Rle", which = ranges, format = "bw")[ranges]
>> >         rle
>> >     })
>> >     print(object.size(all.Rle), units = "auto")
>> >     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>> >         sapply(xx, mean)
>> >     }))
>> >     invisible(mat)
>> > }
>> >
>> > system.time({
>> >     mat1 <- approach1(all.grs[1:10], all.files[1:5])
>> > })
>> >
>> > 160.9 Kb
>> >    user  system elapsed
>> >   1.109   0.001   1.113
>> >
>> > system.time({
>> >     mat1 <- approach1(all.grs[1:100], all.files[1:5])
>> > })  # less than 4x slower than previous call
>> >
>> > 3 Mb
>> >    user  system elapsed
>> >   4.101   0.019   4.291
>> >
>> >
>> > approach2 <- function(ranges, files) {
>> >     gf <- GenomicFiles(rowData = ranges, files = files)
>> >     MAPPER <- function(range, file, ...) {
>> >         library(rtracklayer)
>> >         rle <- import(file, which = range, as = "Rle", format = "bw")[range]
>> >         mean(rle)
>> >     }
>> >     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>> >                           BPPARAM = SerialParam())
>> >     sExp
>> > }
>> >
>> > system.time({
>> >     mat2 <- approach2(all.grs[1:10], all.files[1:5])
>> > })  # 8-9x slower than approach1
>> >
>> >    user  system elapsed
>> >   9.349   0.280   9.581
>> >
>> > system.time({
>> >     mat2 <- approach2(all.grs[1:100], all.files[1:5])
>> > })  # 9x slower than previous call, 20x slower than approach1 on same
>> > input
>> >
>> >    user  system elapsed
>> >  89.310   0.627  91.044
>> >
>> >
>> > _______________________________________________
>> > Bioc-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>