[Bioc-devel] GenomicFiles: chunking
Michael Love
michaelisaiahlove at gmail.com
Mon Sep 29 16:18:49 CEST 2014
On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
> I don't fully understand "the use case for reducing by range is when the
> entire dataset won't fit into memory". The basic assumption of these
> functions (as far as I can see) is that the output data fits in memory.
> What may not fit in memory is various earlier "iterations" of the data. For
> example, in my use case, if I just read in all the data in all the ranges in
> all the samples it is basically Rle's across 450MB times 38 files, which is
> not small. What fits in memory is smaller chunks of this; that is true for
> every application.
>
I was unclear. I meant that, in approach1, you have an object,
all.Rle, which contains Rles for every range over every file. Can you
actually run this approach on the full dataset?
> Reducing by range (or file) only makes sense when the final output includes
> one entity for several ranges/files ... right? So I don't see how reduce
> would help me.
>
Yes, I think we agree. This is not a good use case for reduce by range
as now implemented.
This is a use case which would benefit from the user-facing function
internally calling pack()/unpack() to reduce the number of import()
calls, and then in the end giving back the mean coverage over the
input ranges. I want this too.
https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456
(link to the old github repo, the new github repo is named GenomicFiles)
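The bookkeeping that pack()/unpack() has to do can be illustrated with a base-R toy. This is not the real GenomicFiles API (which operates on GRanges); it only shows the reorder-then-restore round trip:

```r
## Toy illustration of the pack()/unpack() bookkeeping, base R only.
## "pack" reorders the query ranges for efficient sequential file
## access; "unpack" restores the user's original ordering afterwards.
starts <- c(5000L, 100L, 3000L)    # query range starts, in user order
ord    <- order(starts)            # pack: sort for sequential access
packed <- starts[ord]              # 100, 3000, 5000

mapped <- packed * 2L              # stand-in for MAP run in packed order

unpacked <- mapped[order(ord)]     # unpack: back to user order
stopifnot(identical(unpacked, starts * 2L))
```

The point is that MAP can run in whatever order is cheapest for the file, as long as order(ord) is kept around to put the results back.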
> As I see the pack()/unpack() paradigm, it just re-orders the query ranges
> (which is super nice and matters a lot for speed for some applications). As
> I understand the code (and my understanding is developing) we need an extra
> layer to support processing multiple ranges in one operation.
>
> I am happy to help apart from complaining.
>
> Best,
> Kasper
>
> On Mon, Sep 29, 2014 at 8:55 AM, Michael Love <michaelisaiahlove at gmail.com>
> wrote:
>>
>> Thanks for checking it out and benchmarking. We should be more clear
>> in the docs that the use case for reducing by range is when the entire
>> dataset won't fit into memory. Also, we had some discussion and
>> Valerie had written up methods for packing up the ranges supplied by
>> the user into a better form for querying files. In your case it would
>> have packed many ranges together, to reduce the number of import()
>> calls like your naive approach. See the pack/unpack functions, which
>> are not in the vignette but are in the man pages. If I remember,
>> writing code to unpack() the result was not so simple, and development
>> of these functions was set aside for the moment.
>>
>> Mike
>>
>> On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
>> <kasperdanielhansen at gmail.com> wrote:
>> >
>> > I am testing GenomicFiles.
>> >
>> > My use case: I have 231k ranges of average width 1.9kb and total width
>> > 442 MB. I also have 38 BigWig files. I want to compute the average
>> > coverage of the 38 BigWig files inside each range. This is similar to
>> > wanting to get coverage of - say - all promoters in the human genome.
>> >
>> > My data is residing on a file server which is connected to the compute
>> > node through ethernet, so basically I have very slow file access. Also,
>> > the BigWig files are (perhaps unusually) big: ~2GB / piece.
>> >
>> > Below I have two approaches, one using basically straightforward code
>> > and the other using GenomicFiles (examples are 10 / 100 ranges on 5
>> > files). Basically GenomicFiles is 10-20x slower than the
>> > straightforward approach. This is most likely because
>> > reduceByRange/reduceByFile processes each (range, file) pair
>> > separately.
>> >
>> > It seems natural (to me) to allow some chunking of the mapping of the
>> > ranges. My naive approach is fast (I assume) because I read multiple
>> > ranges through one import() command. I know I have random access to
>> > the BigWig files, but there must still be some overhead of seeking
>> > and, perhaps more importantly, of instantiating/moving stuff back and
>> > forth to R. So basically I would like to be able to write a MAP
>> > function which takes
>> >     ranges, file
>> > instead of just
>> >     range, file
>> > and then chunk over, say, 1,000s of ranges. I could then have an
>> > argument to reduceByXX called something like rangeSize, which is akin
>> > to yieldSize.
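A minimal sketch of that chunking, with plain indices standing in for GRanges and a trivial MAP; the rangeSize argument here is hypothetical, mirroring the proposal above:

```r
## Sketch of chunked mapping: split the n query ranges into blocks of
## rangeSize, hand each whole block to MAP in one call, and reassemble
## the per-range results in the original order. In real use MAP would
## do one import() per block instead of one per range.
chunkedMap <- function(n, rangeSize, MAP) {
    idx <- split(seq_len(n), ceiling(seq_len(n) / rangeSize))
    unlist(lapply(idx, MAP), use.names = FALSE)
}

## A MAP that just echoes the indices it received:
chunkedMap(5, 2, function(ranges) ranges)   # blocks {1,2}, {3,4}, {5}
```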
>> >
>> > Perhaps this is what is intended for the reduceByYield on BigWig files?
>> >
>> > In a way, this is what is done in the vignette with the
>> > coverage(BAMFILE) example, where tileGenome is essentially constructed
>> > by the user to chunk the coverage computation.
>> >
>> > I think the example of a set of regions I am querying on the files
>> > will be an extremely common use case going forward. The raw data for
>> > all the regions together is "too big", but I do a computation on each
>> > region to reduce the size. In this situation, all the focus is on the
>> > MAP step. I see the need for REDUCE in the case of the t-test example
>> > in the vignette, where the return object is a single "thing" for each
>> > base. But in general, I think we will use these file structures a lot
>> > to construct things without REDUCE (across either files or ranges).
>> >
>> > Also, something completely different: it seems like it would be
>> > convenient for stuff like BigWigFileViews to not have to actually
>> > parse the file in the MAP step. Somehow I would envision some kind of
>> > reading function, stored inside the object, which just returns an Rle
>> > when I ask for a (range, file). Perhaps this is better left for later.
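The "reading function stored inside the object" idea could look something like this closure-based toy (no Bioconductor types; the reader and its return value are purely illustrative):

```r
## Toy "views" object carrying a deferred reader: nothing is parsed
## until the reader is called with a (range, file) pair.
makeViews <- function(reader) list(reader = reader)

v <- makeViews(function(range, file)
    paste0("Rle<", file, ":", range, ">"))  # pretend this imports an Rle

v$reader("chr1:1-100", "a.bw")
```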
>> >
>> > Best,
>> > Kasper
>> >
>> > Examples
>> >
>> > approach1 <- function(ranges, files) {
>> >     ## By hand
>> >     all.Rle <- lapply(files, function(file) {
>> >         rle <- import(file, as = "Rle", which = ranges, format = "bw")[ranges]
>> >         rle
>> >     })
>> >     print(object.size(all.Rle), units = "auto")
>> >     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>> >         sapply(xx, mean)
>> >     }))
>> >     invisible(mat)
>> > }
>> >
>> > system.time({
>> >     mat1 <- approach1(all.grs[1:10], all.files[1:5])
>> > })
>> >
>> > 160.9 Kb
>> >    user  system elapsed
>> >   1.109   0.001   1.113
>> >
>> > system.time({
>> >     mat1 <- approach1(all.grs[1:100], all.files[1:5])
>> > })  # less than 4x slower than previous call
>> >
>> > 3 Mb
>> >    user  system elapsed
>> >   4.101   0.019   4.291
>> >
>> >
>> > approach2 <- function(ranges, files) {
>> >     gf <- GenomicFiles(rowData = ranges, files = files)
>> >     MAPPER <- function(range, file, ...) {
>> >         library(rtracklayer)
>> >         rle <- import(file, which = range, as = "Rle", format = "bw")[range]
>> >         mean(rle)
>> >     }
>> >     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>> >                           BPPARAM = SerialParam())
>> >     sExp
>> > }
>> >
>> > system.time({
>> >     mat2 <- approach2(all.grs[1:10], all.files[1:5])
>> > })  # 8-9x slower than approach1
>> >
>> >    user  system elapsed
>> >   9.349   0.280   9.581
>> >
>> > system.time({
>> >     mat2 <- approach2(all.grs[1:100], all.files[1:5])
>> > })  # 9x slower than previous call, 20x slower than approach1 on same
>> > input
>> >
>> >    user  system elapsed
>> >  89.310   0.627  91.044
>> >
>> >
>> > _______________________________________________
>> > Bioc-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>