[Bioc-devel] GenomicFiles: chunking

Valerie Obenchain vobencha at fhcrc.org
Tue Sep 30 01:09:18 CEST 2014


Hi Kasper,

The reduceBy* functions were intended to combine data across ranges or 
files. If we have 10 ranges and 3 files, you can think of it as a 10 x 3 
grid where we'll have 30 queries and therefore 30 pieces of information 
that can be combined across either dimension.
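
For concreteness, here is a minimal sketch of the two reduction 
dimensions (the BigWig file names are placeholders and the toy MAP / 
REDUCE are purely illustrative; constructor arguments follow the usage 
in Kasper's example below):

library(GenomicFiles)
library(rtracklayer)

## 10 ranges x 3 files: each cell of the grid is one MAP call
gr <- GRanges("chr1", IRanges((1:10) * 10000L, width = 5000L))
fls <- c("a.bw", "b.bw", "c.bw")          # placeholder BigWig files
gf <- GenomicFiles(rowData = gr, files = fls)

MAP <- function(range, file, ...)
    ## one (range, file) query -> one piece of information
    sum(width(import(file, which = range, format = "bw")))

## collapse the 3 files within each range ...
byRange <- reduceByRange(gf, MAP = MAP, REDUCE = sum, BPPARAM = SerialParam())
## ... or collapse the 10 ranges within each file
byFile <- reduceByFile(gf, MAP = MAP, REDUCE = sum, BPPARAM = SerialParam())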

The request to 'process ranges in one operation' is a good suggestion 
and, as you say, may even be the more common use case.

Some action items and explanations:

1)  treat ranges / files as a group

I'll add functions to treat all ranges / files as a group; essentially 
no REDUCER other than concatenation. Chunking (optional) will occur as 
defined by 'yieldSize' in the BamFile.
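
Until that is in place, here is a rough sketch of the intended behaviour 
(illustrative only, not a final API; 'ranges' is assumed to be a GRanges 
and 'files' a vector of BigWig paths, with no chunking shown):

library(rtracklayer)
library(BiocParallel)

## All ranges are handed to MAP as one group; the only 'REDUCE' across
## files is concatenation of the per-file results into a matrix.
MAP <- function(ranges, file, ...) {
    ## a single import() per file covering every range
    cov <- import(file, which = ranges, as = "Rle", format = "bw")[ranges]
    mean(cov)                         # one value per range
}

res <- bplapply(files, function(file) MAP(ranges, file),
                BPPARAM = SerialParam())
mat <- do.call(cbind, res)            # ranges x files matrix of mean coverage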

2)  class clean-up

The GenomicFileViews class was the first iteration. It was overly 
complicated and the complexity didn't gain us much. In my mind the 
GenomicFiles class is a stripped-down version that should replace the 
*FileViews classes; I plan to deprecate the *FileViews classes.

Ideally the class(es) in GenomicFiles would inherit from 
SummarizedExperiment. A stand-alone package for SummarizedExperiment is 
in the works for the near future. Until those plans are finalized I'd 
rather not change the inheritance structure in GenomicFiles.

3) pack / unpack

I experimented with this and came to the conclusion that packing / 
unpacking should be left to the user rather than done behind the scenes 
by reduceBy*. The primary difficulty is that you don't have 
pre-knowledge of the output format of the user's MAPPER. If MAPPER 
outputs an Rle, unpacking may be straightforward. If MAPPER outputs a 
NumericList, matrix, or vector with no genomic coordinates then things 
are more complicated.
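
As a purely schematic illustration of the issue (this is not the current 
pack() / unpack() API; 'ranges' is assumed to be a GRanges and 'file' a 
BigWig path):

library(rtracklayer)

## "pack": reorder the query ranges so nearby ranges are read together
o <- order(ranges)
packed <- ranges[o]
rle <- import(file, which = packed, as = "Rle", format = "bw")[packed]

## Here MAPPER's output is positional (one number per packed range), so
## "unpack" is just the inverse permutation back to the original order ...
vals <- mean(rle)
unpacked <- vals[order(o)]

## ... but if MAPPER returned, say, an unnamed NumericList or a bare Rle
## carrying no genomic coordinates, there is no generic way to undo the
## packing on its behalf.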

I'm open to suggestions / prototypes from others.

4) reduceByYield

This was intended for chunking through ranges in a single file. You can 
imagine using bplapply() over files where each file is chunked through 
with reduceByYield(). All ranges are chunked by 'yieldSize' defined in 
the BamFile unless a 'param' dictates a subset of ranges.
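
For example, following the pattern in the reduceByYield man page (the 
BAM paths are placeholders; each worker chunks through one file and sums 
the per-chunk coverage):

library(GenomicFiles)
library(Rsamtools)
library(GenomicAlignments)
library(BiocParallel)

bamFiles <- c("sample1.bam", "sample2.bam")   # placeholder paths

summarizeOneFile <- function(fl) {
    bf <- BamFile(fl, yieldSize = 100000)
    YIELD  <- function(x) readGAlignments(x)  # next 'yieldSize' records
    MAP    <- function(value) coverage(value) # per-chunk summary
    REDUCE <- `+`                             # combine chunk summaries
    reduceByYield(bf, YIELD, MAP, REDUCE)
}

## bplapply() over files; each file is chunked through with reduceByYield()
covList <- bplapply(bamFiles, summarizeOneFile, BPPARAM = SerialParam())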

5) additional tidy

I'll add a name to the 'assays' when summarize=TRUE, make sure return 
types are consistent when summarize=FALSE, etc.

Thanks for the testing and feedback.


Valerie

On 09/29/2014 07:18 AM, Michael Love wrote:
> On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>> I don't fully understand "the use case for reducing by range is when the
>> entire dataset won't fit into memory".  The basic assumption of these
>> functions (as far as I can see) is that the output data fits in memory.
>> What may not fit in memory is various earlier "iterations" of the data.  For
>> example, in my use case, if I just read in all the data in all the ranges in
>> all the samples it is basically Rle's across 450MB times 38 files, which is
>> not small.  What fits in memory is smaller chunks of this; that is true for
>> every application.
>>
>
> I was unclear. I meant that, in approach1, you have an object,
> all.Rle, which contains Rles for every range over every file. Can you
> actually run this approach on the full dataset?
>
>> Reducing by range (or file) only makes sense when the final output includes
>> one entity for several ranges/files ... right?  So I don't see how reduce
>> would help me.
>>
>
> Yes, I think we agree. This is not a good use case for reduce by range
> as now implemented.
>
> This is a use case which would benefit from the user-facing function
> internally calling pack()/unpack() to reduce the number of import()
> calls, and then in the end giving back the mean coverage over the
> input ranges. I want this too.
>
> https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456
> (link to the old github repo, the new github repo is named GenomicFiles)
>
>> As I see the pack()/unpack() paradigm, it just re-orders the query ranges
>> (which is super nice and matters a lot for speed for some applications).  As
>> I understand the code (and my understanding is developing) we need an extra
>> layer to support processing multiple ranges in one operation.
>>
>> I am happy to help apart from complaining.
>>
>> Best,
>> Kasper
>>
>> On Mon, Sep 29, 2014 at 8:55 AM, Michael Love <michaelisaiahlove at gmail.com>
>> wrote:
>>>
>>> Thanks for checking it out and benchmarking. We should be more clear
>>> in the docs that the use case for reducing by range is when the entire
>>> dataset won't fit into memory. Also, we had some discussion and
>>> Valerie had written up methods for packing up the ranges supplied by
>>> the user into a better form for querying files. In your case it would
>>> have packed many ranges together, to reduce the number of import()
>>> calls like your naive approach. See the pack/unpack functions, which
>>> are not in the vignette but are in the man pages. If I remember,
>>> writing code to unpack() the result was not so simple, and development
>>> of these functions was set aside for the moment.
>>>
>>> Mike
>>>
>>> On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
>>> <kasperdanielhansen at gmail.com> wrote:
>>>>
>>>> I am testing GenomicFiles.
>>>>
>>>> My use case: I have 231k ranges of average width 1.9kb and total width
>>>> 442 MB.  I also have 38 BigWig files.  I want to compute the average
>>>> coverage of the 38 BigWig files inside each range.  This is similar to
>>>> wanting to get coverage of - say - all promoters in the human genome.
>>>>
>>>> My data is residing on a file server which is connected to the compute
>>>> node through ethernet, so basically I have very slow file access.  Also,
>>>> the BigWig files are (perhaps unusually) big: ~2GB / piece.
>>>>
>>>> Below I have two approaches, one using basically straightforward code and
>>>> the other using GenomicFiles (examples are 10 / 100 ranges on 5 files).
>>>> Basically GenomicFiles is 10-20x slower than the straightforward approach.
>>>> This is most likely because reduceByRange/reduceByFile processes each
>>>> (range, file) separately.
>>>>
>>>> It seems natural (to me) to allow some chunking of the mapping of the
>>>> ranges.  My naive approach is fast (I assume) because I read multiple
>>>> ranges through one import() command.  I know I have random access to the
>>>> BigWig files, but there must still be some overhead of seeking and,
>>>> perhaps more importantly, instantiating/moving stuff back and forth to R.
>>>> So basically I would like to be able to write a MAP function which takes
>>>>    ranges, file
>>>> instead of just
>>>>    range, file
>>>> and then chunk over say 1,000s of ranges.  I could then have an argument
>>>> to reduceByXX called something like rangeSize, which is kind of like
>>>> yieldSize.
>>>>
>>>> Perhaps this is what is intended for the reduceByYield on BigWig files?
>>>>
>>>> In a way, this is what is done in the vignette with the coverage(BAMFILE)
>>>> example where tileGenome is essentially constructed by the user to chunk
>>>> the coverage computation.
>>>>
>>>> I think the example of a set of regions I am querying on the files will
>>>> be an extremely common use case going forward.  The raw data for all the
>>>> regions together is "too big" but I do a computation on each region to
>>>> reduce the size.  In this situation, all the focus is on the MAP step.
>>>> I see the need for REDUCE in the case of the t-test example in the
>>>> vignette, where the return object is a single "thing" for each base.
>>>> But in general, I think we will use these file structures a lot to
>>>> construct things without REDUCE (across either files or ranges).
>>>>
>>>> Also, on something completely different: it seems like it would be
>>>> convenient for stuff like BigWigFileViews to not have to actually parse
>>>> the file in the MAP step.  Somehow I would envision some kind of reading
>>>> function, stored inside the object, which just returns an Rle when I ask
>>>> for a (range, file).  Perhaps this is better left for later.
>>>>
>>>> Best,
>>>> Kasper
>>>>
>>>> Examples
>>>>
>>>> approach1 <- function(ranges, files) {
>>>>      ## By hand
>>>>      all.Rle <- lapply(files, function(file) {
>>>>          rle <- import(file, as = "Rle", which = ranges,
>>>>                        format = "bw")[ranges]
>>>>          rle
>>>>      })
>>>>      print(object.size(all.Rle), units = "auto")
>>>>      mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>>>>          sapply(xx, mean)
>>>>      }))
>>>>      invisible(mat)
>>>> }
>>>>
>>>> system.time({
>>>> mat1 <- approach1(all.grs[1:10], all.files[1:5])
>>>> })
>>>>
>>>> 160.9 Kb
>>>>      user  system elapsed
>>>>     1.109   0.001   1.113
>>>>
>>>> system.time({
>>>> mat1 <- approach1(all.grs[1:100], all.files[1:5])
>>>> }) # less than 4x slower than previous call
>>>>
>>>> 3 Mb
>>>>      user  system elapsed
>>>>     4.101   0.019   4.291
>>>>
>>>>
>>>> approach2 <- function(ranges, files) {
>>>>      gf <- GenomicFiles(rowData = ranges, files = files)
>>>>      MAPPER <- function(range, file, ...) {
>>>>          library(rtracklayer)
>>>>          rle <- import(file, which = range, as = "Rle",
>>>>                        format = "bw")[range]
>>>>          mean(rle)
>>>>      }
>>>>      sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>>>>                            BPPARAM = SerialParam())
>>>>      sExp
>>>> }
>>>>
>>>> system.time({
>>>> mat2 <- approach2(all.grs[1:10], all.files[1:5])
>>>> })  # 8-9x slower than approach1
>>>>
>>>>     user  system elapsed
>>>>     9.349   0.280   9.581
>>>>
>>>> system.time({
>>>> mat2 <- approach2(all.grs[1:100], all.files[1:5])
>>>> })  # 9x slower than previous call, 20x slower than approach1 on same input
>>>>
>>>>     user  system elapsed
>>>>    89.310   0.627  91.044
>>>>
>>
>>
>
>


