[Bioc-devel] GenomicFiles: chunking

Michael Love michaelisaiahlove at gmail.com
Tue Sep 30 20:08:47 CEST 2014


hi Kasper and Valerie,

In Kasper's original email:

"I would like to be able to write a MAP function which takes
  ranges, file
instead of just
  range, file
And then chunk over say 1,000s of ranges. I could then have an
argument to reduceByXX called something like rangeSize, which is kind
of like yieldSize."

I was thinking, for your BigWig example, we get part of the way just
by splitting the ranges into a GRangesList of your desired chunk size.
Then the parallel iteration is over chunks of GRanges. Within your MAP
function:

import(file, which = ranges, as = "Rle", format = "bw")[ranges]

returns an RleList, and calling mean() on it gives a numeric vector
of the mean coverage for each range.

Then the only work needed is how to package the result into something
reasonable. What you would get now is a list (one element per chunk of
the GRangesList) of lists (one per file) of numeric vectors (one mean
per range in the chunk).
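
Here's a minimal sketch of that idea (using your 'all.grs' and
'all.files' from below, with an arbitrary chunk size of 1,000; the
names and size are just for illustration):

library(rtracklayer)
library(BiocParallel)

## split the query into chunks of ~1,000 ranges -> GRangesList
chunks <- split(all.grs, ceiling(seq_along(all.grs) / 1000))

## MAP takes a chunk of ranges plus one file: a single import() per
## (chunk, file), returning one mean per range in the chunk
MAP <- function(ranges, file) {
    rle <- import(file, which = ranges, as = "Rle", format = "bw")[ranges]
    mean(rle)
}

## parallel iteration over chunks; inner loop over files
res <- bplapply(chunks, function(ranges) {
    lapply(all.files, function(file) MAP(ranges, file))
})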

best,

Mike

On Mon, Sep 29, 2014 at 7:09 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> Hi Kasper,
>
> The reduceBy* functions were intended to combine data across ranges or
> files. If we have 10 ranges and 3 files you can think of it as a 10 x 3 grid
> where we'll have 30 queries and therefore 30 pieces of information that can
> be combined across different dimensions.
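>
> In pseudo-R, the two reducers differ only in which dimension REDUCE
> collapses (a conceptual sketch, not the actual implementation):
>
>   ## reduceByRange: for each range i, REDUCE combines across files
>   res[[i]] <- REDUCE(lapply(files, function(f) MAP(ranges[i], f)))
>
>   ## reduceByFile: for each file j, REDUCE combines across ranges
>   res[[j]] <- REDUCE(lapply(seq_along(ranges),
>                             function(i) MAP(ranges[i], files[j])))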
>
> The request to 'process ranges in one operation' is a good suggestion and,
> as you say, may even be the more common use case.
>
> Some action items and explanations:
>
> 1)  treat ranges / files as a group
>
> I'll add functions to treat all ranges / files as a group; essentially no
> REDUCER other than concatenation. Chunking (optional) will occur as defined
> by 'yieldSize' in the BamFile.
>
> 2)  class clean-up
>
> The GenomicFileViews class was the first iteration. It was overly
> complicated and the complication didn't gain us much. In my mind the
> GenomicFiles class is a stripped-down version that should replace the
> *FileViews classes. I plan to deprecate the *FileViews classes.
>
> Ideally the class(es) in GenomicFiles would inherit from
> SummarizedExperiment. A stand-alone package for SummarizedExperiment is in
> the works for the near future. Until those plans are finalized I'd rather
> not change the inheritance structure in GenomicFiles.
>
> 3) pack / unpack
>
> I experimented with this and came to the conclusion that packing / unpacking
> should be left to the user rather than being done behind the scenes with
> reduceBy*.
> The primary difficulty is that you don't have pre-knowledge of the output
> format of the user's MAPPER. If MAPPER outputs an Rle, unpacking may be
> straightforward. If MAPPER outputs a NumericList or matrix or vector with no
> genomic coordinates then things are more complicated.
>
> I'm open if others have suggestions / prototypes.
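>
> For reference, the basic idea behind pack / unpack (a conceptual
> sketch of the reordering only, not the exported API):
>
>   o <- order(ranges)          # 'pack': reorder ranges for efficient I/O
>   packed <- ranges[o]
>   res <- MAP(packed, file)    # one query over the packed ranges
>   res[order(o)]               # 'unpack': restore the user's order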
>
> 4) reduceByYield
>
> This was intended for chunking through ranges in a single file. You can
> imagine using bplapply() over files where each file is chunked through with
> reduceByYield(). All ranges are chunked by 'yieldSize' defined in the
> BamFile unless a 'param' dictates a subset of ranges.
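>
> A minimal sketch of that pattern for BAM input (assuming a character
> vector of BAM paths 'fls'; the chunk size comes from 'yieldSize'):
>
>   library(GenomicFiles)
>   library(GenomicAlignments)
>   library(BiocParallel)
>
>   covByFile <- bplapply(fls, function(fl) {
>       bf <- BamFile(fl, yieldSize = 1e6)
>       reduceByYield(bf,
>                     YIELD = function(x) readGAlignments(x), # next chunk
>                     MAP = coverage,     # per-chunk coverage
>                     REDUCE = `+`)       # running sum across chunks
>   })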
>
> 5) additional tidy
>
> I'll add a name to the 'assays' when summarize=TRUE, make sure return types
> are consistent when summarize=FALSE, etc.
>
> Thanks for the testing and feedback.
>
>
> Valerie
>
>
> On 09/29/2014 07:18 AM, Michael Love wrote:
>>
>> On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
>> <kasperdanielhansen at gmail.com> wrote:
>>>
>>> I don't fully understand "the use case for reducing by range is when the
>>> entire dataset won't fit into memory".  The basic assumption of these
>>> functions (as far as I can see) is that the output data fits in memory.
>>> What may not fit in memory is various earlier "iterations" of the data.
>>> For example, in my use case, if I just read in all the data in all the
>>> ranges in all the samples, it is basically Rle's across 450MB times 38
>>> files, which is not small.  What fits in memory is smaller chunks of
>>> this; that is true for every application.
>>>
>>
>> I was unclear. I meant that, in approach1, you have an object,
>> all.Rle, which contains Rles for every range over every file. Can you
>> actually run this approach on the full dataset?
>>
>>> Reducing by range (or file) only makes sense when the final output
>>> includes one entity for several ranges/files ... right?  So I don't see
>>> how reduce would help me.
>>>
>>
>> Yes, I think we agree. This is not a good use case for reduce by range
>> as now implemented.
>>
>> This is a use case which would benefit from the user-facing function
>> internally calling pack()/unpack() to reduce the number of import()
>> calls, and then in the end giving back the mean coverage over the
>> input ranges. I want this too.
>>
>>
>> https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456
>> (link to the old github repo, the new github repo is named GenomicFiles)
>>
>>> As I see the pack()/unpack() paradigm, it just re-orders the query
>>> ranges (which is super nice and matters a lot for speed for some
>>> applications).  As I understand the code (and my understanding is
>>> developing) we need an extra layer to support processing multiple
>>> ranges in one operation.
>>>
>>> I am happy to help apart from complaining.
>>>
>>> Best,
>>> Kasper
>>>
>>> On Mon, Sep 29, 2014 at 8:55 AM, Michael Love
>>> <michaelisaiahlove at gmail.com> wrote:
>>>>
>>>>
>>>> Thanks for checking it out and benchmarking. We should be more clear
>>>> in the docs that the use case for reducing by range is when the entire
>>>> dataset won't fit into memory. Also, we had some discussion and
>>>> Valerie had written up methods for packing up the ranges supplied by
>>>> the user into a better form for querying files. In your case it would
>>>> have packed many ranges together, to reduce the number of import()
>>>> calls like your naive approach. See the pack/unpack functions, which
>>>> are not in the vignette but are in the man pages. If I remember
>>>> correctly, writing code to unpack() the result was not so simple, and
>>>> development of these functions was set aside for the moment.
>>>>
>>>> Mike
>>>>
>>>> On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>
>>>>>
>>>>> I am testing GenomicFiles.
>>>>>
>>>>> My use case: I have 231k ranges of average width 1.9kb and total
>>>>> width 442 MB.  I also have 38 BigWig files.  I want to compute the
>>>>> average coverage of the 38 BigWig files inside each range.  This is
>>>>> similar to wanting to get coverage of - say - all promoters in the
>>>>> human genome.
>>>>>
>>>>> My data is residing on a file server which is connected to the
>>>>> compute node through ethernet, so basically I have very slow file
>>>>> access.  Also, the BigWig files are (perhaps unusually) big: ~2GB /
>>>>> piece.
>>>>>
>>>>> Below I have two approaches, one using basically straightforward
>>>>> code and the other using GenomicFiles (examples are 10 / 100 ranges
>>>>> on 5 files).  Basically GenomicFiles is 10-20x slower than the
>>>>> straightforward approach.  This is most likely because
>>>>> reduceByRange/reduceByFile processes each (range, file) pair
>>>>> separately.
>>>>>
>>>>> It seems natural (to me) to allow some chunking of the mapping of
>>>>> the ranges.  My naive approach is fast (I assume) because I read
>>>>> multiple ranges through one import() command.  I know I have random
>>>>> access to the BigWig files, but there must still be some overhead of
>>>>> seeking and perhaps more importantly instantiating/moving stuff back
>>>>> and forth to R.  So basically I would like to be able to write a MAP
>>>>> function which takes
>>>>>    ranges, file
>>>>> instead of just
>>>>>    range, file
>>>>> And then chunk over say 1,000s of ranges.  I could then have an
>>>>> argument to reduceByXX called something like rangeSize, which is
>>>>> kind of like yieldSize.
>>>>>
>>>>> Perhaps this is what is intended for the reduceByYield on BigWig files?
>>>>>
>>>>> In a way, this is what is done in the vignette with the
>>>>> coverage(BAMFILE) example, where tileGenome is essentially
>>>>> constructed by the user to chunk the coverage computation.
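>>>>>
>>>>> Roughly (a sketch from memory, assuming a BamFile 'bf'):
>>>>>
>>>>>    tiles <- tileGenome(seqlengths(bf), tilewidth = 1e6,
>>>>>                        cut.last.tile.in.chrom = TRUE)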
>>>>>
>>>>> I think the example of a set of regions I am querying on the files
>>>>> will be an extremely common use case going forward.  The raw data
>>>>> for all the regions together is "too big", but I do a computation on
>>>>> each region to reduce the size.  In this situation, all the focus is
>>>>> on the MAP step.  I see the need for REDUCE in the case of the
>>>>> t-test example in the vignette, where the return object is a single
>>>>> "thing" for each base.  But in general, I think we will use these
>>>>> file structures a lot to construct things without REDUCE (across
>>>>> neither files nor ranges).
>>>>>
>>>>> Also, something completely different: it seems like it would be
>>>>> convenient for stuff like BigWigFileViews to not have to actually
>>>>> parse the file in the MAP step.  Somehow I would envision some kind
>>>>> of reading function, stored inside the object, which just returns an
>>>>> Rle when I ask for a (range, file).  Perhaps this is better left for
>>>>> later.
>>>>>
>>>>> Best,
>>>>> Kasper
>>>>>
>>>>> Examples
>>>>>
>>>>> approach1 <- function(ranges, files) {
>>>>>      ## By hand; needs rtracklayer for import()
>>>>>      all.Rle <- lapply(files, function(file) {
>>>>>          rle <- import(file, as = "Rle", which = ranges,
>>>>>                        format = "bw")[ranges]
>>>>>          rle
>>>>>      })
>>>>>      print(object.size(all.Rle), units = "auto")
>>>>>      ## mean coverage matrix: one row per range, one column per file
>>>>>      mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>>>>>          sapply(xx, mean)
>>>>>      }))
>>>>>      invisible(mat)
>>>>> }
>>>>>
>>>>> system.time({
>>>>> mat1 <- approach1(all.grs[1:10], all.files[1:5])
>>>>> })
>>>>>
>>>>> 160.9 Kb
>>>>>      user  system elapsed
>>>>>     1.109   0.001   1.113
>>>>>
>>>>> system.time({
>>>>> mat1 <- approach1(all.grs[1:100], all.files[1:5])
>>>>> }) # less than 4x slower than previous call
>>>>>
>>>>> 3 Mb
>>>>>      user  system elapsed
>>>>>     4.101   0.019   4.291
>>>>>
>>>>>
>>>>> approach2 <- function(ranges, files) {
>>>>>      gf <- GenomicFiles(rowData = ranges, files = files)
>>>>>      MAPPER <- function(range, file, ...) {
>>>>>          library(rtracklayer)
>>>>>          rle <- import(file, which = range, as = "Rle",
>>>>>                        format = "bw")[range]
>>>>>          mean(rle)
>>>>>      }
>>>>>      sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>>>>>                            BPPARAM = SerialParam())
>>>>>      sExp
>>>>> }
>>>>>
>>>>> system.time({
>>>>> mat2 <- approach2(all.grs[1:10], all.files[1:5])
>>>>> })  # 8-9x slower than approach1
>>>>>
>>>>>     user  system elapsed
>>>>>     9.349   0.280   9.581
>>>>>
>>>>> system.time({
>>>>> mat2 <- approach2(all.grs[1:100], all.files[1:5])
>>>>> })  # 9x slower than previous call, 20x slow than approach1 on same
>>>>> input
>>>>>
>>>>>     user  system elapsed
>>>>>    89.310   0.627  91.044
>>>>>


