[Bioc-devel] GenomicFiles: chunking

Michael Love michaelisaiahlove at gmail.com
Mon Oct 27 20:07:13 CET 2014


hi Valerie,

this sounds good to me.

I am thinking of working on a function (here or elsewhere) that helps
decide, for reduce by range, how to optimally chunk GRanges into a
GRangesList. Practically, this could involve sampling the size of the
imported data for a subset of cells in the (ranges, files) matrix, and
asking/estimating the amount of memory available for each worker.
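
A rough sketch of the idea (nothing implemented yet; 'grs', 'fls' and the
per-worker memory budget below are all hypothetical):

library(rtracklayer)
library(GenomicRanges)

estimateChunking <- function(grs, fls, bytesPerWorker = 2e9,
                             nRanges = 10, nFiles = 2) {
    ## sample a few cells of the (ranges, files) matrix
    ridx <- sample(length(grs), min(nRanges, length(grs)))
    fidx <- sample(length(fls), min(nFiles, length(fls)))
    sizes <- sapply(fls[fidx], function(f) {
        rle <- import(f, which = grs[ridx], as = "Rle", format = "bw")
        as.numeric(object.size(rle))
    })
    ## estimated bytes per (range, file) cell; a worker sees one chunk
    ## of ranges across all files
    bytesPerCell <- mean(sizes) / length(ridx)
    chunkSize <- max(1L, floor(bytesPerWorker / (bytesPerCell * length(fls))))
    split(grs, ceiling(seq_along(grs) / chunkSize))  ## GRangesList of chunks
}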

best,

Mike


On Mon, Oct 27, 2014 at 2:35 PM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

> Hi Kasper and Mike,
>
> I've added 2 new functions to GenomicFiles and deprecated the old classes.
> The vignette now has a graphic which (hopefully) clarifies the MAP / REDUCE
> mechanics of the different functions.
>
> Below is some performance testing for the new functions and answers to
> leftover questions from previous emails.
>
>
> Major changes to GenomicFiles (in devel):
>
> - *FileViews classes have been deprecated:
>
> The idea is to use the GenomicFiles class to hold any type of file, be it
> BAM, BigWig, a character vector of paths, etc., instead of having
> name-specific classes like BigWigFileViews. Currently GenomicFiles does not
> inherit from SummarizedExperiment, but it may in the future.
>
> - Add reduceFiles() and reduceRanges():
>
> These functions pass all ranges or files to MAP, vs the lapply approach
> taken in reduceByFile() and reduceByRange().
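>
> A minimal sketch of the calling pattern (hypothetical 'grs' GRanges and
> 'fls' BAM paths; as described above, MAP receives all ranges plus a single
> file per call):
>
> library(GenomicFiles)
> library(Rsamtools)
> MAP <- function(ranges, file, ...) {
>     ## all ranges arrive at once, so a single query against the file
>     countBam(file, param = ScanBamParam(which = ranges))$records
> }
> res <- reduceFiles(grs, fls, MAP = MAP)  ## list, one element per file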
>
>
> (1) Performance:
>
> When testing with reduceByFile() you noted "GenomicFiles is 10-20x slower
> than the straightforward approach". You also noted this was probably
> because of the lapply over all ranges, which is true. (Most likely there was
> overhead in creating the SummarizedExperiment object as well.) With the new
> reduceFiles(), passing all ranges at once, we see performance very similar
> to that of the 'by hand' approach.
>
> In the test code I've used BAM instead of BigWig. Both test functions
> output lists, have comparable MAP and REDUCE steps, etc.
>
> I used 5 files ('fls') and a GRanges ('grs') of length 100.
> > length(grs)
> [1] 100
>
> > sum(width(grs))
> [1] 1000000
>
> FUN1 is the 'by hand' version. These results are similar to what you saw,
> not quite a 4x difference between 10 and 100 ranges.
>
> >> microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
> > Unit: seconds
> >                   expr      min       lq     mean   median       uq      max neval
> >   FUN1(grs[1:10], fls) 1.177858 1.190239 1.206127 1.201331 1.222298 1.256741    10
> >  FUN1(grs[1:100], fls) 4.145503 4.163404 4.249619 4.208486 4.278463 4.533846    10
>
> FUN2 is the reduceFiles() approach and the results are very similar to
> FUN1.
>
> >> microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
> > Unit: seconds
> >                   expr      min       lq     mean   median       uq      max neval
> >   FUN2(grs[1:10], fls) 1.242767 1.251188 1.257531 1.253154 1.267655 1.275698    10
> >  FUN2(grs[1:100], fls) 4.251010 4.340061 4.390290 4.361007 4.384064 4.676068    10
>
>
> (2) Request for "chunking of the mapping of ranges":
>
> For now we decided not to add a 'yieldSize' argument for chunking. There
> are 2 approaches to chunking through ranges *within* the same file. In both
> cases the user splits the ranges, either before calling the function or in
> the MAP step.
>
> i) reduceByFile() with a GRangesList:
>
> The user provides a GRangesList as the 'ranges' arg. On each worker,
> lapply applies MAP to the single file and each element of the GRangesList.
>
> ii) reduceFiles() with a MAP that handles chunking:
>
> The user splits the ranges in MAP and uses lapply or another loop to
> iterate. For example,
>
> MAP <- function(range, file, ...) {
>     lst <- split(range, someFactor)              ## 'someFactor' defines the chunks
>     someFUN <- function(rngs, file, ...) {}      ## do something with one chunk
>     lapply(lst, FUN = someFUN, file = file, ...)
> }
>
> The same ideas apply for chunking through ranges *across* files with
> reduceByRange() and reduceRanges().
>
> iii) reduceByRange() with a GRangesList:
>
> Mike has a good example here:
> https://gist.github.com/mikelove/deaff999984dc75f125d
>
> iv) reduceRanges():
>
> 'ranges' should be a GRangesList. The MAP step will operate on one element
> of the GRangesList and all files. Unless you want to operate on all files
> at once, I'd use reduceByRange() instead.
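>
> A small sketch for (iv), assuming 'grl' is a GRangesList of range chunks
> and 'fls' a vector of BigWig paths; the MAP argument order follows the
> description above (one GRangesList element, then all files):
>
> library(GenomicFiles)
> MAP <- function(range, files, ...) {
>     library(rtracklayer)
>     ## one column of per-range coverage means for each file in this chunk
>     sapply(files, function(f)
>         mean(import(f, which = range, as = "Rle", format = "bw")[range]))
> }
> res <- reduceRanges(grl, fls, MAP = MAP)  ## list, one element per chunk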
>
>
> (3) Return objects have different shape:
>
> Previous question:
>
> "...
> Why is the return object of reduceByFile vs reduceByRange (with
> summarize = FALSE) different?  I understand why internally you have
> different nesting schemes (files and ranges) for the two functions, but it
> is not clear to me that it is desirable to have the return object depend on
> how the computation was done.
>
> reduceByFile() and reduceFiles() output a list the same length as the
> number of files while reduceByRange() and reduceRanges() output a list the
> same length as the number of ranges.
>
> Reduction is different depending on which function is chosen; data are
> collapsed either within a file or across files. When REDUCE does something
> substantial the outputs are not equivalent.
>
> While it's possible to get the same result (REDUCE simply unlists or isn't
> used), the two approaches were not intended to be equivalent ways of
> arriving at the same end. The idea was that the user has a specific use
> case in mind: they want to collapse the data either across or within files.
>
>
> (4) return object from coverage(BigWigFileViews):
>
> Previous comment:
>
> "...
> coverage(BigWigFileViews) returns a "wrong" assay object in my opinion,
> ...
> Specifically, each (i,j) entry in the object is an RleList with a single
> element with a name equal to the seqnames of the i'th entry in the query
> GRanges.  To me, this extra nestedness is unnecessary; I would have
> expected an Rle instead of an RleList with 1 element.
> ..."
>
> The return value from coverage(x) is an RleList with one coverage vector
> per seqlevel in 'x'. Even if there is only one seqlevel, the result still
> comes back as an RleList. This is just the default behavior.
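>
> A tiny illustration of that default with a toy GRanges (nothing to do with
> BigWig):
>
> library(GenomicRanges)
> gr <- GRanges("chr1", IRanges(c(1, 5), width = 4))
> coverage(gr)
> ## an RleList of length 1 with a single element named "chr1"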
>
>
> (5) separate the 'read' function from the MAP step
>
> Previous comment:
>
> "...
> Also, something completely different, it seems like it would be convenient
> for stuff like BigWigFileViews to not have to actually parse the file in
> the MAP step.  Somehow I would envision some kind of reading function,
> stored inside the object, which just returns an Rle when I ask for a
> (range, file).  Perhaps this is better left for later.
> ..."
>
> The current approach for the reduce* functions is for MAP to both extract
> and manipulate data. The idea of separating the extraction step is actually
> implemented in reduceByYield(). (This function used to be yieldReduce() in
> Rsamtools in past releases.) For reduceByYield() the user must specify
> YIELD (a reader function), MAP, REDUCE and DONE (the criterion for stopping
> iteration).
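>
> A minimal sketch of that interface, using the example BAM file shipped with
> Rsamtools (the DONE shown is just the usual "no more records" test):
>
> library(GenomicFiles)
> library(Rsamtools)
> library(GenomicAlignments)
> bf <- BamFile(system.file("extdata", "ex1.bam", package = "Rsamtools"),
>               yieldSize = 500)
> YIELD  <- function(x, ...) readGAlignments(x)   ## reader: one chunk per call
> MAP    <- function(value, ...) length(value)    ## work on the chunk
> REDUCE <- `+`                                   ## combine chunk results
> DONE   <- function(value) length(value) == 0L   ## stop criterion
> reduceByYield(bf, YIELD = YIELD, MAP = MAP, REDUCE = REDUCE, DONE = DONE)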
>
> I'm not sure what is best here. I thought the many-argument approach of
> reduceByYield() was possibly confusing or burdensome and so didn't use it
> in the other GenomicFiles functions. Maybe it's not confusing but instead
> makes the individual steps clearer. What do you think:
>
> - Should the reader function be separate from the MAP? What are the
> advantages?
>
> - Should READER, MAP, REDUCE be stored inside the GenomicFiles object or
> supplied as arguments to the functions?
>
>
> (6) unnamed assay in SummarizedExperiment
>
> Previous comment:
>
> "...
> The return object of reduceByRange / reduceByFile with summarize = TRUE is
> a SummarizedExperiment with an unnamed assay.  I was surprised to see that
> this is even possible.
> ..."
>
> There is no default assay name for SummarizedExperiment in general. I've
> named the assay 'data' for lack of a better term. We could also go with
> 'reducedData' or another suggestion.
>
>
> Thanks for the feedback.
>
> Valerie
>
>
>
>
> On 10/01/2014 08:30 AM, Michael Love wrote:
>
>> hi Kasper,
>>
>> For a concrete example, I posted an R and Rout file here:
>>
>> https://gist.github.com/mikelove/deaff999984dc75f125d
>>
>> Things to note: 'ranges' is a GRangesList, I cbind() the numeric
>> vectors in the REDUCE, and then rbind() the final list to get the
>> desired matrix.
>>
>> Other than the weird column name 'init', does this give you what you want?
>>
>> best,
>>
>> Mike
>>
>> On Tue, Sep 30, 2014 at 2:08 PM, Michael Love
>> <michaelisaiahlove at gmail.com> wrote:
>>
>>> hi Kasper and Valerie,
>>>
>>> In Kasper's original email:
>>>
>>> "I would like to be able to write a MAP function which takes
>>>    ranges, file
>>> instead of just
>>>    range, file
>>> And then chunk over say 1,000s of ranges. I could then have an
>>> argument to reduceByXX called something like rangeSize, which is kind
>>> of yieldSize."
>>>
>>> I was thinking, for your BigWig example, we get part of the way just
>>> by splitting the ranges into a GRangesList of your desired chunk size.
>>> Then the parallel iteration is over chunks of GRanges. Within your MAP
>>> function:
>>>
>>> import(file, which = ranges, as = "Rle", format = "bw")[ranges]
>>>
>>> returns an RleList, and calling mean() on this gives a numeric vector
>>> of the mean of the coverage for each range.
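>>>
>>> Spelled out, that MAP would look roughly like this ('ranges' being one
>>> element of the GRangesList, i.e. one chunk):
>>>
>>> MAP <- function(ranges, file, ...) {
>>>     library(rtracklayer)
>>>     cov <- import(file, which = ranges, as = "Rle", format = "bw")[ranges]
>>>     mean(cov)  ## one mean coverage value per range in the chunk
>>> }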
>>>
>>> Then the only work needed is how to package the result into something
>>> reasonable. What you would get now is a list (length of GRangesList)
>>> of lists (length of files) of vectors (length of GRanges).
>>>
>>> best,
>>>
>>> Mike
>>>
>>> On Mon, Sep 29, 2014 at 7:09 PM, Valerie Obenchain <vobencha at fhcrc.org>
>>> wrote:
>>>
>>>> Hi Kasper,
>>>>
>>>> The reduceBy* functions were intended to combine data across ranges or
>>>> files. If we have 10 ranges and 3 files you can think of it as a 10 x 3
>>>> grid where we'll have 30 queries and therefore 30 pieces of information
>>>> that can be combined across different dimensions.
>>>>
>>>> The request to 'process ranges in one operation' is a good suggestion
>>>> and, as you say, may even be the more common use case.
>>>>
>>>> Some action items and explanations:
>>>>
>>>> 1)  treat ranges / files as a group
>>>>
>>>> I'll add functions to treat all ranges / files as a group; essentially
>>>> no REDUCER other than concatenation. Chunking (optional) will occur as
>>>> defined by 'yieldSize' in the BamFile.
>>>>
>>>> 2)  class clean-up
>>>>
>>>> The GenomicFileViews class was the first iteration. It was overly
>>>> complicated and the complication didn't gain us much. In my mind the
>>>> GenomicFiles class is a stripped-down version that should replace the
>>>> *FileViews classes. I plan to deprecate the *FileViews classes.
>>>>
>>>> Ideally the class(es) in GenomicFiles would inherit from
>>>> SummarizedExperiment. A stand-alone package for SummarizedExperiment is
>>>> in the works for the near future. Until those plans are finalized I'd
>>>> rather not change the inheritance structure in GenomicFiles.
>>>>
>>>> 3) pack / unpack
>>>>
>>>> I experimented with this and came to the conclusion that packing /
>>>> unpacking should be left to the user vs being done behind the scenes
>>>> with reduceBy*. The primary difficulty is that you don't have
>>>> pre-knowledge of the output format of the user's MAPPER. If MAPPER
>>>> outputs an Rle, unpacking may be straightforward. If MAPPER outputs a
>>>> NumericList or matrix or vector with no genomic coordinates then things
>>>> are more complicated.
>>>>
>>>> I'm open if others have suggestions / prototypes.
>>>>
>>>> 4) reduceByYield
>>>>
>>>> This was intended for chunking through ranges in a single file. You can
>>>> imagine using bplapply() over files where each file is chunked through
>>>> with reduceByYield(). All ranges are chunked by 'yieldSize' defined in
>>>> the BamFile unless a 'param' dictates a subset of ranges.
>>>>
>>>> 5) additional tidy
>>>>
>>>> I'll add a name to the 'assays' when summarize=TRUE, make sure return
>>>> types are consistent when summarize=FALSE, etc.
>>>>
>>>> Thanks for the testing and feedback.
>>>>
>>>>
>>>> Valerie
>>>>
>>>>
>>>> On 09/29/2014 07:18 AM, Michael Love wrote:
>>>>
>>>>>
>>>>> On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> I don't fully understand "the use case for reducing by range is when
>>>>>> the entire dataset won't fit into memory".  The basic assumption of
>>>>>> these functions (as far as I can see) is that the output data fits in
>>>>>> memory.  What may not fit in memory is various earlier "iterations"
>>>>>> of the data.  For example, in my use case, if I just read in all the
>>>>>> data in all the ranges in all the samples it is basically Rle's
>>>>>> across 450MB times 38 files, which is not small.  What fits in memory
>>>>>> is smaller chunks of this; that is true for every application.
>>>>>>
>>>>>>
>>>>> I was unclear. I meant that, in approach1, you have an object,
>>>>> all.Rle, which contains Rles for every range over every file. Can you
>>>>> actually run this approach on the full dataset?
>>>>>
>>>>>> Reducing by range (or file) only makes sense when the final output
>>>>>> includes one entity for several ranges/files ... right?  So I don't
>>>>>> see how reduce would help me.
>>>>>>
>>>>>>
>>>>> Yes, I think we agree. This is not a good use case for reduce by range
>>>>> as now implemented.
>>>>>
>>>>> This is a use case which would benefit from the user-facing function
>>>>> internally calling pack()/unpack() to reduce the number of import()
>>>>> calls, and then in the end giving back the mean coverage over the
>>>>> input ranges. I want this too.
>>>>>
>>>>>
>>>>> https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456
>>>>> (link to the old github repo; the new github repo is named GenomicFiles)
>>>>>
>>>>>> As I see the pack()/unpack() paradigm, it just re-orders the query
>>>>>> ranges (which is super nice and matters a lot for speed for some
>>>>>> applications).  As I understand the code (and my understanding is
>>>>>> developing) we need an extra layer to support processing multiple
>>>>>> ranges in one operation.
>>>>>>
>>>>>> I am happy to help apart from complaining.
>>>>>>
>>>>>> Best,
>>>>>> Kasper
>>>>>>
>>>>>> On Mon, Sep 29, 2014 at 8:55 AM, Michael Love
>>>>>> <michaelisaiahlove at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for checking it out and benchmarking. We should be more clear
>>>>>>> in the docs that the use case for reducing by range is when the
>>>>>>> entire dataset won't fit into memory. Also, we had some discussion
>>>>>>> and Valerie had written up methods for packing up the ranges supplied
>>>>>>> by the user into a better form for querying files. In your case it
>>>>>>> would have packed many ranges together, to reduce the number of
>>>>>>> import() calls like your naive approach. See the pack/unpack
>>>>>>> functions, which are not in the vignette but are in the man pages. If
>>>>>>> I remember, writing code to unpack() the result was not so simple,
>>>>>>> and development of these functions was set aside for the moment.
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
>>>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I am testing GenomicFiles.
>>>>>>>>
>>>>>>>> My use case: I have 231k ranges of average width 1.9kb and total
>>>>>>>> width 442 MB.  I also have 38 BigWig files.  I want to compute the
>>>>>>>> average coverage of the 38 BigWig files inside each range.  This is
>>>>>>>> similar to wanting to get coverage of - say - all promoters in the
>>>>>>>> human genome.
>>>>>>>>
>>>>>>>> My data resides on a file server which is connected to the compute
>>>>>>>> node through ethernet, so basically I have very slow file access.
>>>>>>>> Also, the BigWig files are (perhaps unusually) big: ~2GB / piece.
>>>>>>>>
>>>>>>>> Below I have two approaches, one using basically straightforward
>>>>>>>> code and the other using GenomicFiles (examples are 10 / 100 ranges
>>>>>>>> on 5 files).  Basically GenomicFiles is 10-20x slower than the
>>>>>>>> straightforward approach.  This is most likely because
>>>>>>>> reduceByRange/reduceByFile processes each (range, file) separately.
>>>>>>>>
>>>>>>>> It seems natural (to me) to allow some chunking of the mapping of
>>>>>>>> the ranges.  My naive approach is fast (I assume) because I read
>>>>>>>> multiple ranges through one import() command.  I know I have random
>>>>>>>> access to the BigWig files, but there must still be some overhead of
>>>>>>>> seeking and perhaps more importantly instantiating/moving stuff back
>>>>>>>> and forth to R.  So basically I would like to be able to write a MAP
>>>>>>>> function which takes
>>>>>>>>     ranges, file
>>>>>>>> instead of just
>>>>>>>>     range, file
>>>>>>>> And then chunk over say 1,000s of ranges.  I could then have an
>>>>>>>> argument to reduceByXX called something like rangeSize, which is
>>>>>>>> kind of yieldSize.
>>>>>>>>
>>>>>>>> Perhaps this is what is intended for the reduceByYield on BigWig files?
>>>>>>>>
>>>>>>>> In a way, this is what is done in the vignette with the
>>>>>>>> coverage(BAMFILE) example where tileGenome is essentially constructed
>>>>>>>> by the user to chunk the coverage computation.
>>>>>>>>
>>>>>>>> I think the example of a set of regions I am querying on the files
>>>>>>>> will be an extremely common use case going forward.  The raw data for
>>>>>>>> all the regions together is "too big" but I do a computation on each
>>>>>>>> region to reduce the size.  In this situation, all the focus is on
>>>>>>>> the MAP step.  I see the need for REDUCE in the case of the t-test
>>>>>>>> example in the vignette, where the return object is a single "thing"
>>>>>>>> for each base.  But in general, I think we will use these file
>>>>>>>> structures a lot to construct things without REDUCE (across neither
>>>>>>>> files nor ranges).
>>>>>>>>
>>>>>>>> Also, something completely different, it seems like it would be
>>>>>>>> convenient for stuff like BigWigFileViews to not have to actually
>>>>>>>> parse the file in the MAP step.  Somehow I would envision some kind
>>>>>>>> of reading function, stored inside the object, which just returns an
>>>>>>>> Rle when I ask for a (range, file).  Perhaps this is better left for
>>>>>>>> later.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kasper
>>>>>>>>
>>>>>>>> Examples
>>>>>>>>
>>>>>>>> approach1 <- function(ranges, files) {
>>>>>>>>       ## By hand
>>>>>>>>       all.Rle <- lapply(files, function(file) {
>>>>>>>>           rle <- import(file, as = "Rle", which = ranges,
>>>>>>>>                         format = "bw")[ranges]
>>>>>>>>           rle
>>>>>>>>       })
>>>>>>>>       print(object.size(all.Rle), units = "auto")
>>>>>>>>       mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>>>>>>>>           sapply(xx, mean)
>>>>>>>>       }))
>>>>>>>>       invisible(mat)
>>>>>>>> }
>>>>>>>>
>>>>>>>> system.time({
>>>>>>>> mat1 <- approach1(all.grs[1:10], all.files[1:5])
>>>>>>>> })
>>>>>>>>
>>>>>>>> 160.9 Kb
>>>>>>>>       user  system elapsed
>>>>>>>>      1.109   0.001   1.113
>>>>>>>>
>>>>>>>> system.time({
>>>>>>>> mat1 <- approach1(all.grs[1:100], all.files[1:5])
>>>>>>>> }) # less than 4x slower than previous call
>>>>>>>>
>>>>>>>> 3 Mb
>>>>>>>>       user  system elapsed
>>>>>>>>      4.101   0.019   4.291
>>>>>>>>
>>>>>>>>
>>>>>>>> approach2 <- function(ranges, files) {
>>>>>>>>       gf <- GenomicFiles(rowData = ranges, files = files)
>>>>>>>>       MAPPER <- function(range, file, ...) {
>>>>>>>>           library(rtracklayer)
>>>>>>>>           rle <- import(file, which = range, as = "Rle",
>>>>>>>>                         format = "bw")[range]
>>>>>>>>           mean(rle)
>>>>>>>>       }
>>>>>>>>       sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>>>>>>>>                             BPPARAM = SerialParam())
>>>>>>>>       sExp
>>>>>>>> }
>>>>>>>>
>>>>>>>> system.time({
>>>>>>>> mat2 <- approach2(all.grs[1:10], all.files[1:5])
>>>>>>>> })  # 8-9x slower than approach1
>>>>>>>>
>>>>>>>>      user  system elapsed
>>>>>>>>      9.349   0.280   9.581
>>>>>>>>
>>>>>>>> system.time({
>>>>>>>> mat2 <- approach2(all.grs[1:100], all.files[1:5])
>>>>>>>> })  # 9x slower than previous call, 20x slower than approach1 on
>>>>>>>> same input
>>>>>>>>
>>>>>>>>      user  system elapsed
>>>>>>>>     89.310   0.627  91.044
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>



