[Bioc-devel] GenomicFiles: chunking
Valerie Obenchain
vobencha at fhcrc.org
Mon Oct 27 20:32:10 CET 2014
Sounds great. I think GenomicFiles is a good place for such a function -
it's along the lines of what we wanted to accomplish with pack / unpack.
Maybe your new function can be used by pack once finished. There's
definitely room for expanding that functionality.
Valerie
On 10/27/2014 12:07 PM, Michael Love wrote:
> hi Valerie,
>
> this sounds good to me.
>
> I am thinking of working on a function (here or elsewhere) that helps
> decide, for reduce by range, how to optimally chunk GRanges into a
> GRangesList. Practically, this could involve sampling the size of the
> imported data for a subset of cells in the (ranges, files) matrix, and
> asking/estimating the amount of memory available for each worker.
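> 
> A rough sketch of what such a helper might look like (the name, the
> memory heuristic, and sampling only the first file are all
> placeholders, not a worked-out design):
> 
> library(GenomicRanges)
> library(rtracklayer)
> 
> chunkRanges <- function(ranges, files, bytesPerWorker = 1e9, nSample = 10) {
>     ## sample a few ranges and measure the size of the imported data
>     idx <- sample(length(ranges), min(nSample, length(ranges)))
>     sizes <- vapply(idx, function(i) {
>         dat <- import(files[[1]], which = ranges[i], as = "Rle", format = "bw")
>         as.numeric(object.size(dat))
>     }, numeric(1))
>     ## how many ranges fit in the memory budget of one worker
>     chunkSize <- max(1L, floor(bytesPerWorker / mean(sizes)))
>     split(ranges, ceiling(seq_along(ranges) / chunkSize))   # a GRangesList
> }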
>
> best,
>
> Mike
>
>
> On Mon, Oct 27, 2014 at 2:35 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>
> Hi Kasper and Mike,
>
> I've added 2 new functions to GenomicFiles and deprecated the old
> classes. The vignette now has a graphic which (hopefully) clarifies
> the MAP / REDUCE mechanics of the different functions.
>
> Below is some performance testing for the new functions and answers
> to leftover questions from previous emails.
>
>
> Major changes to GenomicFiles (in devel):
>
> - *FileViews classes have been deprecated:
>
> The idea is to use the GenomicFiles class to hold any type of file,
> be it BAM, BigWig, or simply a character vector of paths, instead of
> having file-type-specific classes like BigWigFileViews. Currently
> GenomicFiles does not inherit from SummarizedExperiment, but it may
> in the future.
>
> - Add reduceFiles() and reduceRanges():
>
> These functions pass all ranges or all files to MAP in a single call,
> versus the element-by-element lapply approach taken in reduceByFile()
> and reduceByRange().
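> 
> To make the difference concrete, a minimal sketch of the two MAP
> signatures (reusing the BigWig import() idiom from the examples
> further down; 'range'/'ranges' and 'file' are just the argument names
> used in this thread):
> 
> library(rtracklayer)
> 
> ## reduceByFile() / reduceByRange(): MAP sees one range and one file at a time
> MAP_one <- function(range, file, ...)
>     mean(import(file, which = range, as = "Rle", format = "bw")[range])
> 
> ## reduceFiles(): MAP sees *all* ranges for a given file in one call
> MAP_all <- function(ranges, file, ...)
>     mean(import(file, which = ranges, as = "Rle", format = "bw")[ranges])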
>
>
> (1) Performance:
>
> When testing with reduceByFile() you noted "GenomicFiles is 10-20x
> slower than the straightforward approach". You also noted this was
> probably because of the lapply over all ranges - true. (Most likely
> there was overhead in creating the SE object as well.) With the new
> reduceFiles(), passing all ranges at once, we see performance very
> similar to that of the 'by hand' approach.
>
> In the test code I've used Bam instead of BigWig. Both test
> functions output lists, have comparable MAP and REDUCE steps etc.
>
> I used 5 files ('fls') and a GRanges ('grs') of length 100.
> > length(grs)
> [1] 100
>
> > sum(width(grs))
> [1] 1000000
>
> FUN1 is the 'by hand' version. These results are similar to what you
> saw, not quite a 4x difference between 10 and 100 ranges.
>
> >> microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
> > Unit: seconds
> >                   expr      min       lq     mean   median       uq      max neval
> >   FUN1(grs[1:10], fls) 1.177858 1.190239 1.206127 1.201331 1.222298 1.256741    10
> >  FUN1(grs[1:100], fls) 4.145503 4.163404 4.249619 4.208486 4.278463 4.533846    10
>
> FUN2 is the reduceFiles() approach and the results are very similar
> to FUN1.
>
> >> microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
> > Unit: seconds
> >                   expr      min       lq     mean   median       uq      max neval
> >   FUN2(grs[1:10], fls) 1.242767 1.251188 1.257531 1.253154 1.267655 1.275698    10
> >  FUN2(grs[1:100], fls) 4.251010 4.340061 4.390290 4.361007 4.384064 4.676068    10
>
>
> (2) Request for "chunking of the mapping of ranges":
>
> For now we decided not to add a 'yieldSize' argument for chunking.
> There are 2 approaches to chunking through ranges *within* the same
> file. In both cases the user splits the ranges, either before calling
> the function or in the MAP step.
>
> i) reduceByFile() with a GRangesList:
>
> The user provides a GRangesList as the 'ranges' arg. On each worker,
> lapply applies MAP to that single file and each element of the
> GRangesList (see the sketch after item iv below).
>
> ii) reduceFiles() with a MAP that handles chunking:
>
> The user splits the ranges inside MAP and uses lapply or another loop
> to iterate. For example,
>
> MAP <- function(range, file, ...) {
>     lst <- split(range, someFactor)          # 'someFactor' defines the chunks
>     someFUN <- function(rngs, file, ...) {
>         ## do something with one chunk of ranges
>     }
>     lapply(lst, someFUN, file = file, ...)
> }
>
> The same ideas apply for chunking through ranges *across* files with
> reduceByRange() and reduceRanges().
>
> iii) reduceByRange() with a GRangesList:
>
> Mike has a good example here:
> https://gist.github.com/mikelove/deaff999984dc75f125d
>
> iv) reduceRanges():
>
> 'ranges' should be a GRangesList. The MAP step will operate on one
> element of the GRangesList and all files in a single call. Unless you
> want all files at once, I'd use reduceByRange() instead.
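> 
> To make (i) and (iii) concrete, a minimal sketch (assuming 'grs' and
> 'fls' as in the timings above, chunks of ~1000 ranges, and that the
> reduce* functions accept 'ranges' and 'files' directly):
> 
> library(GenomicFiles)
> library(rtracklayer)
> 
> ## chunk the GRanges into a GRangesList; each element becomes one MAP call
> grl <- split(grs, ceiling(seq_along(grs) / 1000))
> 
> MAP <- function(range, file, ...)   # 'range' is one GRangesList element
>     mean(import(file, which = range, as = "Rle", format = "bw")[range])
> 
> res <- reduceByFile(ranges = grl, files = fls, MAP = MAP)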
>
>
> (3) Return objects have different shape:
>
> Previous question:
>
> "...
> Why is the return object of reduceByFile vs reduceByRange (with
> summarize = FALSE) different? I understand why internally you have
> different nesting schemes (files and ranges) for the two functions,
> but it is not clear to me that it is desirable to have the return
> object depend on how the computation was done.
> ..."
>
> reduceByFile() and reduceFiles() output a list the same length as
> the number of files while reduceByRange() and reduceRanges() output
> a list the same length as the number of ranges.
>
> Reduction is different depending on which function is chosen; data
> are collapsed either within a file or across files. When REDUCE does
> something substantial the outputs are not equivalent.
>
> While it's possible to get the same result (REDUCE simply unlists or
> isn't used), the two approaches were not intended to be equivalent
> ways of arriving at the same end. The idea was that the user had a
> specific use case in mind - they either wanted to collapse the data
> across or within files.
>
>
> (4) return object from coverage(BigWigFileViews):
>
> Previous comment:
>
> "...
> coverage(BigWigFileViews) returns a "wrong" assay object in my opinion,
> ...
> Specifically, each (i,j) entry in the object is an RleList with a single
> element with a name equal to the seqnames of the i'th entry in the query
> GRanges. To me, this extra nestedness is unnecessary; I would have
> expected an Rle instead of an RleList with 1 element.
> ..."
>
> The return value from coverage(x) is an RleList with one coverage
> vector per seqlevel in 'x'. Even if there is only one seqlevel, the
> result still comes back as an RleList. This is just the default
> behavior.
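> 
> A small illustration of that default (the names of the RleList
> elements are the seqlevels, and '[[' drops down to a plain Rle):
> 
> library(GenomicRanges)
> gr <- GRanges("chr1", IRanges(c(1, 5), width = 10))
> cov <- coverage(gr)    # RleList with one element per seqlevel
> cov[["chr1"]]          # a single Rle for that seqlevel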
>
>
> (5) separate the 'read' function from the MAP step
>
> Previous comment:
>
> "...
> Also, something completely different, it seems like it would be
> convenient for stuff like BigWigFileViews to not have to actually
> parse the file in the MAP step. Somehow I would envision some kind
> of reading function, stored inside the object, which just returns an
> Rle when I ask for a (range, file). Perhaps this is better left for
> later.
> ..."
>
> The current approach for the reduce* functions is for MAP to both
> extract and manipulate data. The idea of separating the extraction
> step is actually implemented in reduceByYield(). (This function used
> to be yieldReduce() in Rsamtools in past releases.) For
> reduceByYield() the user must specify YIELD (a reader function),
> MAP, REDUCE and DONE (criteria to stop iteration).
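> 
> For reference, a minimal reduceByYield() sketch (counting reads in
> one BAM file in chunks; the file name is a placeholder and the exact
> defaults may differ slightly):
> 
> library(GenomicFiles)
> library(Rsamtools)
> library(GenomicAlignments)
> 
> bf <- BamFile("example.bam", yieldSize = 100000)  # placeholder file
> n <- reduceByYield(bf,
>     YIELD  = function(x) readGAlignments(x),  # reader: one chunk per call
>     MAP    = length,                          # work on each chunk
>     REDUCE = `+`,                             # combine chunk results
>     DONE   = function(x) length(x) == 0L)     # stop when nothing is read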
>
> I'm not sure what is best here. I thought the many-argument approach
> of reduceByYield() was possibly confusing or burdensome and so
> didn't use it in the other GenomicFiles functions. Maybe it's not
> confusing but instead makes the individual steps clearer. What do
> you think:
>
> - Should the reader function be separate from the MAP? What are the
> advantages?
>
> - Should READER, MAP, REDUCE be stored inside the GenomicFiles
> object or supplied as arguments to the functions?
>
>
> (6) unnamed assay in SummarizedExperiment
>
> Previous comment:
>
> "...
> The return object of reduceByRange / reduceByFile with summarize =
> TRUE is a SummarizedExperiment with an unnamed assay. I was
> surprised to see that this is even possible.
> ..."
>
> There is no default assay name for SummarizedExperiment in general.
> I've named the assay 'data' for lack of a better term. We could also
> go with 'reducedData' or another suggestion.
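> 
> (Renaming after the fact should also be easy; 'se' here stands for
> the returned SummarizedExperiment:)
> 
> names(assays(se)) <- "reducedData"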
>
>
> Thanks for the feedback.
>
> Valerie
>
>
>
>
> On 10/01/2014 08:30 AM, Michael Love wrote:
>
> hi Kasper,
>
> For a concrete example, I posted an R and Rout file here:
>
> https://gist.github.com/mikelove/deaff999984dc75f125d
>
> Things to note: 'ranges' is a GRangesList, I cbind() the numeric
> vectors in the REDUCE, and then rbind() the final list to get the
> desired matrix.
>
> Other than the weird column name 'init', does this give you what
> you want?
>
> best,
>
> Mike
>
> On Tue, Sep 30, 2014 at 2:08 PM, Michael Love
> <michaelisaiahlove at gmail.com> wrote:
>
> hi Kasper and Valerie,
>
> In Kasper's original email:
>
> "I would like to be able to write a MAP function which takes
> ranges, file
> instead of just
> range, file
> And then chunk over say 1,000s of ranges. I could then have an
> argument to reduceByXX called something like rangeSize, which is
> kind of yieldSize."
>
> I was thinking, for your BigWig example, we get part of the way just
> by splitting the ranges into a GRangesList of your desired chunk
> size. Then the parallel iteration is over chunks of GRanges. Within
> your MAP function:
>
> import(file, which = ranges, as = "Rle", format = "bw")[ranges]
>
> returns an RleList, and calling mean() on this gives a numeric vector
> of the mean of the coverage for each range.
>
> Then the only work needed is how to package the result into something
> reasonable. What you would get now is a list (length of GRangesList)
> of lists (length of files) of vectors (length of GRanges).
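> 
> A sketch of that last re-packaging step ('res' being the nested list
> just described), mirroring the cbind()/rbind() approach in the gist:
> 
> ## res: list over chunks -> list over files -> numeric vector over ranges
> perChunk <- lapply(res, function(chunk) do.call(cbind, chunk))  # ranges x files
> mat <- do.call(rbind, perChunk)                                 # all ranges x files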
>
> best,
>
> Mike
>
> On Mon, Sep 29, 2014 at 7:09 PM, Valerie Obenchain
> <vobencha at fhcrc.org> wrote:
>
> Hi Kasper,
>
> The reduceBy* functions were intended to combine data across ranges
> or files. If we have 10 ranges and 3 files you can think of it as a
> 10 x 3 grid where we'll have 30 queries and therefore 30 pieces of
> information that can be combined across different dimensions.
>
> The request to 'process ranges in one operation' is a good suggestion
> and, as you say, may even be the more common use case.
>
> Some action items and explanations:
>
> 1) treat ranges / files as a group
>
> I'll add functions to treat all ranges / files as a group;
> essentially no REDUCER other than concatenation. Chunking (optional)
> will occur as defined by 'yieldSize' in the BamFile.
>
> 2) class clean-up
>
> The GenomicFileViews class was the first iteration. It was overly
> complicated and the complication didn't gain us much. In my mind the
> GenomicFiles class is a stripped-down version that should replace the
> *FileViews classes. I plan to deprecate the *FileViews classes.
>
> Ideally the class(es) in GenomicFiles would inherit from
> SummarizedExperiment. A stand-alone package for SummarizedExperiment
> is in the works for the near future. Until those plans are finalized
> I'd rather not change the inheritance structure in GenomicFiles.
>
> 3) pack / unpack
>
> I experimented with this and came to the conclusion that packing /
> unpacking should be left to the user vs being done behind the scenes
> with reduceBy*. The primary difficulty is that you don't have
> pre-knowledge of the output format of the user's MAPPER. If MAPPER
> outputs an Rle, unpacking may be straightforward. If MAPPER outputs a
> NumericList or matrix or vector with no genomic coordinates then
> things are more complicated.
>
> I'm open if others have suggestions / prototypes.
>
> 4) reduceByYield
>
> This was intended for chunking through ranges in a single file. You
> can imagine using bplapply() over files where each file is chunked
> through with reduceByYield(). All ranges are chunked by 'yieldSize'
> defined in the BamFile unless a 'param' dictates a subset of ranges.
>
> 5) additional tidy
>
> I'll add a name to the 'assays' when summarize=TRUE,
> make sure return types
> are consistent when summarize=FALSE, etc.
>
> Thanks for the testing and feedback.
>
>
> Valerie
>
>
> On 09/29/2014 07:18 AM, Michael Love wrote:
>
>
> On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>
>
> I don't fully understand "the use case for reducing by range is when
> the entire dataset won't fit into memory". The basic assumption of
> these functions (as far as I can see) is that the output data fits in
> memory. What may not fit in memory is various earlier "iterations" of
> the data. For example, in my use case, if I just read in all the data
> in all the ranges in all the samples it is basically Rle's across
> 450MB times 38 files, which is not small. What fits in memory is
> smaller chunks of this; that is true for every application.
>
>
> I was unclear. I meant that, in approach1, you have
> an object,
> all.Rle, which contains Rles for every range over
> every file. Can you
> actually run this approach on the full dataset?
>
> Reducing by range (or file) only makes sense when the final output
> includes one entity for several ranges/files ... right? So I don't
> see how reduce would help me.
>
>
> Yes, I think we agree. This is not a good use case for reduce by
> range as now implemented.
>
> This is a use case which would benefit from the user-facing function
> internally calling pack()/unpack() to reduce the number of import()
> calls, and then in the end giving back the mean coverage over the
> input ranges. I want this too.
>
>
> https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456
> (link to the old github repo, the new github repo is named GenomicFiles)
>
> As I see the pack()/unpack() paradigm, it just re-orders the query
> ranges (which is super nice and matters a lot for speed for some
> applications). As I understand the code (and my understanding is
> developing) we need an extra layer to support processing multiple
> ranges in one operation.
>
> I am happy to help apart from complaining.
>
> Best,
> Kasper
>
> On Mon, Sep 29, 2014 at 8:55 AM, Michael Love
> <michaelisaiahlove at gmail.com> wrote:
>
>
>
> Thanks for checking it out and benchmarking. We should be more clear
> in the docs that the use case for reducing by range is when the
> entire dataset won't fit into memory. Also, we had some discussion
> and Valerie had written up methods for packing up the ranges supplied
> by the user into a better form for querying files. In your case it
> would have packed many ranges together, to reduce the number of
> import() calls like your naive approach. See the pack/unpack
> functions, which are not in the vignette but are in the man pages. If
> I remember, writing code to unpack() the result was not so simple,
> and development of these functions was set aside for the moment.
>
> Mike
>
> On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>
>
>
> I am testing GenomicFiles.
>
> My use case: I have 231k ranges of average width 1.9kb and total
> width 442 MB. I also have 38 BigWig files. I want to compute the
> average coverage of the 38 BigWig files inside each range. This is
> similar to wanting to get coverage of - say - all promoters in the
> human genome.
>
> My data is residing on a file server which is connected to the
> compute node through ethernet, so basically I have very slow file
> access. Also, the BigWig files are (perhaps unusually) big: ~2GB /
> piece.
>
> Below I have two approaches, one using basically straightforward code
> and the other using GenomicFiles (examples are 10 / 100 ranges on 5
> files). Basically GenomicFiles is 10-20x slower than the
> straightforward approach. This is most likely because
> reduceByRange/reduceByFile processes each (range, file) separately.
>
> It seems natural (to me) to allow some chunking of the mapping of the
> ranges. My naive approach is fast (I assume) because I read multiple
> ranges through one import() command. I know I have random access to
> the BigWig files, but there must still be some overhead of seeking
> and perhaps more importantly instantiating/moving stuff back and
> forth to R. So basically I would like to be able to write a MAP
> function which takes
> ranges, file
> instead of just
> range, file
> And then chunk over say 1,000s of ranges. I could then have an
> argument to reduceByXX called something like rangeSize, which is kind
> of yieldSize.
>
> Perhaps this is what is intended for the
> reduceByYield on BigWig files?
>
> In a way, this is what is done in the vignette with the
> coverage(BAMFILE) example where tileGenome is essentially constructed
> by the user to chunk the coverage computation.
>
> I think the example of a set of regions I am querying on the files
> will be an extremely common use case going forward. The raw data for
> all the regions together is "too big" but I do a computation on each
> region to reduce the size. In this situation, all the focus is on the
> MAP step. I see the need for REDUCE in the case of the t-test example
> in the vignette, where the return object is a single "thing" for each
> base. But in general, I think we will use these file structures a lot
> to construct things without REDUCE (across neither files nor ranges).
>
> Also, something completely different, it seems like it would be
> convenient for stuff like BigWigFileViews to not have to actually
> parse the file in the MAP step. Somehow I would envision some kind of
> reading function, stored inside the object, which just returns an Rle
> when I ask for a (range, file). Perhaps this is better left for
> later.
>
> Best,
> Kasper
>
> Examples
>
> approach1 <- function(ranges, files) {
>     ## By hand
>     all.Rle <- lapply(files, function(file) {
>         rle <- import(file, as = "Rle", which = ranges, format = "bw")[ranges]
>         rle
>     })
>     print(object.size(all.Rle), units = "auto")
>     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>         sapply(xx, mean)
>     }))
>     invisible(mat)
> }
>
> system.time({
>     mat1 <- approach1(all.grs[1:10], all.files[1:5])
> })
>
> 160.9 Kb
>    user  system elapsed
>   1.109   0.001   1.113
>
> system.time({
>     mat1 <- approach1(all.grs[1:100], all.files[1:5])
> })  # less than 4x slower than previous call
>
> 3 Mb
>    user  system elapsed
>   4.101   0.019   4.291
>
>
> approach2 <- function(ranges, files) {
>     gf <- GenomicFiles(rowData = ranges, files = files)
>     MAPPER <- function(range, file, ...) {
>         library(rtracklayer)
>         rle <- import(file, which = range, as = "Rle", format = "bw")[range]
>         mean(rle)
>     }
>     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>                           BPPARAM = SerialParam())
>     sExp
> }
>
> system.time({
>     mat2 <- approach2(all.grs[1:10], all.files[1:5])
> })  # 8-9x slower than approach1
>
>    user  system elapsed
>   9.349   0.280   9.581
>
> system.time({
>     mat2 <- approach2(all.grs[1:100], all.files[1:5])
> })  # 9x slower than previous call, 20x slower than approach1 on the same input
>
>    user  system elapsed
>  89.310   0.627  91.044
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>