[Bioc-devel] GenomicFiles: chunking
Valerie Obenchain
vobencha at fhcrc.org
Mon Oct 27 20:32:10 CET 2014
Sounds great. I think GenomicFiles is a good place for such a function -
it's along the lines of what we wanted to accomplish with pack / unpack.
Maybe your new function can be used by pack once finished. There's
definitely room for expanding that functionality.
Valerie
On 10/27/2014 12:07 PM, Michael Love wrote:
> hi Valerie,
>
> this sounds good to me.
>
> I am thinking of working on a function (here or elsewhere) that helps
> decide, for reduce by range, how to optimally chunk GRanges into a
> GRangesList. Practically, this could involve sampling the size of the
> imported data for a subset of cells in the (ranges, files) matrix, and
> asking/estimating the amount of memory available for each worker.
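> 
> A rough sketch of what such a helper might look like (the name, the
> memory heuristic, and sampling only the first file are all
> placeholders, not a worked-out design):
> 
> library(GenomicRanges)
> library(rtracklayer)
> 
> chunkRanges <- function(ranges, files, bytesPerWorker = 1e9, nSample = 10) {
>     ## sample a few ranges and measure the size of the imported data
>     idx <- sample(length(ranges), min(nSample, length(ranges)))
>     sizes <- vapply(idx, function(i) {
>         dat <- import(files[[1]], which = ranges[i], as = "Rle", format = "bw")
>         as.numeric(object.size(dat))
>     }, numeric(1))
>     ## how many ranges fit in the memory budget of one worker
>     chunkSize <- max(1L, floor(bytesPerWorker / mean(sizes)))
>     split(ranges, ceiling(seq_along(ranges) / chunkSize))   # a GRangesList
> }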
>
> best,
>
> Mike
>
>
> On Mon, Oct 27, 2014 at 2:35 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>
> Hi Kasper and Mike,
>
> I've added 2 new functions to GenomicFiles and deprecated the old
> classes. The vignette now has a graphic which (hopefully) clarifies
> the MAP / REDUCE mechanics of the different functions.
>
> Below is some performance testing for the new functions and answers
> to leftover questions from previous emails.
>
>
> Major changes to GenomicFiles (in devel):
>
> - *FileViews classes have been deprecated:
>
> The idea is to use the GenomicFiles class to hold any type of file,
> be it BAM, BigWig, or simply a character vector of paths, instead of
> having file-type-specific classes like BigWigFileViews. Currently
> GenomicFiles does not inherit from SummarizedExperiment, but it may
> in the future.
>
> - Add reduceFiles() and reduceRanges():
>
> These functions pass all ranges or all files to MAP in a single call,
> versus the element-by-element lapply approach taken in reduceByFile()
> and reduceByRange().
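> 
> To make the difference concrete, a minimal sketch of the two MAP
> signatures (reusing the BigWig import() idiom from the examples
> further down; 'range'/'ranges' and 'file' are just the argument names
> used in this thread):
> 
> library(rtracklayer)
> 
> ## reduceByFile() / reduceByRange(): MAP sees one range and one file at a time
> MAP_one <- function(range, file, ...)
>     mean(import(file, which = range, as = "Rle", format = "bw")[range])
> 
> ## reduceFiles(): MAP sees *all* ranges for a given file in one call
> MAP_all <- function(ranges, file, ...)
>     mean(import(file, which = ranges, as = "Rle", format = "bw")[ranges])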
>
>
> (1) Performance:
>
> When testing with reduceByFile() you noted "GenomicFiles is 10-20x
> slower than the straightforward approach". You also noted this was
> probably because of the lapply over all ranges - true. (Most likely
> there was overhead in creating the SE object as well.) With the new
> reduceFiles(), passing all ranges at once, we see performance very
> similar to that of the 'by hand' approach.
>
> In the test code I've used Bam instead of BigWig. Both test
> functions output lists, have comparable MAP and REDUCE steps etc.
>
> I used 5 files ('fls') and a GRanges ('grs') of length 100.
> > length(grs)
> [1] 100
>
> > sum(width(grs))
> [1] 1000000
>
> FUN1 is the 'by hand' version. These results are similar to what you
> saw, not quite a 4x difference between 10 and 100 ranges.
>
> >> microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
> > Unit: seconds
> >                   expr      min       lq     mean   median       uq      max neval
> >   FUN1(grs[1:10], fls) 1.177858 1.190239 1.206127 1.201331 1.222298 1.256741    10
> >  FUN1(grs[1:100], fls) 4.145503 4.163404 4.249619 4.208486 4.278463 4.533846    10
>
> FUN2 is the reduceFiles() approach and the results are very similar
> to FUN1.
>
> >> microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
> > Unit: seconds
> >                   expr      min       lq     mean   median       uq      max neval
> >   FUN2(grs[1:10], fls) 1.242767 1.251188 1.257531 1.253154 1.267655 1.275698    10
> >  FUN2(grs[1:100], fls) 4.251010 4.340061 4.390290 4.361007 4.384064 4.676068    10
>
>
> (2) Request for "chunking of the mapping of ranges":
>
> For now we decided not to add a 'yieldSize' argument for chunking.
> There are 2 approaches to chunking through ranges *within* the same
> file. In both cases the user splits the ranges, either before calling
> the function or in the MAP step.
>
> i) reduceByFile() with a GRangesList:
>
> The user provides a GRangesList as the 'ranges' arg. On each worker,
> lapply applies MAP to that single file and each element of the
> GRangesList (see the sketch after item iv below).
>
> ii) reduceFiles() with a MAP that handles chunking:
>
> The user splits the ranges inside MAP and uses lapply or another loop
> to iterate. For example,
>
> MAP <- function(range, file, ...) {
>     lst <- split(range, someFactor)          # 'someFactor' defines the chunks
>     someFUN <- function(rngs, file, ...) {
>         ## do something with one chunk of ranges
>     }
>     lapply(lst, someFUN, file = file, ...)
> }
>
> The same ideas apply for chunking through ranges *across* files with
> reduceByRange() and reduceRanges().
>
> iii) reduceByRange() with a GRangesList:
>
> Mike has a good example here:
> https://gist.github.com/mikelove/deaff999984dc75f125d
>
> iv) reduceRanges():
>
> 'ranges' should be a GRangesList. The MAP step will operate on one
> element of the GRangesList and all files in a single call. Unless you
> want all files at once, I'd use reduceByRange() instead.
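> 
> To make (i) and (iii) concrete, a minimal sketch (assuming 'grs' and
> 'fls' as in the timings above, chunks of ~1000 ranges, and that the
> reduce* functions accept 'ranges' and 'files' directly):
> 
> library(GenomicFiles)
> library(rtracklayer)
> 
> ## chunk the GRanges into a GRangesList; each element becomes one MAP call
> grl <- split(grs, ceiling(seq_along(grs) / 1000))
> 
> MAP <- function(range, file, ...)   # 'range' is one GRangesList element
>     mean(import(file, which = range, as = "Rle", format = "bw")[range])
> 
> res <- reduceByFile(ranges = grl, files = fls, MAP = MAP)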
>
>
> (3) Return objects have different shape:
>
> Previous question:
>
> "...
> Why is the return object of reduceByFile vs reduceByRange (with
> summarize = FALSE) different? I understand why internally you have
> different nesting schemes (files and ranges) for the two functions,
> but it is not clear to me that it is desirable to have the return
> object depend on how the computation was done.
> ..."
>
> reduceByFile() and reduceFiles() output a list the same length as
> the number of files while reduceByRange() and reduceRanges() output
> a list the same length as the number of ranges.
>
> Reduction is different depending on which function is chosen; data
> are collapsed either within a file or across files. When REDUCE does
> something substantial the outputs are not equivalent.
>
> While it's possible to get the same result (REDUCE simply unlists or
> isn't used), the two approaches were not intended to be equivalent
> ways of arriving at the same end. The idea was that the user had a
> specific use case in mind - they either wanted to collapse the data
> across or within files.
>
>
> (4) return object from coverage(BigWigFileViews):
>
> Previous comment:
>
> "...
> coverage(BigWigFileViews) returns a "wrong" assay object in my opinion,
> ...
> Specifically, each (i,j) entry in the object is an RleList with a single
> element with a name equal to the seqnames of the i'th entry in the query
> GRanges. To me, this extra nestedness is unnecessary; I would have
> expected an Rle instead of an RleList with 1 element.
> ..."
>
> The return value from coverage(x) is an RleList with one coverage
> vector per seqlevel in 'x'. Even if there is only one seqlevel, the
> result still comes back as an RleList. This is just the default
> behavior.
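> 
> A small illustration of that default (the names of the RleList
> elements are the seqlevels, and '[[' drops down to a plain Rle):
> 
> library(GenomicRanges)
> gr <- GRanges("chr1", IRanges(c(1, 5), width = 10))
> cov <- coverage(gr)    # RleList with one element per seqlevel
> cov[["chr1"]]          # a single Rle for that seqlevel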
>
>
> (5) separate the 'read' function from the MAP step
>
> Previous comment:
>
> "...
> Also, something completely different, it seems like it would be
> convenient for stuff like BigWigFileViews to not have to actually
> parse the file in the MAP step. Somehow I would envision some kind
> of reading function, stored inside the object, which just returns an
> Rle when I ask for a (range, file). Perhaps this is better left for
> later.
> ..."
>
> The current approach for the reduce* functions is for MAP to both
> extract and manipulate data. The idea of separating the extraction
> step is actually implemented in reduceByYield(). (This function used
> to be yieldReduce() in Rsamtools in past releases.) For
> reduceByYield() the user must specify YIELD (a reader function),
> MAP, REDUCE and DONE (criteria to stop iteration).
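> 
> For reference, a minimal reduceByYield() sketch (counting reads in
> one BAM file in chunks; the file name is a placeholder and the exact
> defaults may differ slightly):
> 
> library(GenomicFiles)
> library(Rsamtools)
> library(GenomicAlignments)
> 
> bf <- BamFile("example.bam", yieldSize = 100000)  # placeholder file
> n <- reduceByYield(bf,
>     YIELD  = function(x) readGAlignments(x),  # reader: one chunk per call
>     MAP    = length,                          # work on each chunk
>     REDUCE = `+`,                             # combine chunk results
>     DONE   = function(x) length(x) == 0L)     # stop when nothing is read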
>
> I'm not sure what is best here. I thought the many-argument approach
> of reduceByYield() was possibly confusing or burdensome and so
> didn't use it in the other GenomicFiles functions. Maybe it's not
> confusing but instead makes the individual steps clearer. What do
> you think:
>
> - Should the reader function be separate from the MAP? What are the
> advantages?
>
> - Should READER, MAP, REDUCE be stored inside the GenomicFiles
> object or supplied as arguments to the functions?
>
>
> (6) unnamed assay in SummarizedExperiment
>
> Previous comment:
>
> "...
> The return object of reduceByRange / reduceByFile with summarize =
> TRUE is a SummarizedExperiment with an unnamed assay. I was
> surprised to see that this is even possible.
> ..."
>
> There is no default assay name for SummarizedExperiment in general.
> I've named the assay 'data' for lack of a better term. We could also
> go with 'reducedData' or another suggestion.
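> 
> (Renaming after the fact should also be easy; 'se' here stands for
> the returned SummarizedExperiment:)
> 
> names(assays(se)) <- "reducedData"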
>
>
> Thanks for the feedback.
>
> Valerie
>
>
>
>
> On 10/01/2014 08:30 AM, Michael Love wrote:
>
> hi Kasper,
>
> For a concrete example, I posted an R and Rout file here:
>
> https://gist.github.com/mikelove/deaff999984dc75f125d
>
> Things to note: 'ranges' is a GRangesList, I cbind() the numeric
> vectors in the REDUCE, and then rbind() the final list to get the
> desired matrix.
>
> Other than the weird column name 'init', does this give you what
> you want?
>
> best,
>
> Mike
>
> On Tue, Sep 30, 2014 at 2:08 PM, Michael Love
> <michaelisaiahlove at gmail.com> wrote:
>
> hi Kasper and Valerie,
>
> In Kasper's original email:
>
> "I would like to be able to write a MAP function which takes
> ranges, file
> instead of just
> range, file
> And then chunk over say 1,000s of ranges. I could then have an
> argument to reduceByXX called something like rangeSize, which is
> kind of yieldSize."
>
> I was thinking, for your BigWig example, we get part of the way just
> by splitting the ranges into a GRangesList of your desired chunk
> size. Then the parallel iteration is over chunks of GRanges. Within
> your MAP function:
>
> import(file, which = ranges, as = "Rle", format = "bw")[ranges]
>
> returns an RleList, and calling mean() on this gives a numeric vector
> of the mean of the coverage for each range.
>
> Then the only work needed is how to package the result into something
> reasonable. What you would get now is a list (length of GRangesList)
> of lists (length of files) of vectors (length of GRanges).
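> 
> A sketch of that last re-packaging step ('res' being the nested list
> just described), mirroring the cbind()/rbind() approach in the gist:
> 
> ## res: list over chunks -> list over files -> numeric vector over ranges
> perChunk <- lapply(res, function(chunk) do.call(cbind, chunk))  # ranges x files
> mat <- do.call(rbind, perChunk)                                 # all ranges x files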
>
> best,
>
> Mike
>
> On Mon, Sep 29, 2014 at 7:09 PM, Valerie Obenchain
> <vobencha at fhcrc.org> wrote:
>
> Hi Kasper,
>
> The reduceBy* functions were intended to combine data across ranges
> or files. If we have 10 ranges and 3 files you can think of it as a
> 10 x 3 grid where we'll have 30 queries and therefore 30 pieces of
> information that can be combined across different dimensions.
>
> The request to 'process ranges in one operation' is a good suggestion
> and, as you say, may even be the more common use case.
>
> Some action items and explanations:
>
> 1) treat ranges / files as a group
>
> I'll add functions to treat all ranges / files as a group;
> essentially no REDUCER other than concatenation. Chunking (optional)
> will occur as defined by 'yieldSize' in the BamFile.
>
> 2) class clean-up
>
> The GenomicFileViews class was the first iteration. It was overly
> complicated and the complication didn't gain us much. In my mind the
> GenomicFiles class is a stripped-down version that should replace the
> *FileViews classes. I plan to deprecate the *FileViews classes.
>
> Ideally the class(es) in GenomicFiles would inherit from
> SummarizedExperiment. A stand-alone package for SummarizedExperiment
> is in the works for the near future. Until those plans are finalized
> I'd rather not change the inheritance structure in GenomicFiles.
>
> 3) pack / unpack
>
> I experimented with this and came to the conclusion that packing /
> unpacking should be left to the user vs being done behind the scenes
> with reduceBy*. The primary difficulty is that you don't have
> pre-knowledge of the output format of the user's MAPPER. If MAPPER
> outputs an Rle, unpacking may be straightforward. If MAPPER outputs a
> NumericList or matrix or vector with no genomic coordinates then
> things are more complicated.
>
> I'm open if others have suggestions / prototypes.
>
> 4) reduceByYield
>
> This was intended for chunking through ranges in a single file. You
> can imagine using bplapply() over files where each file is chunked
> through with reduceByYield(). All ranges are chunked by 'yieldSize'
> defined in the BamFile unless a 'param' dictates a subset of ranges.
>
> 5) additional tidy
>
> I'll add a name to the 'assays' when summarize=TRUE,
> make sure return types
> are consistent when summarize=FALSE, etc.
>
> Thanks for the testing and feedback.
>
>
> Valerie
>
>
> On 09/29/2014 07:18 AM, Michael Love wrote:
>
>
> On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>
>
> I don't fully understand "the use case for reducing by range is when
> the entire dataset won't fit into memory". The basic assumption of
> these functions (as far as I can see) is that the output data fits in
> memory. What may not fit in memory is various earlier "iterations" of
> the data. For example, in my use case, if I just read in all the data
> in all the ranges in all the samples it is basically Rle's across
> 450MB times 38 files, which is not small. What fits in memory is
> smaller chunks of this; that is true for every application.
>
>
> I was unclear. I meant that, in approach1, you have
> an object,
> all.Rle, which contains Rles for every range over
> every file. Can you
> actually run this approach on the full dataset?
>
> Reducing by range (or file) only makes sense when the final output
> includes one entity for several ranges/files ... right? So I don't
> see how reduce would help me.
>
>
> Yes, I think we agree. This is not a good use case for reduce by
> range as now implemented.
>
> This is a use case which would benefit from the user-facing function
> internally calling pack()/unpack() to reduce the number of import()
> calls, and then in the end giving back the mean coverage over the
> input ranges. I want this too.
>
>
> https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456
> (link to the old github repo, the new github repo is named GenomicFiles)
>
> As I see the pack()/unpack() paradigm, it just re-orders the query
> ranges (which is super nice and matters a lot for speed for some
> applications). As I understand the code (and my understanding is
> developing) we need an extra layer to support processing multiple
> ranges in one operation.
>
> I am happy to help apart from complaining.
>
> Best,
> Kasper
>
> On Mon, Sep 29, 2014 at 8:55 AM, Michael Love
> <michaelisaiahlove at gmail.com> wrote:
>
>
>
> Thanks for checking it out and benchmarking. We should be more clear
> in the docs that the use case for reducing by range is when the
> entire dataset won't fit into memory. Also, we had some discussion
> and Valerie had written up methods for packing up the ranges supplied
> by the user into a better form for querying files. In your case it
> would have packed many ranges together, to reduce the number of
> import() calls like your naive approach. See the pack/unpack
> functions, which are not in the vignette but are in the man pages. If
> I remember, writing code to unpack() the result was not so simple,
> and development of these functions was set aside for the moment.
>
> Mike
>
> On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>
>
>
> I am testing GenomicFiles.
>
> My use case: I have 231k ranges of average width 1.9kb and total
> width 442 MB. I also have 38 BigWig files. I want to compute the
> average coverage of the 38 BigWig files inside each range. This is
> similar to wanting to get coverage of - say - all promoters in the
> human genome.
>
> My data is residing on a file server which is connected to the
> compute node through ethernet, so basically I have very slow file
> access. Also, the BigWig files are (perhaps unusually) big: ~2GB /
> piece.
>
> Below I have two approaches, one using basically straightforward code
> and the other using GenomicFiles (examples are 10 / 100 ranges on 5
> files). Basically GenomicFiles is 10-20x slower than the
> straightforward approach. This is most likely because
> reduceByRange/reduceByFile processes each (range, file) separately.
>
> It seems natural (to me) to allow some chunking of the mapping of the
> ranges. My naive approach is fast (I assume) because I read multiple
> ranges through one import() command. I know I have random access to
> the BigWig files, but there must still be some overhead of seeking
> and perhaps more importantly instantiating/moving stuff back and
> forth to R. So basically I would like to be able to write a MAP
> function which takes
> ranges, file
> instead of just
> range, file
> And then chunk over say 1,000s of ranges. I could then have an
> argument to reduceByXX called something like rangeSize, which is kind
> of yieldSize.
>
> Perhaps this is what is intended for the
> reduceByYield on BigWig files?
>
> In a way, this is what is done in the vignette with the
> coverage(BAMFILE) example where tileGenome is essentially constructed
> by the user to chunk the coverage computation.
>
> I think the example of a set of regions I am querying on the files
> will be an extremely common use case going forward. The raw data for
> all the regions together is "too big" but I do a computation on each
> region to reduce the size. In this situation, all the focus is on the
> MAP step. I see the need for REDUCE in the case of the t-test example
> in the vignette, where the return object is a single "thing" for each
> base. But in general, I think we will use these file structures a lot
> to construct things without REDUCE (across neither files nor ranges).
>
> Also, something completely different, it seems like it would be
> convenient for stuff like BigWigFileViews to not have to actually
> parse the file in the MAP step. Somehow I would envision some kind of
> reading function, stored inside the object, which just returns an Rle
> when I ask for a (range, file). Perhaps this is better left for
> later.
>
> Best,
> Kasper
>
> Examples
>
> approach1 <- function(ranges, files) {
>     ## By hand
>     all.Rle <- lapply(files, function(file) {
>         rle <- import(file, as = "Rle", which = ranges, format = "bw")[ranges]
>         rle
>     })
>     print(object.size(all.Rle), units = "auto")
>     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>         sapply(xx, mean)
>     }))
>     invisible(mat)
> }
>
> system.time({
>     mat1 <- approach1(all.grs[1:10], all.files[1:5])
> })
>
> 160.9 Kb
>    user  system elapsed
>   1.109   0.001   1.113
>
> system.time({
>     mat1 <- approach1(all.grs[1:100], all.files[1:5])
> })  # less than 4x slower than previous call
>
> 3 Mb
>    user  system elapsed
>   4.101   0.019   4.291
>
>
> approach2 <- function(ranges, files) {
>     gf <- GenomicFiles(rowData = ranges, files = files)
>     MAPPER <- function(range, file, ...) {
>         library(rtracklayer)
>         rle <- import(file, which = range, as = "Rle", format = "bw")[range]
>         mean(rle)
>     }
>     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>                           BPPARAM = SerialParam())
>     sExp
> }
>
> system.time({
>     mat2 <- approach2(all.grs[1:10], all.files[1:5])
> })  # 8-9x slower than approach1
>
>    user  system elapsed
>   9.349   0.280   9.581
>
> system.time({
>     mat2 <- approach2(all.grs[1:100], all.files[1:5])
> })  # 9x slower than previous call, 20x slower than approach1 on the same input
>
>    user  system elapsed
>  89.310   0.627  91.044
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>