[Bioc-devel] GenomicFiles: chunking

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Mon Sep 29 04:57:03 CEST 2014


I should also have compared to just using coverage(BigWigFileViews)

approach3 <- function(ranges, files) {
    ## views object pairing each query range with each BigWig file
    bwViews <- BigWigFileViews(fileRange = ranges, fileList = files)
    ## coverage for every (range, file) pair, accessible via assay()
    cov <- coverage(bwViews)
    ## mean coverage per (range, file) pair
    vec <- sapply(assay(cov), function(xx) mean(xx[[1]]))
    out <- matrix(vec, ncol = length(files), nrow = length(ranges))
    out
}

system.time({
mat3 <- approach3(all.grs[1:10], all.files[1:5])
})

  ##  user  system elapsed
  ## 9.705   1.531   2.646

system.time({
mat3 <- approach3(all.grs[1:100], all.files[1:5])
})

 ##   user  system elapsed
 ## 66.454   1.587  18.151


On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:

> I am testing GenomicFiles.
>
> My use case: I have 231k ranges of average width 1.9 kb and total width
> 442 Mb.  I also have 38 BigWig files.  I want to compute the average
> coverage of the 38 BigWig files inside each range.  This is similar to
> wanting to get coverage of, say, all promoters in the human genome.
>
> My data resides on a file server connected to the compute node over
> ethernet, so file access is very slow.  Also, the BigWig files are
> (perhaps unusually) big: ~2 GB apiece.
>
> Below I have two approaches: one using straightforward code and the
> other using GenomicFiles (examples are 10 / 100 ranges on 5 files).
> GenomicFiles is 10-20x slower than the straightforward approach, most
> likely because reduceByRange/reduceByFile processes each (range, file)
> pair separately.
>
> It seems natural (to me) to allow some chunking of the mapping over the
> ranges.  My naive approach is fast (I assume) because I read multiple
> ranges through one import() command.  I know I have random access to
> the BigWig files, but there must still be some overhead from seeking
> and, perhaps more importantly, from instantiating objects and moving
> them back and forth to R.  So basically I would like to be able to
> write a MAP function which takes
>   ranges, file
> instead of just
>   range, file
> and then chunk over, say, 1,000s of ranges.  I could then give
> reduceByXX an argument called something like rangeSize, which is akin
> to yieldSize.
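>
> To make the idea concrete, a chunked MAP along these lines might look
> roughly like the sketch below.  This is only an illustration of the
> interface I have in mind, not existing GenomicFiles API; the chunk size
> of 1,000 and the manual split() driver are made up:
>
> chunkedMAP <- function(ranges, file, ...) {
>     ## one import() call covers the whole chunk of ranges
>     rle <- import(file, which = ranges, as = "Rle", format = "bw")[ranges]
>     sapply(rle, mean)
> }
> ## hypothetical driver, standing in for a rangeSize-style argument:
> ## split the ranges into chunks of ~1,000, map each chunk over each file
> chunks <- split(all.grs, ceiling(seq_along(all.grs) / 1000))
> res <- lapply(chunks, function(chunk)
>     sapply(all.files, function(file) chunkedMAP(chunk, file)))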
>
> Perhaps this is what reduceByYield is intended for with BigWig files?
>
> In a way, this is what is done in the vignette's coverage(BAMFILE)
> example, where the user essentially chunks the coverage computation by
> constructing tiles with tileGenome.
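>
> For reference, the tiling pattern I mean is roughly the following; this
> is a sketch from memory rather than a quote from the vignette, and the
> BAM file name and tile width are placeholders:
>
> library(GenomicRanges)
> library(Rsamtools)
> ## chunk the genome into tiles; each tile then becomes one unit of work
> ## for the coverage computation
> tiles <- tileGenome(seqinfo(BamFile("example.bam")), tilewidth = 1e7,
>                     cut.last.tile.in.chrom = TRUE)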
>
> I think this example, querying a set of regions across a set of files,
> will be an extremely common use case going forward.  The raw data for
> all the regions together is "too big", but I do a computation on each
> region to reduce the size.  In this situation, all the focus is on the
> MAP step.  I see the need for REDUCE in the case of the t-test example
> in the vignette, where the return object is a single "thing" for each
> base.  But in general, I think we will use these file structures a lot
> to construct things without any REDUCE (across either files or ranges).
>
> Also, on a completely different note, it seems like it would be
> convenient for something like BigWigFileViews not to have to actually
> parse the file in the MAP step.  I envision some kind of reading
> function, stored inside the object, which just returns an Rle when I
> ask for a (range, file) pair.  Perhaps this is better left for later.
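>
> A minimal sketch of the kind of stored reader I mean (the reader
> argument below is made up, not existing API):
>
> ## hypothetical: a reader closure that a views-style object could carry,
> ## so the MAP function never has to deal with the file format itself
> bigwigReader <- function(file, range)
>     import(file, which = range, as = "Rle", format = "bw")[range]
> ## MAP would then reduce to something like:
> MAP <- function(range, file, ..., reader = bigwigReader)
>     mean(reader(file, range))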
>
> Best,
> Kasper
>
> Examples
>
> approach1 <- function(ranges, files) {
>     ## By hand
>     all.Rle <- lapply(files, function(file) {
>         rle <- import(file, as = "Rle", which = ranges,
>                       format = "bw")[ranges]
>         rle
>     })
>     print(object.size(all.Rle), units = "auto")
>     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>         sapply(xx, mean)
>     }))
>     invisible(mat)
> }
>
> system.time({
> mat1 <- approach1(all.grs[1:10], all.files[1:5])
> })
>
> 160.9 Kb
>     user  system elapsed
>    1.109   0.001   1.113
>
> system.time({
> mat1 <- approach1(all.grs[1:100], all.files[1:5])
> }) # less than 4x slower than previous call
>
> 3 Mb
>     user  system elapsed
>    4.101   0.019   4.291
>
>
> approach2 <- function(ranges, files) {
>     gf <- GenomicFiles(rowData = ranges, files = files)
>     MAPPER <- function(range, file, ...) {
>         library(rtracklayer)
>         rle <- import(file, which = range, as = "Rle",
>                       format = "bw")[range]
>         mean(rle)
>     }
>     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>                           BPPARAM = SerialParam())
>     sExp
> }
>
> system.time({
> mat2 <- approach2(all.grs[1:10], all.files[1:5])
> })  # 8-9x slower than approach1
>
>    user  system elapsed
>    9.349   0.280   9.581
>
> system.time({
> mat2 <- approach2(all.grs[1:100], all.files[1:5])
> })  # 9x slower than previous call, 20x slower than approach1 on same input
>
>    user  system elapsed
>   89.310   0.627  91.044
>
