[Bioc-devel] GenomicFiles: chunking
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Mon Sep 29 04:57:03 CEST 2014
I should also have compared to just using coverage() on a BigWigFileViews object:
approach3 <- function(ranges, files) {
    bwViews <- BigWigFileViews(fileRange = ranges, fileList = files)
    cov <- coverage(bwViews)
    ## take the mean of each per-(range, file) coverage Rle
    vec <- sapply(assay(cov), function(xx) mean(xx[[1]]))
    out <- matrix(vec, ncol = length(files), nrow = length(ranges))
    out
}
system.time({
mat3 <- approach3(all.grs[1:10], all.files[1:5])
})
## user system elapsed
## 9.705 1.531 2.646
system.time({
mat3 <- approach3(all.grs[1:100], all.files[1:5])
})
## user system elapsed
## 66.454 1.587 18.151
On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:
> I am testing GenomicFiles.
>
> My use case: I have 231k ranges of average width 1.9kb and total width 442
> Mb. I also have 38 BigWig files. I want to compute the average coverage
> of the 38 BigWig files inside each range. This is similar to wanting to
> get coverage of - say - all promoters in the human genome.
>
> My data resides on a file server connected to the compute node over
> ethernet, so basically I have very slow file access. Also, the BigWig
> files are (perhaps unusually) big: ~2 GB apiece.
>
> Below I have two approaches, one using basically straightforward code and
> the other using GenomicFiles (the examples use 10 / 100 ranges on 5
> files). GenomicFiles is 10-20x slower than the straightforward approach,
> most likely because reduceByRange/reduceByFile processes each
> (range, file) pair separately.
>
> It seems natural (to me) to allow some chunking of the mapping over the
> ranges. My naive approach is fast (I assume) because I read multiple
> ranges through one import() command. I know I have random access to the
> BigWig files, but there must still be some overhead from seeking and,
> perhaps more importantly, from instantiating/moving stuff back and forth
> to R. So basically I would like to be able to write a MAP function which
> takes
>   ranges, file
> instead of just
>   range, file
> and then chunk over, say, 1,000s of ranges. I could then have an argument
> to reduceByXX called something like rangeSize, which is akin to yieldSize.
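>
> Roughly, the MAP I have in mind would look like this (a sketch only; the
> rangeSize argument is hypothetical, and the import() call mirrors
> approach1 below):
>
> chunkMAPPER <- function(ranges, file, ...) {
>     library(rtracklayer)
>     ## one import() for the whole chunk of ranges
>     rle <- import(file, which = ranges, as = "Rle", format = "bw")[ranges]
>     sapply(rle, mean)  # one value per range in the chunk
> }
> ## hypothetical call: reduceByRange(gf, MAP = chunkMAPPER, rangeSize = 1000)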
>
> Perhaps this is what reduceByYield is intended for on BigWig files?
>
> In a way, this is what is done in the vignette with the coverage(BAMFILE)
> example where tileGenome is essentially constructed by the user to chunk
> the coverage computation.
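>
> Roughly, that chunking pattern looks like this (a sketch, not the vignette
> code; the tile width is arbitrary):
>
> library(GenomicRanges)
> library(rtracklayer)
> tiles <- tileGenome(seqinfo(BigWigFile(all.files[1])),
>                     tilewidth = 1e7, cut.last.tile.in.chrom = TRUE)
> ## the tiles then play the role of the ranges being mapped over, so each
> ## MAP call touches many bases at once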
>
> I think my example - querying a set of regions across a set of files -
> will be an extremely common use case going forward. The raw data for all
> the regions together is "too big", but I do a computation on each region
> to reduce the size. In this situation, all the focus is on the MAP step.
> I see the need for REDUCE in the case of the t-test example in the
> vignette, where the return object is a single "thing" for each base. But
> in general, I think we will use these file structures a lot to construct
> things without REDUCE (neither across files nor across ranges).
>
> Also, something completely different: it seems like it would be convenient
> for stuff like BigWigFileViews to not have to actually parse the file in
> the MAP step. I envision some kind of reading function, stored inside the
> object, which just returns an Rle when I ask for a (range, file); a very
> rough sketch is below. Perhaps this is better left for later.
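>
> The sketch (the names are made up, just to show the shape of the idea):
>
> ## hypothetical: the object stores the reader, so MAP never calls import()
> reader <- function(file, range) {
>     rtracklayer::import(file, which = range, as = "Rle", format = "bw")[range]
> }
> ## and the user-facing MAP shrinks to
> MAPPER <- function(rle, range, file) mean(rle)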
>
> Best,
> Kasper
>
> Examples
>
> approach1 <- function(ranges, files) {
>     ## By hand: one import() per file pulls all the ranges at once
>     all.Rle <- lapply(files, function(file) {
>         rle <- import(file, as = "Rle", which = ranges, format = "bw")[ranges]
>         rle
>     })
>     print(object.size(all.Rle), units = "auto")
>     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>         sapply(xx, mean)
>     }))
>     invisible(mat)
> }
>
> system.time({
> mat1 <- approach1(all.grs[1:10], all.files[1:5])
> })
>
> 160.9 Kb
> user system elapsed
> 1.109 0.001 1.113
>
> system.time({
> mat1 <- approach1(all.grs[1:100], all.files[1:5])
> }) # less than 4x slower than previous call
>
> 3 Mb
> user system elapsed
> 4.101 0.019 4.291
>
>
> approach2 <- function(ranges, files) {
>     gf <- GenomicFiles(rowData = ranges, files = files)
>     MAPPER <- function(range, file, ...) {
>         library(rtracklayer)
>         ## one import() per (range, file) pair
>         rle <- import(file, which = range, as = "Rle", format = "bw")[range]
>         mean(rle)
>     }
>     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>                           BPPARAM = SerialParam())
>     sExp
> }
>
> system.time({
> mat2 <- approach2(all.grs[1:10], all.files[1:5])
> }) # 8-9x slower than approach1
>
> user system elapsed
> 9.349 0.280 9.581
>
> system.time({
> mat2 <- approach2(all.grs[1:100], all.files[1:5])
> }) # 9x slower than previous call, 20x slower than approach1 on the same input
>
> user system elapsed
> 89.310 0.627 91.044
>