[Bioc-devel] GPos slower than GRanges ?

Hervé Pagès hpages at fredhutch.org
Fri Feb 9 10:54:06 CET 2018


Hi Charles,

On 02/08/2018 08:03 PM, Charles Plessy wrote:
> Hello,
> 
> I have just discovered the GPos class, and I would like to use it in
> my "CAGEr" package, where for the moment I store single-nucleotide
> positions of transcription start sites in GRanges of width 1.
> 
> But a simple microbenchmark suggests that, although GPos are more
> memory-efficient, they also may be more CPU-hungry, at least
> with the "range" function.
> 
> Is there a way to optimise, or is it better to stay with
> GRanges of width 1 when memory is not an issue ?
> 
>> gpos1 <- GPos(c("chr1:44-53", "chr1:5-10", "chr2:2-5"))
> 
>> granges1 <- GRanges(gpos1)
> 
>> microbenchmark::microbenchmark(range(granges1), range(gpos1))
> Unit: milliseconds
>              expr      min       lq    mean   median       uq      max neval cld
>   range(granges1) 21.42761 21.97009 24.1627 22.24532 22.92655 179.9715   100  a
>      range(gpos1) 30.11515 30.84472 32.8824 31.36639 32.19281 104.3027   100   b

Timing such small objects is not really meaningful.

GPos objects are optimized to perform well when they contain long runs
of consecutive positions. For example:

   library(GenomicRanges)
   library(microbenchmark)

   gpos2 <- GPos(GRanges("chr1",
                         successiveIRanges(rep(990, 2000), gapwidth=10)))
   gr2 <- as(gpos2, "GRanges")

   microbenchmark(range(gpos2), range(gr2))
   # Unit: milliseconds
   #          expr      min       lq     mean   median       uq      max neval cld
   #  range(gpos2) 102.4948 111.9229 137.5418 116.0058 134.2129 239.0805   100  a
   #    range(gr2) 111.3651 118.2075 154.2758 133.3702 211.2164 232.4975   100   b

   microbenchmark(coverage(gpos2), coverage(gr2))
   # Unit: milliseconds
   #             expr       min       lq     mean   median       uq      max neval cld
   #  coverage(gpos2)  98.09502 106.3827 143.7039 111.9778 138.1875 304.8126   100  a
   #    coverage(gr2) 152.82492 168.9123 204.8362 175.1129 189.7343 363.9795   100   b

So the difference in speed is not big, but there is a small advantage
for GPos.

The big advantage of GPos, however, is its memory footprint:

   object.size(gpos2)
   # 26520 bytes
   object.size(gr2)
   # 15849120 bytes
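
To give a feel for where a saving of that order can come from, here is a
toy model in base R. It is only a sketch, not the actual GPos internals:
it assumes that long runs of consecutive positions can be described by one
(start, width) pair per run, and the names `runs`, `expand_runs` and
`positions` are made up for the illustration. It reproduces the layout of
gpos2 (2000 runs of 990 positions separated by gaps of 10) and compares the
compact description with the fully expanded one.

```r
# Toy model only: base R, no Bioconductor packages needed.
# 2000 runs of 990 consecutive positions, separated by gaps of 10,
# mirroring the layout used for gpos2.
run_width <- rep(990L, 2000)
gap <- 10L
run_start <- cumsum(c(1L, head(run_width + gap, -1)))

# Compact form: one (start, width) pair per run.
runs <- data.frame(start = run_start, width = run_width)

# Expanded form: one integer per position (hypothetical helper).
expand_runs <- function(runs) {
  unlist(Map(function(s, w) seq.int(s, length.out = w),
             runs$start, runs$width),
         use.names = FALSE)
}
positions <- expand_runs(runs)

length(positions)       # number of positions described by the 2000 runs
object.size(runs)       # size of the compact description
object.size(positions)  # size of the expanded vector
```

The compact form grows with the number of runs, while the expanded form
grows with the number of positions, so the gap widens as runs get longer.
With many short runs (isolated positions, for example) the compact form
would have little or no advantage.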

Anyway, if memory is not an issue, then it won't make much difference
whether you use GRanges or GPos.

Cheers,
H.


> 
>> sessionInfo()
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Debian GNU/Linux 9 (stretch)
> 
> Matrix products: default
> BLAS: /usr/lib/libblas/libblas.so.3.7.0
> LAPACK: /usr/lib/lapack/liblapack.so.3.7.0
> 
> locale:
>   [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8
>   [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>   [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] GenomicRanges_1.31.16 GenomeInfoDb_1.15.5   IRanges_2.13.22       S4Vectors_0.17.30
> [5] BiocGenerics_0.25.2
> 
> loaded via a namespace (and not attached):
>   [1] Rcpp_0.12.14            XVector_0.19.8          MASS_7.3-47             splines_3.4.3
>   [5] zlibbioc_1.24.0         munsell_0.4.3           lattice_0.20-35         colorspace_1.3-2
>   [9] rlang_0.1.4             multcomp_1.4-8          plyr_1.8.4              tools_3.4.3
> [13] grid_3.4.3              gtable_0.2.0            TH.data_1.0-8           survival_2.41-3
> [17] yaml_2.1.15             lazyeval_0.2.1          tibble_1.3.4            Matrix_1.2-12
> [21] GenomeInfoDbData_0.99.1 ggplot2_2.2.1           codetools_0.2-15        microbenchmark_1.4-2.1
> [25] bitops_1.0-6            RCurl_1.95-4.10         sandwich_2.4-0          compiler_3.4.3
> [29] scales_0.5.0            mvtnorm_1.0-6           zoo_1.8-0
> 
> (I have also made a benchmark on "real" data, which confirmed the test above)
> 
> Have a nice day,
> 
> Charles
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


