[Bioc-devel] Memory usage for bplapply

Martin Morgan mtmorgan.bioc using gmail.com
Mon Jan 7 05:18:52 CET 2019


From the earlier example, whether the worker sees all the data or not depends on whether it is in the environment of FUN, the object sent to the worker.
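One way to see what would travel with FUN is to measure its serialized size, since a socket worker receives roughly what serialize() produces. The following is a minimal base-R sketch, not BiocParallel's own code; make_closure() and the other object names are hypothetical.

```r
# A function whose defining environment is the global environment:
# that environment is serialized as a reference, not by value.
f_global <- function(x) sum(x)
environment(f_global) <- globalenv()

# A function defined inside another function: its environment, including
# the large vector v, is serialized along with it.
make_closure <- function() {
    v <- rnorm(1e5)          # about 800 KB of doubles
    function(x) sum(x)
}
f_closure <- make_closure()

size_global  <- length(serialize(f_global, NULL))
size_closure <- length(serialize(f_closure, NULL))

cat("top-level function:", size_global, "bytes\n")
cat("closure carrying v:", size_closure, "bytes\n")
```

The two functions have identical bodies; only the environment they were defined in differs, and that is what accounts for the difference in size.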

I don't really know about packages and forked processes. I'd bet that the vector allocations are essentially constant, but that the S-Expressions that point to the symbols do actually get modified, e.g., when the user creates a symbol that references a package symbol (possibly incrementing the NAMED status of the S-expression) or even when the garbage collector comes along and decides that the S-expression in the package should be moved to a different generation.

Be sure to understand the difference (maybe you do) between the environment in which the function is defined and the environment in which it is called. Also note that as you restrict the environment in which a function is defined, you restrict the operations that the function can perform. The reason a function foo in a package can call another function bar in the same package is that bar is defined in the same environment as foo. Taken to the extreme, an empty environment cannot even find `+`:

  > local(1 + 2, envir = emptyenv())
 Error in 1 + 2 : could not find function "+"
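Both points can be sketched in base R. The first part shows that a function looks up free variables where it was defined, not where it is called; the second resets a closure's environment to baseenv(), which drops captured data but keeps base functions, in contrast to emptyenv(). All names here (caller, make_sum, g_base, g_empty) are hypothetical, and this is only one possible way to restrict an environment.

```r
# Defining vs. calling environment.
y <- "defining environment"
f <- function() y
caller <- function() {
    y <- "calling environment"   # not seen by f()
    f()
}
lookup <- caller()

# Restricting the defining environment.
make_sum <- function() {
    big <- rnorm(1e5)            # captured by any closure defined here
    function(x) sum(x)
}
g <- make_sum()

g_base <- g
environment(g_base) <- baseenv()     # sum() is still found; big is dropped

g_empty <- g
environment(g_empty) <- emptyenv()   # not even sum() can be found

ok     <- g_base(1:3)
failed <- inherits(try(g_empty(1:3), silent = TRUE), "try-error")

size_g      <- length(serialize(g, NULL))
size_g_base <- length(serialize(g_base, NULL))
```

The cost of this approach is that g_base can only use what baseenv() provides; functions from other packages would need explicit pkg::fun() calls.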

Usually the bigger problem is that one serializes large data on the manager and sends it to the worker (e.g., reading chunks of a BAM file on the manager and sending each chunk to the worker) rather than arranging to do the heavy IO on the worker (e.g., sending instructions that the worker is supposed to read chromosome 1 from disk).
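The difference between shipping data and shipping instructions can be sketched without a cluster by comparing what would have to be serialized in each case. This is an illustration under assumptions: temporary .rds files stand in for BAM chunks, read_and_sum() is a hypothetical worker function, and vapply() stands in for the parallel apply.

```r
# Four chunks of data, each also written to its own file on disk.
chunks <- replicate(4, rnorm(1e4), simplify = FALSE)
paths <- vapply(seq_along(chunks), function(i) {
    path <- tempfile(fileext = ".rds")
    saveRDS(chunks[[i]], path)
    path
}, character(1))

# Manager-side IO: the data itself must be sent to the workers.
payload_data <- length(serialize(chunks, NULL))

# Worker-side IO: only the file paths are sent.
payload_paths <- length(serialize(paths, NULL))

# The worker reads its own chunk from disk and computes on it.
read_and_sum <- function(path) sum(readRDS(path))

res_paths <- vapply(paths, read_and_sum, numeric(1), USE.NAMES = FALSE)
res_data  <- vapply(chunks, sum, numeric(1))

cat("sending data: ", payload_data, "bytes\n")
cat("sending paths:", payload_paths, "bytes\n")
unlink(paths)
```

Both routes give the same sums; the second only requires that every worker can see the files.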

I think if one is worrying about memory at this level, then it's time to get a bigger computer!

Martin

On 1/6/19, 9:48 PM, "Shian Su" <su.s using wehi.edu.au> wrote:

    
    
    
    Can I get an indication here about what is expected to consume memory under the fork and socket models, as well as patterns to mitigate excessive memory consumption?
    
    
    When using sockets, the model is that of multiple communicating machines running on their own memory, so it makes sense that memory usage is duplicated for loaded packages and the parent environment. But is the whole data object duplicated, or only the portion of the tasks assigned to a thread? I.e., with 4 MB of packages, 4 MB of parent environment, and 4 MB of data to run bplapply over, is each thread going to consume 12 MB or 9 MB of memory? It is unclear to me whether the data object operated on should be thought of as a part of the parent environment.
    
    
    When using forks, the model is that of multiple processes running on shared memory. This is specific to macOS and Unix variants, and I believe the model is meant to share memory until a write operation causes variables to be copied. I also believe R’s internal memory management can potentially touch all the variables and cause copies, so the worst-case scenario is that everything is copied. What’s unclear is whether this applies to loaded packages: are they under the supervision of the garbage collector? So as per the previous scenario, from the second thread onwards, do we expect up to (0 + 4 + 1) MB, (4 + 4 + 1) MB or (4 + 4 + 4) MB of memory usage? Maybe even the ideal scenario of (0 + 0 + 1)?
    
    
    With regards to patterns to efficiently use memory, is it sufficient to keep the parent environment as compact as possible? Are there clever ways to use local() for this?
    
    
    Kind regards,
    Shian
    
    
    On 6 Jan 2019, at 9:24 am, Martin Morgan <mtmorgan.bioc using gmail.com> wrote:
    
    In one R session I did library(SummarizedExperiment) and then saved search(). In another R session I loaded the packages on the search path in reverse order, recording pryr::mem_used() after each. I ended up with
    
                              mem_used (bytes)
    methods                25870312
    datasets               30062016
    utils                  30062136
    grDevices              30062256
    graphics               30062376
    stats                  30062496
    stats4                 32262992
    parallel               32495080
    BiocGenerics           38903928
    S4Vectors              59586928
    IRanges               100171896
    GenomeInfoDb          113791328
    GenomicRanges         154729400
    Biobase               163335336
    matrixStats           163518520
    BiocParallel          167373512
    DelayedArray          280812736
    SummarizedExperiment  317386656
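A measurement of this kind can be approximated in base R. The sketch below is not the code used for the table above: it assumes gc() as a stand-in for pryr::mem_used(), loads namespaces rather than attaching packages, and uses packages that ship with R so that it runs without Bioconductor. mem_used_mb() is a hypothetical helper.

```r
# Approximate memory in use (MB), from the "used (Mb)" column of gc().
mem_used_mb <- function() sum(gc()[, 2])

# Packages shipped with R; substitute the entries of a saved search()
# path, in reverse order, for a session of interest.
pkgs <- c("stats4", "tools", "parallel")

# Load each namespace in turn and record memory use afterwards.
usage <- vapply(pkgs, function(pkg) {
    loadNamespace(pkg)
    mem_used_mb()
}, numeric(1))

print(data.frame(mem_used_mb = usage))
```

The numbers are cumulative, so the cost of one package is the difference from the row before it, and it only shows up if the package was not already loaded.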
    
    Each of the Bioconductor dependencies of SummarizedExperiment contributes to the overall size. Two dependencies (Biobase, DelayedArray) look a little unnecessary to me (they do not provide functionality that must be used by SummarizedExperiment) but removing them only reduces the total footprint to about 300MB. Somehow it makes sense that a package like SummarizedExperiment uses the data structures defined in other packages, and that it has a complex dependency graph. It is surprising how large the final footprint is.
    
    One possible way to avoid at least some of the cost is to list SummarizedExperiment under Imports: in the DESCRIPTION file, but not mention SummarizedExperiment in the NAMESPACE. Use SummarizedExperiment::assay() in the code. I think this has complicated side effects, e.g., adding methods to the imported methods table in your package (look for ".__T__" and ".__C__" (generic and class definitions) in ls(parent.env(getNamespace(<your package>)))), that indirectly increase the size of your package.
    
    I'm not exactly sure what you mean in your second paragraph, maybe a specific example (if necessary, create a small package on github) would help. It sounds like you're saying that even with doSNOW() there are additional costs to loading your package on the worker compared to in the master...
    
    Martin
    
    On 1/5/19, 2:44 PM, "Lulu Chen" <luluchen using vt.edu> wrote:
    
       Hi Martin,
    
    
       Thanks for your explanation, which helped me understand BiocParallel much better.
    
    
       I compared memory usage in my code before packaging (using doSNOW) and after packaging (using BiocParallel) and found that the increased memory is caused by the attached packages, especially 'SummarizedExperiment'. As required to support common Bioconductor classes, I used importFrom(SummarizedExperiment,assay). After deleting this, each thread saves nearly 200 MB of memory. I opened a new R session and found
    
           > pryr::mem_used()
           38.5 MB
           > library(SummarizedExperiment)
           > pryr::mem_used()
           314 MB
    
       (I am still using R 3.5.2, and am not sure whether anything has changed in the development version.) I think this should be considered an issue. A lot of packages import SummarizedExperiment just to support it, without knowing it can cause such a problem.
    
    
       My package still imports other packages, e.g. limma and fdrtool. Checked with pryr::mem_used() as above, each adds only 1~2 MB. I also checked my_package in a new session, which is around 5 MB. However, each thread in a parallel computation still grows by much more than 5 MB. I did a simulation: in my old code with doSNOW, I just inserted "require('my_package')" into the foreach loop and kept the other code the same. I used 20 cores and 1000 jobs. Each thread still grows by 20~30 MB. I don't know whether anything else adds extra cost to each thread. Thanks!
    
    
       Best,
       Lulu
    
    
    
    
    
    
       On Fri, Jan 4, 2019 at 2:38 PM Martin Morgan <mtmorgan.bioc using gmail.com> wrote:
    
    
       Memory use can be complicated to understand.

           library(BiocParallel)

           v <- replicate(100, rnorm(10000), simplify=FALSE)
           bplapply(v, sum)
    
       By default, bplapply splits 100 jobs (each element of the list) equally between the number of cores available, and sends just the necessary data to the cores. Again by default, the jobs are sent 'en masse' to the cores, so if there were 10 cores (and hence 10 tasks), the first core would receive the first 10 jobs and 10 x 10000 elements, and so on. The memory used to store v on the workers would be approximately the size of v: # of workers * jobs per worker * job size = 10 * 10 * 10000 elements.
    
       If memory were particularly tight, or if computation time for each job were highly variable, it might be advantageous to send jobs one at a time, by setting the number of tasks equal to the number of jobs, SnowParam(workers = 10, tasks = length(v)). Then the amount of memory used to store v would only be # of workers * 1 * 10000 elements; this is generally slower, because there is much more communication between the manager and the workers.
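The arithmetic in the two paragraphs above can be written out as a short sketch. It only illustrates the counting; parallel::splitIndices() is used here as a stand-in for however BiocParallel actually divides jobs into tasks, which is an assumption, and all variable names are hypothetical.

```r
n_jobs    <- 100      # length(v)
job_size  <- 10000    # elements per job
n_workers <- 10

# Default: one task per worker, each task holding several jobs.
default_tasks <- parallel::splitIndices(n_jobs, n_workers)
jobs_per_task <- lengths(default_tasks)

# Elements of v resident on the workers when all tasks are sent at once.
elements_default <- sum(jobs_per_task) * job_size

# tasks = length(v): each worker holds a single job at a time.
elements_one_at_a_time <- n_workers * 1 * job_size

# Doubles take 8 bytes each.
cat("default:       ", elements_default * 8 / 1e6, "MB\n")
cat("one at a time: ", elements_one_at_a_time * 8 / 1e6, "MB\n")
```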
    
           m <- matrix(rnorm(100 * 10000), 100, 10000)
           bplapply(seq_len(nrow(m)), function(i, m) sum(m[i, ]), m)
    
       Here bplapply doesn't know how to send just some rows to the workers, so each worker gets a complete copy of m. This would be expensive.
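One possible workaround, sketched here in base R rather than taken from the thread, is to turn the matrix into a list of rows before the apply, so that the rows themselves are the jobs and no extra argument carrying all of m is needed. The matrix is smaller than the one above to keep the sketch quick, and vapply() stands in for the parallel apply.

```r
m <- matrix(rnorm(100 * 1000), 100, 1000)

# One list element per row; element i holds the values of m[i, ].
rows <- split(m, row(m))

# Each job now needs only its own row.
row_sums <- vapply(rows, sum, numeric(1), USE.NAMES = FALSE)

# Compare what one job carries with what the whole matrix would cost.
size_whole_matrix <- length(serialize(m, NULL))
size_one_row      <- length(serialize(rows[[1]], NULL))

cat("whole matrix:", size_whole_matrix, "bytes\n")
cat("one row:     ", size_one_row, "bytes\n")
```

With BiocParallel the corresponding call would presumably be bplapply(rows, sum), so each worker receives only the rows in its tasks. The list of rows is an extra copy of the data on the manager, which is the price of this approach.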
    
           f <- function(x) sum(x)

           g <- function() {
               v <- replicate(100, rnorm(10000), simplify=FALSE)
               bplapply(v, f)
           }
    
       This has the same memory consequences as the first example above: the function `f()` is defined in the .GlobalEnv, so only the function definition (small) is sent to the workers.
    
    
           h <- function() {
               f <- function(x) sum(x)
               v <- replicate(100, rnorm(10000), simplify=FALSE)
               bplapply(v, f)
           }
    
       This is expensive. The function `f()` is defined in the body of the function `h()`, so the workers receive both the function f and the environment in which it is defined. The environment includes v, so each worker receives a slice of v (for f() to operate on) AND an entire copy of v (because it is in the environment where `f()` was defined). A similar cost would be paid in a package, if the package defined large data objects at load time.
    
       For more guidance, it might be helpful to provide a simplified example of what you did with doSNOW, and what you do with BiocParallel.
    
       Hope that helps,
    
       Martin
    
       On 1/3/19, 11:52 PM, "Bioc-devel on behalf of Lulu Chen" <bioc-devel-bounces using r-project.org on behalf of luluchen using vt.edu> wrote:
    
           Dear all,
    
           I ran into a memory issue with bplapply and SnowParam(). I need to calculate something from a large matrix many, many times. But from the discussion in https://support.bioconductor.org/p/92587, I learned that bplapply copies the current and parent environments to each worker thread. That means the large matrix in my package will be copied many times. Do you have better suggestions for the Windows platform?

           Before I tried to package my code, I used the doSNOW package with foreach %dopar%. It seemed to consume less memory in each core (almost the size of the matrix the task needs). But bplapply seems to copy more than the objects in the current environment and the environment one level above. I am very confused and can only guess that it was copying everything.
    
           Thanks for any help!
           Best,
           Lulu
    
    
           _______________________________________________
           Bioc-devel using r-project.org mailing list
           https://stat.ethz.ch/mailman/listinfo/bioc-devel
    
    
    
    
    
    
    

