<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"><div class="signature">Max,<br/></div><div class="signature"><br/></div><div class="signature">Package ff is designed to support resampling with low RAM requirements, provided the worker processes share a common filesystem.<br/></div><div class="signature"><br/></div><div class="signature">The benefits are:<br/></div><div class="signature">- you can resample in parallel from a shared data.frame on disk (no RAM is needed for the full dataset, only for the resamples being processed in parallel)<br/></div><div class="signature">- since ff uses memory mapping, the dataset is cached in the shared filesystem cache (the data is cached at most once, which reduces disk access, rather than once per worker)<br/></div><div class="signature">- if work is distributed across multiple workers, there is no need to send datasets from the master to the workers; sending indices is enough, and those can be very small, see ?chunk.ffdf and ?ri<br/></div><div class="signature"><br/></div><div class="signature">Note also that:<br/></div><div class="signature"><div class="signature">- for certain low-cardinality integer columns the size on disk can be reduced dramatically, see ?vmode<br/></div>- even without caching, taking a (random) resample will at worst cause sequential reads on disk, because indices are sorted before the access, see ?as.hi<br/></div><div class="signature">- matrices can be stored in row-major order, which should speed up resampling of rows, see ?dimorder<br/></div><div class="signature">- ffdf - ff's data.frames - can be composed of matrices in row-major order - one for each vmode<br/></div><div>- if your resampling is used for voting, as in bagging, ff has special support for adding votes to a matrix, see ?add, ?swap and the 'add' parameter in ?Extract.ff<br/></div><div class="signature"><br/></div><div class="signature">The following presentation has examples of parallel access to the same ff object:<br/></div><div 
class="signature">http://ff.r-forge.r-project.org/ff&bit_UseR!2009.pdf<br/></div><div class="signature"><br/></div><div class="signature">Kind regards<br/></div><div class="signature">Jens Oehlschlägel<br/></div><div class="signature"><br/></div><div class="signature"><br/></div><div name="quote" style="margin:10px 5px 5px 10px; padding: 10px 0 10px 10px; border-left:2px solid #C3D9E5; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">

    <div style="margin:0 0 10px 0;">

        <b>Sent:</b> Tuesday, 19 February 2013 at 20:00<br/>

        <b>From:</b> "Max Kuhn" <mxkuhn@gmail.com><br/>

        <b>To:</b> r-sig-hpc@r-project.org<br/>


        <b>Subject:</b> Re: [R-sig-hpc] communicating memory requirments

    <br/></div>

    <div name="quoted-content">

        Thanks for the replies.<br/>

<br/>

Being more specific, the analyses are mostly related to resampling of<br/>

various sorts (esp. the embarrassingly parallel type) that take a data<br/>

object into a worker, run a model and send summary statistics back to the<br/>

master<br/>

<br/>

From the basic memory profiling I've done (outside of R while the scripts<br/>

are running), I might see that the master process consumes 2500M (VSIZE)<br/>

and each worker eats up a separate chunk (say 1800M). I have a medium<br/>

memory size machine that I've used and figured out how many workers I can<br/>

afford before I exceed physical memory (usually by first exceeding it).<br/>

<br/>

My experience is that people will just naively run the scripts and send an<br/>

email when it doesn't work within the limits on their machine. In the hopes<br/>

that some people might read the comments before running, I'd like to let<br/>

them know what they're in for. My current verbiage is:<br/>

<br/>

> ### WARNING: Be aware of how much memory is needed to parallel<br/>

> ### process. It can very quickly overwhelm the available hardware. We<br/>

> ### estimate the memory usage (VSIZE = total memory size) to be<br/>

> ### 2566M/core.<br/>

<br/>

That last number is determined over the workers that were created while the<br/>

script was running. I'm curious to know if the use of VSIZE is appropriate<br/>

(especially outside of *nix) and if this helps the user at all.<br/>

<br/>

The scripts are at the chapter level, so a number of objects are created in<br/>

the same session. I don't think I need to profile at the object level but at<br/>

the session level so Rprofmem doesn't appear to be the best tool.<br/>

<br/>

Thanks,<br/>

<br/>

Max<br/>

<br/>

<br/>

On Tue, Feb 19, 2013 at 1:11 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote:<br/>

<br/>

> On 02/19/2013 05:16 AM, Max Kuhn wrote:<br/>

><br/>

>> I have some scripts for a book that I will be publishing and they use<br/>

>> parallel processing (via foreach). Some of the analyses use more memory<br/>

>> than some users will have on hand and, as the number of workers increases,<br/>

>> so do the memory demands.<br/>

>><br/>

>> I'd like to report the memory requirements in a way that most people will<br/>

>> understand (including me).<br/>

>><br/>

><br/>

> not really answering your question, but it seems like parallel evaluation<br/>

> in shared memory computers comes with an implicit need to manage memory,<br/>

> and that one would rather strive to implement algorithms that do their job<br/>

> in a memory efficient way. Probably this means iterating through data and<br/>

> aggregating results, which is the approach of biglm. The user shouldn't<br/>

> really be exposed to the need to choose a computer (or package) based on<br/>

> memory consumption of algorithms. I'm not throwing stones, having seldom<br/>

> managed this myself.<br/>

><br/>

> Martin<br/>

><br/>

><br/>

>> On OS X, I ran the scripts and did a rolling append of 'top' to capture<br/>

>> the memory used by the master process and the workers over time.<br/>

>><br/>

>> Can anyone suggest which parameters I should report (e.g. VSIZE)? Is the<br/>

>> situation appreciably different on Windows?<br/>

>><br/>

>> I admit to being fairly ignorant on this (complicated) subject so any<br/>

>> approach to informing the users would be very welcome.<br/>

>><br/>

>> Thanks,<br/>

>><br/>

>> Max<br/>

>><br/>


>><br/>

>> _______________________________________________<br/>

>> R-sig-hpc mailing list<br/>

>> R-sig-hpc@r-project.org<br/>

>> <a href="https://stat.ethz.ch/mailman/listinfo/r-sig-hpc" target="_blank">https://stat.ethz.ch/mailman/listinfo/r-sig-hpc</a><br/>

>><br/>

>><br/>

><br/>

> --<br/>

> Computational Biology / Fred Hutchinson Cancer Research Center<br/>

> 1100 Fairview Ave. N.<br/>

> PO Box 19024 Seattle, WA 98109<br/>

><br/>

> Location: Arnold Building M1 B861<br/>

> Phone: (206) 667-2793<br/>

><br/>

<br/>

<br/>

<br/>

-- <br/>

<br/>

Max<br/>

<br/>



    </div>

</div><div><br/><br/></div></div></body></html>