[Bioc-sig-seq] PDict question

Herve Pages hpages at fhcrc.org
Tue Jun 3 22:48:47 CEST 2008


Hi Harris,

[Sorry, but this discussion belongs where it came from, so I'm
putting it back there.]

Harris A. Jaffee wrote:
> Nothing intelligent to say, but I agree -- it seems
> very suspicious that he had a failure.  Also, if you
> use some form of malloc(), won't you have access to
> *virtual* memory, so his so-called "20GB" of RAM is
> somewhat irrelevant.  More than that is available.
> 
> But I don't understand your memory allocation scheme.
> I just did PDict on 4M unique strings of width 36.  It
> ran up about 10 minutes of CPU time and was increasing
> in size VERY gradually from 2.5G to about 12G, but it
> didn't pass 5G until 9 minutes or so.  Certainly doesn't
> sound like everything is pre-allocated, as you describe.
> Is there an easy way to delineate what I am missing?

Good point! I forgot to mention that the temp buffer uses
user-controlled memory (malloc) instead of R-controlled (aka
transient) memory (Salloc). The reason I decided to use malloc()
is that it's *much* faster than Salloc(), at least on Linux (you
don't say what your OS is), but this might just be because Linux's
malloc() is cheating, i.e. it doesn't really allocate the memory
pages until the process actually tries to access them (lazy memory
allocation). So in the end the Linux kernel makes a lot of small
real allocations behind the scenes as the temp buffer is being
filled up with the AC tree under construction. That could explain
why top (you don't say how you monitor this) reports that the
memory used by your R process is increasing gradually.
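
If you want to see this lazy allocation effect in isolation, here
is a small self-contained C program (nothing to do with Biostrings
itself, just an illustration) that you can compile and watch in
top. It assumes a 64-bit Linux box with at least 4G of free
address space:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        size_t nbytes = (size_t) 4 * 1024 * 1024 * 1024;  /* 4G */
        /* Under lazy allocation malloc() returns right away... */
        char *buf = malloc(nbytes);

        if (buf == NULL) {
            perror("malloc");
            return 1;
        }
        printf("malloc() done -- RES in top should still be small\n");
        sleep(30);
        /* ...and RES only climbs towards 4G as the pages are
           actually touched (here one byte per 4K page). */
        for (size_t i = 0; i < nbytes; i += 4096)
            buf[i] = 1;
        printf("pages touched -- RES should now be close to 4G\n");
        sleep(30);
        free(buf);
        return 0;
    }

With the default overcommit policy, RES stays tiny during the
first sleep() and only approaches 4G during the second one, which
is exactly the kind of gradual growth you are seeing with PDict().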

From malloc()'s man page on my 64-bit openSUSE 10.3 system:

BUGS
        By default, Linux follows an optimistic memory allocation
        strategy.  This means that when malloc() returns non-NULL
        there is no guarantee that the memory really is available.
        This is a really bad bug.  In case it turns out that the
        system is out of memory, one or more processes will be
        killed by the infamous OOM killer.  In case Linux is
        employed under circumstances where it would be less
        desirable to suddenly lose some randomly picked processes,
        and moreover the kernel version is sufficiently recent,
        one can switch off this overcommitting behavior using a
        command like:

             # echo 2 > /proc/sys/vm/overcommit_memory

Can you try the above and see whether the R process is actually using
the 5G of mem from the very beginning or not? PDict() will also need
some extra G towards the end for copying the final AC tree back to
the R space (to an R integer vector that corresponds more or less to
the @actree@nodes slot of the PDict object) but I have to admit that
this doesn't really explain why you need 7 extra G for this. I'm not
sure what's going on...

In theory the total amount of memory you need is BPS + RS, where
BPS is the biggest possible size of the tree and RS its real size
(BPS >= RS).

Then the amount of memory used by PDict() should be something like
this:

   PDict() progression                              Memory in use
   -------------------                              -------------

   phase 1: build AC tree in temp buffer            BPS

   phase 2: copy the temp buf to the                BPS + RS
            @actree@nodes slot of the PDict object

   phase 3: after freeing the temp buffer           RS

That's for the AC tree only, but there is also the @dups slot in
the resulting PDict object, which can be big too: it all depends on
how many duplicated reads you have in your input dictionary.
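
If it helps to see the shape of this, here is a small
self-contained C sketch of the 3-phase allocation pattern above.
It is NOT the actual Biostrings code: build_ac_tree() and the
sizes are made-up placeholders, and a plain malloc() stands in
for the R integer vector that the real code allocates in phase 2:

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-in for the real preprocessing step: fills
       'buf' with AC tree nodes and returns RS, the number of ints
       actually used (RS <= BPS). */
    static size_t build_ac_tree(int *buf, size_t bps)
    {
        size_t rs = bps / 2;   /* made-up value, for illustration */
        memset(buf, 0, rs * sizeof(int));
        return rs;
    }

    int main(void)
    {
        size_t bps = 100 * 1000 * 1000;  /* biggest possible size */
        size_t rs;
        int *temp_buf, *nodes;

        /* phase 1: user-controlled (malloc'ed) temp buffer
           -> memory in use: BPS */
        temp_buf = malloc(bps * sizeof(int));
        if (temp_buf == NULL)
            return 1;
        rs = build_ac_tree(temp_buf, bps);

        /* phase 2: in the real code this would be the R integer
           vector behind the @actree@nodes slot; a plain malloc()
           stands in for it here -> memory in use: BPS + RS */
        nodes = malloc(rs * sizeof(int));
        if (nodes != NULL)
            memcpy(nodes, temp_buf, rs * sizeof(int));

        /* phase 3: free the temp buffer -> memory in use: RS */
        free(temp_buf);

        free(nodes);
        return 0;
    }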

Hope this helps,
H.


> 
> Perhaps 'conservative' is better than your 'optimal'.
> 
> On Jun 3, 2008, at 2:36 PM, hpages at fhcrc.org wrote:
>>


