[Bioc-sig-seq] PDict question

hpages at fhcrc.org hpages at fhcrc.org
Tue Jun 3 20:36:32 CEST 2008


Hi Joseph,

You can run PDict() in debug mode by first calling:

   > Biostrings:::debug_ACtree_utils()

and then running your example again. You should see something like this:

   > NM_seq_pDict=PDict(NM_seq_clean)
   [DEBUG] alloc_actree_nodes_buf(): length=4817537 width=36 maxnodes=126030830
   [DEBUG] alloc_actree_nodes_buf(): allocating actree_nodes_buf (bufsize=4032986560) ... OK
   [DEBUG] CWdna_free_actree_nodes_buf(): freeing actree_nodes_buf ... OK

This indicates that PDict() needs to allocate a temporary buffer (the
actree_nodes_buf C variable) of about 4GB to build the Aho-Corasick
tree. This buffer has exactly the same size as an integer vector of
length 1008246640. Can you allocate such a vector? Try:

   > x <- integer(1008246640)

Given that you have 20GB of RAM, this should work, unless something
is wrong with your R installation...
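
To make the numbers above easier to check, here is a small sketch that
redoes the arithmetic from the debug output and wraps the allocation test
in a helper. The helper name can_allocate_integer() is made up for this
example and is not part of Biostrings. The sketch also prints the pointer
size of the running R: the sessionInfo() quoted below reports an i386
build, and on a 32-bit build a single block of about 4GB is very unlikely
to fit in the process address space, however much RAM is installed.

```r
# Figures copied from the debug output above.
maxnodes <- 126030830     # number of nodes the buffer must be able to hold
bufsize  <- 4032986560    # bytes requested by alloc_actree_nodes_buf()

bytes_per_node <- bufsize / maxnodes   # 32 bytes per node
int_length     <- bufsize / 4          # 1008246640 (an R integer is 4 bytes)

# 4 on a 32-bit build of R, 8 on a 64-bit build.
pointer_bytes <- .Machine$sizeof.pointer

# Hypothetical helper: TRUE if an integer vector of length n can be allocated.
can_allocate_integer <- function(n) {
  tryCatch({
    x <- integer(n)
    TRUE
  }, error = function(e) FALSE)
}

# Smoke test with a small length; pass int_length to reproduce the real case.
ok_small <- can_allocate_integer(1000)
```

If can_allocate_integer(int_length) returns FALSE while pointer_bytes is 4,
the limit is more likely the 32-bit address space than the amount of RAM.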

More about the "fixed-size temporary buffer" approach:

The size of this buffer is chosen so that it is guaranteed to be big
enough to store the entire Aho-Corasick tree without any reallocation.
The real size of the tree may in fact be smaller (sometimes much
smaller) than the size of the temporary buffer, but AFAICS there is no
easy way to know this in advance.
The real size of the tree (in bytes) can be obtained with:

   > length(NM_seq_pDict@actree) * 32

Note that the formula used to compute the size of the buffer depends only
on the length and width of the input dictionary, and that it is an optimal
a priori estimate, in the sense that some dictionaries of that length and
width do produce a tree that fills up the temp buffer entirely.
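
As an illustration, the sketch below computes a worst-case node count for
a 4-letter tree: at depth d there can be at most 4^d distinct prefixes,
and never more than one per pattern. This is an assumption about the
formula, not a copy of the Biostrings C code, and max_actree_nodes() is a
name made up for the example. It does reproduce the maxnodes and bufsize
values shown in the debug output for length=4817537 and width=36.

```r
# Assumed worst case, level by level: min(4^d, number of patterns) nodes
# at each depth d, from the root (d = 0) down to the pattern width.
max_actree_nodes <- function(n_patterns, width) {
  depths <- 0:width
  sum(pmin(4^depths, n_patterns))
}

maxnodes <- max_actree_nodes(4817537, 36)   # 126030830
bufsize  <- maxnodes * 32                   # 4032986560 bytes, 32 per node
```

The bound is reached by a dictionary whose patterns diverge as early as
possible, which is why the buffer can end up completely full.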

We chose to use a fixed-size temporary buffer for the construction
of the AC tree because we wanted to make PDict() as fast as possible,
at the cost of a higher memory requirement. The current approach
is not set in stone though, and we might change this in the future.
Maybe a better approach would be a compromise: choose a buffer size
that is 50% of the best a priori estimate and do one reallocation if
the temp buffer happens to be too small, in the hope that this will be
a rare situation with real-world data.
But more experience will be needed before we can choose the right ratio
(50% ? 25% ? 75% ?...)
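
A toy version of this compromise could look like the sketch below. It is
not the Biostrings implementation: it builds a plain prefix tree (no
Aho-Corasick failure links) in an integer matrix that stands in for the
node buffer, and it assumes a constant-width dictionary containing only
A, C, G and T. The function name build_trie() and its ratio argument are
made up for the example.

```r
# The buffer starts at ratio * (worst-case node count) and is grown at most
# once, straight to the worst-case size, if it turns out to be too small.
build_trie <- function(patterns, ratio = 0.5) {
  width <- nchar(patterns[1])
  bound <- sum(pmin(4^(0:width), length(patterns)))  # a priori worst case
  capacity <- max(1, ceiling(ratio * bound))
  nodes <- matrix(0L, nrow = capacity, ncol = 4,
                  dimnames = list(NULL, c("A", "C", "G", "T")))
  n_used <- 1L            # row 1 is the root node
  reallocations <- 0L
  for (p in patterns) {
    node <- 1L
    for (base in strsplit(p, "")[[1]]) {
      child <- nodes[node, base]
      if (child == 0L) {
        if (n_used == capacity) {
          # Buffer full: reallocate once and copy the existing nodes over.
          grown <- matrix(0L, nrow = bound, ncol = 4,
                          dimnames = dimnames(nodes))
          grown[seq_len(capacity), ] <- nodes
          nodes <- grown
          capacity <- bound
          reallocations <- reallocations + 1L
        }
        n_used <- n_used + 1L
        child <- n_used
        nodes[node, base] <- child
      }
      node <- child
    }
  }
  list(n_nodes = n_used, capacity = capacity, reallocations = reallocations)
}

half <- build_trie(c("ACGT", "ACGA", "TTTT"), ratio = 0.5)
full <- build_trie(c("ACGT", "ACGA", "TTTT"), ratio = 1)
```

Comparing the reallocations field across ratios on real dictionaries would
be one way to see how often the extra copy is actually paid.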

Cheers,
H.

Quoting "Joseph Dhahbi, P.h.D." <jdhahbi at chori.org>:

> Hello,
> I need help getting around the memory error reported below,
> especially since I cannot add any more RAM.
> Here is the Hardware Overview:
>   Model Name:	Mac Pro
>   Model Identifier:	MacPro1,1
>   Processor Name:	Dual-Core Intel Xeon
>   Processor Speed:	2.66 GHz
>   Number Of Processors:	2
>   Total Number Of Cores:	4
>   L2 Cache (per processor):	4 MB
>   Memory:	20 GB
>   Bus Speed:	1.33 GHz
>   Boot ROM Version:	MP11.005C.B08
>   SMC Version:	1.7f10
>   Serial Number:	G87052SGUPZ
>
>
>
>> NM_seq=readSolexaFastA(NM_fa)
>> NM_alf=alphabetFrequency(NM_seq, baseOnly=TRUE)
>> NM_seq_clean = NM_seq[NM_alf[,"other"]==0]
>> length(NM_seq)
> [1] 4820218
>> length(NM_seq_clean)
> [1] 4817537
>> NM_seq_clean
>   A DNAStringSet instance of length 4817537
>           width seq
>       [1]    36 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGAT
>       [2]    36 GTGGTAATTCATCAGATCTCGGATGGCATTGGTCAT
>       [3]    36 GGGAGGTCACTAATGGAGACACACAGAAATGTAACA
>       [4]    36 GGGATTGGTTTTTTGTTACTGATTTGTTTGAGTTCA
>       [5]    36 GTGGTAATTTTGACTTTTTAGGTTAATTTATTTTTT
>       [6]    36 GATCGGAAGGAGCTCGTATGCCGTCTTCTGCTTAGA
>       [7]    36 GGTCAGTTGTGTTCTCCTGAGTAGGTTGTGTGAATG
>       [8]    36 GGGAGGTCACTAATGGAGACACACAGAAATGTAACA
>       [9]    36 GGGAGGCTGAGGCAGGAGAATGGCATGAACCTAGAT
>       ...   ... ...
> [4817529]    36 TTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAG
> [4817530]    36 CATCAATGTATCTTAAGGCGTAAATTGTAAGCGTTA
> [4817531]    36 CGAGCAGCGACGCATCACCCAGCTAGATCGGAAGAG
> [4817532]    36 GCAATGCCACTGGCGCGACAACCGGGACACCATAGG
> [4817533]    36 CCTCGCCGGACACGCTGAACTTGTGGCCGTTTTCGT
> [4817534]    36 CCATTGTACAACGTATCGACATATCCTCCACCCGCC
> [4817535]    36 CCCCCTGAACCTGAAACATAAAATGAATGCAATTGT
> [4817536]    36 ACCATGTTGTCCAAGGGCGAATTCTGCAGATATCCA
> [4817537]    36 CAGGGGCCGGCGGCTGGCTAGGGCTGCAGCGTTAAA
>
>> NM_seq_pDict=PDict(NM_seq_clean)
> Error in .PDict(dict, names(dict), tb.start, tb.end, drop.head, drop.tail,  :
>   alloc_actree_nodes_buf(): failed to alloc actree_nodes_buf
> R(433,0xa000d000) malloc: *** vm_allocate(size=4032987136) failed (error code=3)
> R(433,0xa000d000) malloc: *** error: can't allocate region
> R(433,0xa000d000) malloc: *** set a breakpoint in szone_error to debug
>
>> sessionInfo()
> R version 2.7.0 (2008-04-22)
> i386-apple-darwin8.10.1
>
> locale:
> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] tools     stats     graphics  grDevices utils    datasets  methods   base
>
> other attached packages:
> [1] BiostringsCinterfaceDemo_0.1.2 Biostrings_2.8.9 Biobase_2.0.1
>
>
>
>
> Regards,
> Joseph
>
> Joseph M. Dhahbi, PhD
> Childrens Hospital Oakland Research Institute
> 5700 Martin Luther King Jr. Way
> Oakland, CA 94609
> USA
> Ph.(510)428-3885 EXT.5743
> Cell.(702)335-0795
> Fax (510)450-7910
> jdhahbi at chori.org


