[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Magnus Thor Torfason
zulutime.net at gmail.com
Fri Nov 1 14:32:46 CET 2013
Pretty much what the subject says:
I used an environment as the basis for a hash table in R, based on information
that environments are in fact implemented as hash tables under the hood.
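For concreteness, the pattern is roughly the following (a minimal sketch;
the key and value strings are illustrative):

```r
# A minimal sketch of using an environment as a hash table.
# new.env(hash = TRUE) creates an environment backed by a hash table;
# pre-sizing it via 'size' avoids some rehashing as entries accumulate.
ht <- new.env(hash = TRUE, size = 1000000L)

# Insert: bind a value to a string key.
assign("key1", "value1", envir = ht)

# Lookup: exists() to test membership, get() to retrieve.
# inherits = FALSE keeps the search from walking up parent environments.
if (exists("key1", envir = ht, inherits = FALSE)) {
  v <- get("key1", envir = ht, inherits = FALSE)
}
```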
I've been experimenting with doubling the number of entries, and so far
it has seemed to be scaling more or less linearly, as expected.
But as I went from 17 million entries to 34 million entries, the
completion time has gone from 18 hours, to 5 days and counting.
The keys and values are in all cases strings of equal length.
One might suspect that the slow-down is due to memory being swapped to
disk, but from what I know about my computing environment, that should
not be the case.
So my first question:
Is anyone familiar with anything in the implementation of environments
that would limit their use, or slow them down faster than O(n log n), as
the number of entries is increased?
And my second question:
I realize that this is not strictly what R environments were designed
for, but it is what my algorithm requires: I must go through these
millions of entries, storing them in the hash table and sometimes
retrieving them along the way, in a more or less random order that
depends on the data I am encountering and on the contents of the hash
table at each moment.
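The access pattern looks roughly like this (an illustrative sketch; the
real decision between retrieval and insertion depends on my data, and
the key/value generation here is made up):

```r
# Sketch of the access pattern: stream through entries, retrieving a
# key if it is already present, otherwise inserting it.
ht <- new.env(hash = TRUE, size = 1000000L)

# Equal-length string keys and values, as in my actual data.
keys   <- sprintf("k%08d", 1:1000)
values <- sprintf("v%08d", 1:1000)

for (i in seq_along(keys)) {
  k <- keys[i]
  if (exists(k, envir = ht, inherits = FALSE)) {
    # Key seen before: retrieve the stored value and use it.
    prev <- get(k, envir = ht, inherits = FALSE)
  } else {
    # New key: insert it into the table.
    assign(k, values[i], envir = ht)
  }
}
```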
Does anyone have a good recommendation for alternatives to implement
huge, fast, table-like structures in R?