[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

Mon Nov 4 14:05:00 CET 2013

Yes, I'm pretty familiar with igraph, but had not thought of using it 
for this. Interesting idea.

So I presume I'd do something like (pseudocode):

:Create igraph object from my data set as an edgelist
:Create a list of connected subgraphs using clusters()
:Loop through that list of clusters to create final data set

That might work ...

Thanks!

Magnus

On 11/1/2013 4:52 PM, William Dunlap wrote:
> Have you looked into the 'igraph' package?
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
>> Of Magnus Thor Torfason
>> Sent: Friday, November 01, 2013 8:23 AM
>> To: r-help at r-project.org
>> Subject: Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+
>> days
>>
>> Sure,
>>
>> I was attempting to be concise and boiling it down to what I saw as the
>> root issue, but you are right, I could have taken it a step further. So
>> here goes.
>>
>> I have a set of around around 20M string pairs. A given string (say, A)
>> can either be equivalent to another string (B) or not. If A and B occur
>> together in the same pair, they are equivalent. But equivalence is
>> transitive, so if A and B occur together in one pair, and A and C occur
>> together in another pair, then A and C are also equivalent. I need a way
>> to quickly determine if any two strings from my data set are equivalent
>> or not.
>>
>> The way I do this currently is to designate the smallest
>> (alphabetically) string in each known equivalence set as the "main"
>> entry. For each pair, I therefore insert two entries into the hash
>> table, both pointing at the mail value. So assuming the input data:
>>
>> A,B
>> B,C
>> D,E
>>
>> I would then have:
>>
>> A->A
>> B->A
>> C->B
>> D->D
>> E->D
>>
>> Except that I also follow each chain until I reach the end (key==value),
>> and insert pointers to the "main" value for every value I find along the
>> way. After doing that, I end up with:
>>
>> A->A
>> B->A
>> C->A
>> D->D
>> E->D
>>
>> And I can very quickly check equivalence, either by comparing the hash
>> of two strings, or simply by transforming each string into its hash, and
>> then I can use simple comparison from then on. The code for generating
>> the final hash table is as follows:
>>
>> h : Empty hash table created with hash.new()
>> d : Input data
>> hash.deep.get : Function that iterates through the hash table until it
>> finds a key whose value is equal to itself (until hash.get(X)==X), then
>> returns all the values in a vector
>>
>>
>> h = hash.new()
>> for ( i in 1:nrow(d) )
>> {
>>       deep.a      = hash.deep.get(h, d$a[i])
>>       deep.b      = hash.deep.get(h, d$b[i])
>>       equivalents = sort(unique(c(deep.a,deep.b)))
>>       equiv.id    = min(equivalents)
>>       for ( equivalent in equivalents )
>>       {
>>           hash.put(h, equivalent, equiv.id)
>>       }
>> }
>>
>>
>> I would so much appreciate if there was a simpler and faster way to do
>> this. Keeping my fingers crossed that one of the R-help geniuses who
>> sees this is sufficiently interested to crack the problem
>>
>> Best,
>> Magnus
>>
>> On 11/1/2013 1:49 PM, jim holtman wrote:
>>> It would be nice if you followed the posting guidelines and at least
>>> showed the script that was creating your entries now so that we
>>> understand the problem you are trying to solve.  A bit more
>>> explanation of why you want this would be useful.  This gets to the
>>> second part of my tag line:  Tell me what you want to do, not how you
>>> want to do it.  There may be other solutions to your problem.
>>>
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>>>
>>>
>>> On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
>>> <zulutime.net at gmail.com> wrote:
>>>> Pretty much what the subject says:
>>>>
>>>> I used an env as the basis for a Hashtable in R, based on information that
>>>> this is in fact the way environments are implemented under the hood.
>>>>
>>>> I've been experimenting with doubling the number of entries, and so far it
>>>> has seemed to be scaling more or less linearly, as expected.
>>>>
>>>> But as I went from 17 million entries to 34 million entries, the completion
>>>> time has gone from 18 hours, to 5 days and counting.
>>>>
>>>>
>>>> The keys and values are in all cases strings of equal length.
>>>>
>>>> One might suspect that the slow-down might have to do with the memory being
>>>> swapped to disk, but from what I know about my computing environment, that
>>>> should not be the case.
>>>>
>>>> So my first question:
>>>> Is anyone familiar with anything in the implementation of environments that
>>>> would limit their use or slow them down (faster than O(nlog(n)) as the
>>>> number of entries is increased?
>>>>
>>>> And my second question:
>>>> I realize that this is not strictly what R environments were designed for,
>>>> but this is what my algorithm requires: I must go through these millions of
>>>> entries, storing them in the hash table and sometimes retrieving them along
>>>> the way, in a more or less random manner, which is contingent on the data I
>>>> am encountering, and on the contents of the hash table at each moment.
>>>>
>>>> Does anyone have a good recommendation for alternatives to implement huge,
>>>> fast, table-like structures in R?
>>>>
>>>> Best,
>>>> Magnus
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> .
>