[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Martin Morgan
mtmorgan at fhcrc.org
Fri Nov 1 23:12:07 CET 2013
On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:
> Sure,
>
> I was attempting to be concise, boiling it down to what I saw as the root
> issue, but you are right, I could have taken it a step further. So here goes.
>
> I have a set of around 20M string pairs. A given string (say, A) can
> either be equivalent to another string (B) or not. If A and B occur together in
> the same pair, they are equivalent. But equivalence is transitive, so if A and B
> occur together in one pair, and B and C occur together in another pair, then A
> and C are also equivalent. I need a way to quickly determine if any two strings
> from my data set are equivalent or not.
Do you mean that if A,B occur together and B,C occur together, then A, B, and
C are all equivalent?
Here's a function that returns a unique identifier for each equivalence class
(not well tested!), allowing for transitive relations but not circularity.
uid <- function(x, y)
{
    i <- seq_along(x)              # global index
    xy <- paste(x, y, sep = "\r")  # unique pair identifiers; the separator
                                   # avoids collisions such as ("1","23")
                                   # pasting to the same string as ("12","3")
    idx <- match(xy, xy)
    repeat {
        ## transitive look-up
        y_idx <- match(y[idx], x)  # look up 'y' in 'x'
        keep <- !is.na(y_idx)
        if (!any(keep))            # no transitive relations, done!
            break
        x[idx[keep]] <- x[y_idx[keep]]
        y[idx[keep]] <- y[y_idx[keep]]
        ## create new index of values
        xy <- paste(x, y, sep = "\r")
        idx <- match(xy, xy)
    }
    idx
}
Values with the same index are equivalent. Some tests:
> x <- c(1, 2, 3, 4)
> y <- c(2, 3, 5, 6)
> uid(x, y)
[1] 1 1 1 4
> i <- sample(x); uid(x[i], y[i])
[1] 1 1 3 1
> uid(as.character(x), as.character(y)) ## character() ok
[1] 1 1 1 4
> uid(1:10, 1 + 1:10)
[1] 1 1 1 1 1 1 1 1 1 1
> uid(integer(), integer())
integer(0)
> x <- c(1, 2, 3)
> y <- c(2, 3, 1)
> uid(x, y) ## circular!
C-c C-c
I think this will scale well enough. The number of iterations, currently
proportional to the longest chain, could be reduced to log(longest chain), and
copying could be reduced by using an index i and subsetting the original
vectors on each iteration. I think you could test for circularity by checking
whether the updated x values are all already among the kept x,

    all(x[y_idx[keep]] %in% x[keep])
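For example, a sketch (untested beyond the examples above) that uses that
membership check to stop with an error instead of spinning; the name uid2 and
the error message are just for illustration:

uid2 <- function(x, y)
{
    xy <- paste(x, y, sep = "\r")
    idx <- match(xy, xy)
    repeat {
        y_idx <- match(y[idx], x)
        keep <- !is.na(y_idx)
        if (!any(keep))
            break
        ## if every replacement value is already among the kept values,
        ## the chains can no longer shorten -- we are going in circles
        if (all(x[y_idx[keep]] %in% x[keep]))
            stop("circular relation detected")
        x[idx[keep]] <- x[y_idx[keep]]
        y[idx[keep]] <- y[y_idx[keep]]
        xy <- paste(x, y, sep = "\r")
        idx <- match(xy, xy)
    }
    idx
}

so that uid2(c(1, 2, 3), c(2, 3, 1)) signals an error rather than requiring
C-c C-c.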
Martin
>
> The way I do this currently is to designate the smallest (alphabetically) string
> in each known equivalence set as the "main" entry. For each pair, I therefore
> insert two entries into the hash table, both pointing at the main value. So
> assuming the input data:
>
> A,B
> B,C
> D,E
>
> I would then have:
>
> A->A
> B->A
> C->B
> D->D
> E->D
>
> Except that I also follow each chain until I reach the end (key==value), and
> insert pointers to the "main" value for every value I find along the way. After
> doing that, I end up with:
>
> A->A
> B->A
> C->A
> D->D
> E->D
>
> And I can very quickly check equivalence, either by comparing the hash of two
> strings, or simply by transforming each string into its hash, and then I can use
> simple comparison from then on. The code for generating the final hash table is
> as follows:
>
> h : Empty hash table created with hash.new()
> d : Input data
> hash.deep.get : Function that iterates through the hash table until it finds a
> key whose value is equal to itself (until hash.get(X)==X), then returns all the
> values in a vector
>
>
> h = hash.new()
> for ( i in 1:nrow(d) )
> {
>     deep.a = hash.deep.get(h, d$a[i])
>     deep.b = hash.deep.get(h, d$b[i])
>     equivalents = sort(unique(c(deep.a, deep.b)))
>     equiv.id = min(equivalents)
>     for ( equivalent in equivalents )
>     {
>         hash.put(h, equivalent, equiv.id)
>     }
> }
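>
> For reference, a minimal sketch of these helpers (one possible
> implementation, on top of an environment used as a hash table; an absent
> key is treated as mapping to itself, so a new string starts out as its own
> "main" entry):
>
> hash.new = function() new.env(hash=TRUE, parent=emptyenv())
> hash.put = function(h, key, value) assign(key, value, envir=h)
> hash.get = function(h, key)
>     if ( exists(key, envir=h, inherits=FALSE) ) get(key, envir=h) else key
> hash.deep.get = function(h, key)
> {
>     values = character()
>     repeat
>     {
>         values = c(values, key)
>         value = hash.get(h, key)
>         if ( value == key )  # reached the end of the chain
>             return(values)
>         key = value
>     }
> }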
>
>
> I would so much appreciate it if there were a simpler and faster way to do this.
> Keeping my fingers crossed that one of the R-help geniuses who sees this is
> sufficiently interested to crack the problem.
>
> Best,
> Magnus
>
> On 11/1/2013 1:49 PM, jim holtman wrote:
>> It would be nice if you followed the posting guidelines and at least
>> showed the script that is creating your entries, so that we
>> understand the problem you are trying to solve. A bit more
>> explanation of why you want this would be useful. This gets to the
>> second part of my tag line: Tell me what you want to do, not how you
>> want to do it. There may be other solutions to your problem.
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>>
>> On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
>> <zulutime.net at gmail.com> wrote:
>>> Pretty much what the subject says:
>>>
>>> I used an env as the basis for a hash table in R, since my understanding is
>>> that environments are in fact implemented as hash tables under the hood.
>>>
>>> I've been experimenting with doubling the number of entries, and so far it
>>> has seemed to be scaling more or less linearly, as expected.
>>>
>>> But as I went from 17 million entries to 34 million entries, the completion
>>> time has gone from 18 hours to 5 days and counting.
>>>
>>>
>>> The keys and values are in all cases strings of equal length.
>>>
>>> One might suspect that the slow-down has to do with memory being swapped
>>> to disk, but from what I know about my computing environment, that should
>>> not be the case.
>>>
>>> So my first question:
>>> Is anyone familiar with anything in the implementation of environments that
>>> would limit their use or slow them down (faster than O(n log n)) as the
>>> number of entries is increased?
>>>
>>> And my second question:
>>> I realize that this is not strictly what R environments were designed for,
>>> but it is what my algorithm requires: I must go through these millions of
>>> entries, storing them in the hash table and sometimes retrieving them along
>>> the way, in a more or less random order that is contingent on the data I am
>>> encountering and on the contents of the hash table at each moment.
>>>
>>> Does anyone have a good recommendation for alternatives to implement huge,
>>> fast, table-like structures in R?
>>>
>>> Best,
>>> Magnus
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793