[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

Magnus Thor Torfason zulutime.net at gmail.com
Mon Nov 4 13:59:15 CET 2013



On 11/1/2013 10:12 PM, Martin Morgan wrote:
> Do you mean that if A,B occur together and B,C occur together, then A,B
> and A,C are equivalent?

Yes, that's what I meant, sorry, typo.

I like your uid() function. It avoids the 20M-iteration loop, and the
issue of circular references can be solved by ensuring that y is always
(weakly) smaller than x.
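
To make that concrete, here is a minimal sketch of the pair orientation
I have in mind (the helper name is mine, untested at scale):

    ## Hypothetical helper: orient every pair so that y <= x,
    ## swapping elements wherever the pair points the "wrong" way.
    normalize_pairs <- function(x, y) {
        swap <- y > x
        tmp <- x[swap]
        x[swap] <- y[swap]
        y[swap] <- tmp
        list(x = x, y = y)
    }

With the pairs oriented this way, the circular example x <- c(1, 2, 3),
y <- c(2, 3, 1) becomes (2,1), (3,2), (3,1), so the chains can only run
"downwards".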

I can see how the log(longest chain) would be the limiting case, since 
each round should always halve the length of the longest chain.

I have yet to test whether that intuition (about running time) works in 
practice.

I did test the program on a different data set, and it does seem that it 
fails to detect equivalence on the following input:

   a b
1 B A
2 C B
3 O G
4 O M
5 Y C
6 Y M

Everything should be equivalent here, but

 > uid(d$a, d$b)
[1] 1 1 3 4 1 6

I'm looking to see what can be done about this. The problem seems to be
with entries of the form:

O->G
O->M

These imply that all of O/M/G are equivalent, but they are not detected
as such. I will consider whether there is a good way around this.
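
One way around it might be to treat the pairs simply as edges of an
undirected graph and take connected components, for example with the
igraph package (a quick sketch, not tested on the full data):

    library(igraph)

    d <- data.frame(a = c("B", "C", "O", "O", "Y", "Y"),
                    b = c("A", "B", "G", "M", "C", "M"),
                    stringsAsFactors = FALSE)

    g <- graph_from_data_frame(d, directed = FALSE)
    comp <- components(g)

    ## comp$membership maps each value to its equivalence class id;
    ## for this input all of A, B, C, G, M, O, Y end up in one component
    split(names(comp$membership), comp$membership)

I have not checked how this scales to tens of millions of pairs, though.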

Best,
Magnus

> Here's a function that returns a unique identifier (not well tested!),
> allowing for transitive relations but not circularity.
>
>       uid <- function(x, y)
>      {
>          i <- seq_along(x)                   # global index
>          xy <- paste0(x, y)                  # make unique identifiers
>          idx <- match(xy, xy)
>
>          repeat {
>              ## transitive look-up
>              y_idx <- match(y[idx], x)       # look up 'y' in 'x'
>              keep <- !is.na(y_idx)
>              if (!any(keep))                 # no transitive relations, done!
>                  break
>              x[idx[keep]] <- x[y_idx[keep]]
>              y[idx[keep]] <- y[y_idx[keep]]
>
>              ## create new index of values
>              xy <- paste0(x, y)
>              idx <- match(xy, xy)
>          }
>          idx
>      }
>
> Values with the same index are identical. Some tests
>
>      > x <- c(1, 2, 3, 4)
>      > y <- c(2, 3, 5, 6)
>      > uid(x, y)
>      [1] 1 1 1 4
>      > i <- sample(x); uid(x[i], y[i])
>      [1] 1 1 3 1
>      > uid(as.character(x), as.character(y))  ## character() ok
>      [1] 1 1 1 4
>      > uid(1:10, 1 + 1:10)
>       [1] 1 1 1 1 1 1 1 1 1 1
>      > uid(integer(), integer())
>      integer(0)
>      > x <- c(1, 2, 3)
>      > y <- c(2, 3, 1)
>      > uid(x, y)                              ## circular!
>        C-c C-c
>
> I think this will scale well enough, but the worst-case number of
> iterations can be made to be log(longest chain), and copying can be
> reduced by using an index i and subsetting the original vector on each
> iteration. I think you could test for circularity by checking that the
> updated x are not a permutation of the kept x, e.g.
> all(x[y_idx[keep]] %in% x[keep]).
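
For my own notes, here is where such a guard might slot into the repeat
loop of uid() (my placement, untested):

    ## Inside uid(), right after computing 'keep' and the
    ## 'if (!any(keep)) break' test: if every updated x value is
    ## already among the kept x values, the kept entries form a cycle.
    if (all(x[y_idx[keep]] %in% x[keep]))
        stop("circular relation detected")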
>
> Martin
>
>>
>> The way I do this currently is to designate the smallest (alphabetically)
>> string in each known equivalence set as the "main" entry. For each pair,
>> I therefore insert two entries into the hash table, both pointing at the
>> main value. So assuming the input data:
>>
>> A,B
>> B,C
>> D,E
>>
>> I would then have:
>>
>> A->A
>> B->A
>> C->B
>> D->D
>> E->D
>>
>> Except that I also follow each chain until I reach the end
>> (key==value), and
>> insert pointers to the "main" value for every value I find along the
>> way. After
>> doing that, I end up with:
>>
>> A->A
>> B->A
>> C->A
>> D->D
>> E->D
>>
>> And I can very quickly check equivalence, either by comparing the hash
>> of two
>> strings, or simply by transforming each string into its hash, and then
>> I can use
>> simple comparison from then on. The code for generating the final hash
>> table is
>> as follows:
>>
>> h : Empty hash table created with hash.new()
>> d : Input data
>> hash.deep.get : Function that iterates through the hash table until it
>>     finds a key whose value is equal to itself (until hash.get(X)==X),
>>     then returns all the values in a vector
>>
>>
>> h = hash.new()
>> for ( i in 1:nrow(d) )
>> {
>>      deep.a      = hash.deep.get(h, d$a[i])
>>      deep.b      = hash.deep.get(h, d$b[i])
>>      equivalents = sort(unique(c(deep.a,deep.b)))
>>      equiv.id    = min(equivalents)
>>      for ( equivalent in equivalents )
>>      {
>>          hash.put(h, equivalent, equiv.id)
>>      }
>> }
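
In case it helps to follow the pseudo-code above, here is a minimal
sketch of what such helpers could look like on top of a plain
environment; this is only an illustration of the description above, not
the exact code:

    hash.new <- function()
        new.env(hash = TRUE, parent = emptyenv())

    hash.put <- function(h, key, value)
        assign(as.character(key), value, envir = h)

    hash.get <- function(h, key) {
        key <- as.character(key)
        if (exists(key, envir = h, inherits = FALSE))
            get(key, envir = h, inherits = FALSE)
        else
            key                     # assume unknown keys map to themselves
    }

    hash.deep.get <- function(h, key) {
        seen <- character(0)
        repeat {                    # follow the chain until key == value
            seen <- c(seen, as.character(key))
            nxt <- hash.get(h, key)
            if (identical(nxt, as.character(key)))
                return(unique(seen))
            key <- nxt
        }
    }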
>>
>>
>> I would very much appreciate it if there were a simpler and faster way
>> to do this. Keeping my fingers crossed that one of the R-help geniuses
>> who sees this is sufficiently interested to crack the problem.
>>
>> Best,
>> Magnus
>>
>> On 11/1/2013 1:49 PM, jim holtman wrote:
>>> It would be nice if you followed the posting guidelines and at least
>>> showed the script that was creating your entries now so that we
>>> understand the problem you are trying to solve.  A bit more
>>> explanation of why you want this would be useful.  This gets to the
>>> second part of my tag line:  Tell me what you want to do, not how you
>>> want to do it.  There may be other solutions to your problem.
>>>
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>>>
>>>
>>> On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
>>> <zulutime.net at gmail.com> wrote:
>>>> Pretty much what the subject says:
>>>>
>>>> I used an env as the basis for a Hashtable in R, based on
>>>> information that
>>>> this is in fact the way environments are implemented under the hood.
>>>>
>>>> I've been experimenting with doubling the number of entries, and so
>>>> far it
>>>> has seemed to be scaling more or less linearly, as expected.
>>>>
>>>> But as I went from 17 million entries to 34 million entries, the
>>>> completion
>>>> time has gone from 18 hours, to 5 days and counting.
>>>>
>>>>
>>>> The keys and values are in all cases strings of equal length.
>>>>
>>>> One might suspect that the slow-down might have to do with the
>>>> memory being
>>>> swapped to disk, but from what I know about my computing
>>>> environment, that
>>>> should not be the case.
>>>>
>>>> So my first question:
>>>> Is anyone familiar with anything in the implementation of environments
>>>> that would limit their use or slow them down (faster than O(n log(n)))
>>>> as the number of entries is increased?
>>>>
>>>> And my second question:
>>>> I realize that this is not strictly what R environments were
>>>> designed for,
>>>> but this is what my algorithm requires: I must go through these
>>>> millions of
>>>> entries, storing them in the hash table and sometimes retrieving
>>>> them along
>>>> the way, in a more or less random manner, which is contingent on the
>>>> data I
>>>> am encountering, and on the contents of the hash table at each moment.
>>>>
>>>> Does anyone have a good recommendation for alternatives to implement
>>>> huge,
>>>> fast, table-like structures in R?
>>>>
>>>> Best,
>>>> Magnus
>>>>
>>>
>>
>
>


