[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Martin Morgan
mtmorgan at fhcrc.org
Fri Nov 1 23:12:07 CET 2013
On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:
> Sure,
>
> I was attempting to be concise, boiling it down to what I saw as the root
> issue, but you are right, I could have taken it a step further. So here goes.
>
> I have a set of around 20M string pairs. A given string (say, A) can
> either be equivalent to another string (B) or not. If A and B occur together in
> the same pair, they are equivalent. But equivalence is transitive, so if A and B
> occur together in one pair, and B and C occur together in another pair, then A
> and C are also equivalent. I need a way to quickly determine if any two strings
> from my data set are equivalent or not.
Do you mean that if A,B occur together and B,C occur together, then A, B, and
C are all equivalent?
Here's a function that returns a unique identifier for each equivalence class
(not well tested!), allowing for transitive relations but not circularity.
uid <- function(x, y)
{
    i <- seq_along(x)              # global index
    xy <- paste(x, y, sep = "\r")  # unique pair identifiers; the separator
                                   # avoids collisions such as ("1","23")
                                   # pasting to the same string as ("12","3")
    idx <- match(xy, xy)
    repeat {
        ## transitive look-up
        y_idx <- match(y[idx], x)  # look up 'y' in 'x'
        keep <- !is.na(y_idx)
        if (!any(keep))            # no transitive relations, done!
            break
        x[idx[keep]] <- x[y_idx[keep]]
        y[idx[keep]] <- y[y_idx[keep]]
        ## create new index of values
        xy <- paste(x, y, sep = "\r")
        idx <- match(xy, xy)
    }
    idx
}
Values with the same index are equivalent. Some tests:
> x <- c(1, 2, 3, 4)
> y <- c(2, 3, 5, 6)
> uid(x, y)
[1] 1 1 1 4
> i <- sample(x); uid(x[i], y[i])
[1] 1 1 3 1
> uid(as.character(x), as.character(y)) ## character() ok
[1] 1 1 1 4
> uid(1:10, 1 + 1:10)
[1] 1 1 1 1 1 1 1 1 1 1
> uid(integer(), integer())
integer(0)
> x <- c(1, 2, 3)
> y <- c(2, 3, 1)
> uid(x, y) ## circular!
C-c C-c
I think this will scale well enough. The number of iterations, currently
proportional to the longest chain, could be reduced to log(longest chain), and
copying could be reduced by using an index i and subsetting the original
vectors on each iteration. I think you could test for circularity by checking
whether the updated x values are all already among the kept x,

    all(x[y_idx[keep]] %in% x[keep])
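For example, a sketch (untested beyond the examples above) that uses that
membership check to stop with an error instead of spinning; the name uid2 and
the error message are just for illustration:

uid2 <- function(x, y)
{
    xy <- paste(x, y, sep = "\r")
    idx <- match(xy, xy)
    repeat {
        y_idx <- match(y[idx], x)
        keep <- !is.na(y_idx)
        if (!any(keep))
            break
        ## if every replacement value is already among the kept values,
        ## the chains can no longer shorten -- we are going in circles
        if (all(x[y_idx[keep]] %in% x[keep]))
            stop("circular relation detected")
        x[idx[keep]] <- x[y_idx[keep]]
        y[idx[keep]] <- y[y_idx[keep]]
        xy <- paste(x, y, sep = "\r")
        idx <- match(xy, xy)
    }
    idx
}

so that uid2(c(1, 2, 3), c(2, 3, 1)) signals an error rather than requiring
C-c C-c.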
Martin
>
> The way I do this currently is to designate the smallest (alphabetically) string
> in each known equivalence set as the "main" entry. For each pair, I therefore
> insert two entries into the hash table, both pointing at the main value. So
> assuming the input data:
>
> A,B
> B,C
> D,E
>
> I would then have:
>
> A->A
> B->A
> C->B
> D->D
> E->D
>
> Except that I also follow each chain until I reach the end (key==value), and
> insert pointers to the "main" value for every value I find along the way. After
> doing that, I end up with:
>
> A->A
> B->A
> C->A
> D->D
> E->D
>
> And I can very quickly check equivalence, either by comparing the hash of two
> strings, or simply by transforming each string into its hash, and then I can use
> simple comparison from then on. The code for generating the final hash table is
> as follows:
>
> h : Empty hash table created with hash.new()
> d : Input data
> hash.deep.get : Function that iterates through the hash table until it finds a
> key whose value is equal to itself (until hash.get(X)==X), then returns all the
> values in a vector
>
>
> h = hash.new()
> for ( i in 1:nrow(d) )
> {
>     deep.a = hash.deep.get(h, d$a[i])
>     deep.b = hash.deep.get(h, d$b[i])
>     equivalents = sort(unique(c(deep.a, deep.b)))
>     equiv.id = min(equivalents)
>     for ( equivalent in equivalents )
>     {
>         hash.put(h, equivalent, equiv.id)
>     }
> }
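>
> For reference, a minimal sketch of these helpers (one possible
> implementation, on top of an environment used as a hash table; an absent
> key is treated as mapping to itself, so a new string starts out as its own
> "main" entry):
>
> hash.new = function() new.env(hash=TRUE, parent=emptyenv())
> hash.put = function(h, key, value) assign(key, value, envir=h)
> hash.get = function(h, key)
>     if ( exists(key, envir=h, inherits=FALSE) ) get(key, envir=h) else key
> hash.deep.get = function(h, key)
> {
>     values = character()
>     repeat
>     {
>         values = c(values, key)
>         value = hash.get(h, key)
>         if ( value == key )  # reached the end of the chain
>             return(values)
>         key = value
>     }
> }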
>
>
> I would so much appreciate it if there were a simpler and faster way to do this.
> Keeping my fingers crossed that one of the R-help geniuses who sees this is
> sufficiently interested to crack the problem.
>
> Best,
> Magnus
>
> On 11/1/2013 1:49 PM, jim holtman wrote:
>> It would be nice if you followed the posting guidelines and at least
>> showed the script that is creating your entries, so that we
>> understand the problem you are trying to solve. A bit more
>> explanation of why you want this would be useful. This gets to the
>> second part of my tag line: Tell me what you want to do, not how you
>> want to do it. There may be other solutions to your problem.
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>>
>> On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
>> <zulutime.net at gmail.com> wrote:
>>> Pretty much what the subject says:
>>>
>>> I used an env as the basis for a hash table in R, since my understanding is
>>> that environments are in fact implemented as hash tables under the hood.
>>>
>>> I've been experimenting with doubling the number of entries, and so far it
>>> has seemed to be scaling more or less linearly, as expected.
>>>
>>> But as I went from 17 million entries to 34 million entries, the completion
>>> time has gone from 18 hours to 5 days and counting.
>>>
>>>
>>> The keys and values are in all cases strings of equal length.
>>>
>>> One might suspect that the slow-down has to do with memory being swapped
>>> to disk, but from what I know about my computing environment, that should
>>> not be the case.
>>>
>>> So my first question:
>>> Is anyone familiar with anything in the implementation of environments that
>>> would limit their use or slow them down (faster than O(n log n)) as the
>>> number of entries is increased?
>>>
>>> And my second question:
>>> I realize that this is not strictly what R environments were designed for,
>>> but it is what my algorithm requires: I must go through these millions of
>>> entries, storing them in the hash table and sometimes retrieving them along
>>> the way, in a more or less random order that is contingent on the data I am
>>> encountering and on the contents of the hash table at each moment.
>>>
>>> Does anyone have a good recommendation for alternatives to implement huge,
>>> fast, table-like structures in R?
>>>
>>> Best,
>>> Magnus
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793