[R] Finding unique elements faster

Stefan Evert stefanML at collocations.de
Mon Dec 8 23:16:23 CET 2014


On 8 Dec 2014, at 21:21, apeshifter <ch_koch at gmx.de> wrote:

> The last relic of the afore-mentioned for-loop that goes through all the
> word pairs and tries to calculate some statistics on them is the following
> line of code:
>> typefreq.after1[i]<-length(unique(word2[which(word1==word1[i])]))
> (where word1 and word2 are the first and second words within the two-word
> sequence (all.word.pairs, above))

It is difficult to tell without a fully reproducible example, but from this code I get the impression that word1 and word2 represent word pair _tokens_ rather than pair _types_ (otherwise you wouldn't need the unique()).  That's a very inefficient way of dealing with co-occurrence data, especially since you've already computed the set of pair types in order to get the co-occurrence counts.
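
For concreteness, here is a minimal sketch of the two representations, using invented toy data (not from the original post):

	## token vectors: one element per occurrence of a word pair
	word1 <- c("black", "black", "black", "red", "red")
	word2 <- c("cat",   "cat",   "dog",   "cat", "wine")

	## collapse the tokens into pair types with co-occurrence counts,
	## giving a data frame with one row per distinct (word1, word2) pair
	BB <- aggregate(list(f = word1), list(word1 = word1, word2 = word2), length)

The resulting data frame BB (one row per pair type, with its frequency in column f) is the form the tapply() calls below assume.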

If word1 and word2 are type vectors (i.e. every pair occurs just once, collected here in a data frame BB), then this should give you what you want:

	tapply(BB$word2, BB$word1, length)

If they are token vectors, you need to supply your own type-counting function, which will be a bit slower:

	tapply(BB$word2, BB$word1, function (x) length(unique(x)))

On my machine, this takes about 0.2s for 770,000 word pairs.
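
With the toy data above, both approaches agree (a sketch under the same invented data, not your actual corpus):

	## per-type counting on the collapsed data frame BB
	tapply(BB$word2, BB$word1, length)
	## black   red
	##     2     2

	## per-token counting on the raw vectors gives the same result
	counts <- tapply(word2, word1, function (x) length(unique(x)))

	## indexing by word1 maps the per-type counts back onto the tokens,
	## replacing the original loop over i in one vectorized step
	typefreq.after1 <- counts[word1]

Note that tapply() returns a vector named by the levels of the grouping variable, so counts[word1] lines each token up with the count for its first word.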


BTW, you might want to take a look at Unit 4 of the SIGIL course

	http://sigil.r-forge.r-project.org/

which has some tips on how you can deal efficiently with co-occurrence data in R.

Best,
Stefan
