[R] Hashing and environments
William Dunlap
wdunlap at tibco.com
Sat Nov 6 23:43:36 CET 2010
I would make an environment called wfreqsEnv
whose entry names are your words and whose entry
values are the information about the words. I find
it convenient to use [[ to make it appear to be
a list (instead of using exists(), assign(), and get()).
E.g., the following enters 100,000 words sampled from a
list of 17,576 and records their id numbers and the
number of times each is found in the sample.
> wfreqsEnv <- new.env(hash=TRUE, parent = emptyenv())
> words <- do.call("paste", c(list(sep=""), expand.grid(LETTERS,
+ letters, letters)))
# length(words) == 17576
> set.seed(1)
> samp <- sample(seq_along(words), size=100000, replace=TRUE)
> system.time(for(i in samp) {
+ word <- words[i]
+ if (is.null(wfreqsEnv[[word]])) { # new entry
+ wfreqsEnv[[word]] <- list(Count=1, EntryNo=i)
+ } else { # update existing entry
+ wfreqsEnv[[word]]$Count <- wfreqsEnv[[word]]$Count + 1
+ }
+})
user system elapsed
2.28 0.00 2.14
(The time, in seconds, is from an ancient Windows laptop, c. 2002.)
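For comparison, here is roughly how the [[ calls in the loop line up with exists(), assign(), and get(). This is only a small sketch: the environment name e and the key "abc" are made up for illustration and are not part of the session above.

```r
# Illustrative names only: 'e' and "abc" are not from the session above.
e <- new.env(hash = TRUE, parent = emptyenv())

# Reading a missing name with [[ gives NULL, which is why the loop can
# test is.null(wfreqsEnv[[word]]) instead of calling exists().
is.null(e[["abc"]])                          # TRUE
exists("abc", envir = e, inherits = FALSE)   # FALSE

# Assignment: these two lines store the same value under the same name.
e[["abc"]] <- list(Count = 1, EntryNo = 42L)
assign("abc", list(Count = 1, EntryNo = 42L), envir = e)

# Retrieval: both forms return the stored list.
identical(e[["abc"]], get("abc", envir = e, inherits = FALSE))   # TRUE

# Updating one component in place, as the loop does for Count.
e[["abc"]]$Count <- e[["abc"]]$Count + 1
e[["abc"]]$Count                             # 2
```

One difference to keep in mind: if a stored value could itself be NULL, is.null() cannot tell it apart from a missing entry, and exists() is then the safer test.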
Here is a small check that we are getting what we expect:
> words[14736]
[1] "Tuv"
> wfreqsEnv[["Tuv"]]
$Count
[1] 8
$EntryNo
[1] 14736
> sum(samp==14736)
[1] 8
If we do this with a non-hashed environment we get the same
answers but the elapsed time is now 34.81 seconds instead of
2.14. If you make wfreqsEnv a list instead of an environment
then that time is 74.12 seconds (and the answers are the same).
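One way to rerun that three-way comparison is to wrap the loop in a function that takes the container as an argument. This is a sketch under assumptions: countWords is a hypothetical helper, not something from the session above, and timings vary by machine and R version, so none are shown here.

```r
# Hypothetical helper (not from the original session): runs the counting
# loop against whatever container is passed in and returns the container.
countWords <- function(container, words, samp) {
    for (i in samp) {
        word <- words[i]
        if (is.null(container[[word]])) {        # new entry
            container[[word]] <- list(Count = 1, EntryNo = i)
        } else {                                 # update existing entry
            container[[word]]$Count <- container[[word]]$Count + 1
        }
    }
    container
}

words <- do.call("paste", c(list(sep = ""),
                            expand.grid(LETTERS, letters, letters)))
set.seed(1)
samp <- sample(seq_along(words), size = 100000, replace = TRUE)

# Same loop, three containers: hashed environment, unhashed environment, list.
tHashed <- system.time(
    hashed <- countWords(new.env(hash = TRUE, parent = emptyenv()), words, samp))
tUnhashed <- system.time(
    unhashed <- countWords(new.env(hash = FALSE, parent = emptyenv()), words, samp))
tList <- system.time(
    asList <- countWords(list(), words, samp))

# Elapsed seconds for each container.
c(hashed = tHashed[["elapsed"]],
  unhashed = tUnhashed[["elapsed"]],
  list = tList[["elapsed"]])

# The entries should agree across all three containers; check one word
# that is certain to be in the sample.
w <- words[samp[1]]
identical(hashed[[w]], unhashed[[w]])
identical(hashed[[w]], asList[[w]])
```

The helper returns the container because a list is copied when modified inside a function, whereas an environment is modified in place; returning it lets the same code serve both cases.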
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Levy, Roger
> Sent: Saturday, November 06, 2010 1:39 PM
> To: r-help at r-project.org
> Subject: [R] Hashing and environments
>
> Hi,
>
> I'm trying to write a general-purpose "lexicon" class and
> associated methods for storing and accessing information
> about large numbers of specific words (e.g., their
> frequencies in different genres). Crucial to making such a
> class practically useful is to get hashing working correctly
> so that information about specific words can be accessed
> quickly. But I've never really understood very well how
> hashing works, so I'm having trouble.
>
> Here is an example of what I've done so far:
>
> ***
>
> setClass("Lexicon",representation(e="environment"))
> setMethod("initialize","Lexicon",function(.Object,wfreqs) {
> .Object@e <- new.env(hash=T,parent=emptyenv())
> assign("wfreqs",wfreqs,envir=.Object@e)
> return(.Object)
> })
>
> ## function to access word frequencies
> wfreq <- function(lexicon,word) {
> return(get("wfreqs",envir=lexicon@e)[word])
> }
>
> ## example of use
> my.lexicon <- new("Lexicon",wfreqs=c("the"=2,"person"=1))
> wfreq(my.lexicon,"the")
>
> ***
>
> However, testing indicates that the way I have set this up
> does not achieve the intended benefits of having the
> environment hashed:
>
> ***
>
> sample.wfreqs <- trunc(runif(1e5,max=100))
> names(sample.wfreqs) <- as.character(1:length(sample.wfreqs))
> lex <- new("Lexicon",wfreqs=sample.wfreqs)
> words.to.lookup <- trunc(runif(100,min=1,max=1e5))
> ## look up the words directly from the sample.wfreqs vector
> system.time({
> for(i in words.to.lookup)
> sample.wfreqs[as.character(i)]
> },gcFirst=TRUE)
> ## look up the words through the wfreq() function; time
> ## approx the same
> system.time({
> for(i in words.to.lookup)
> wfreq(lex,as.character(i))
> },gcFirst=TRUE)
>
> ***
>
> I'm guessing that the problem is that the indexing of the
> wfreqs vector in my wfreq() function is not happening inside
> the actual lexicon's environment. However, I have not been
> able to figure out the proper call to get the lookup to
> happen inside the lexicon's environment. I've tried
>
> wfreq1 <- function(lexicon,word) {
> return(eval(wfreqs[word],envir=lexicon@e))
> }
>
> which I'd thought should work, but this gives me an error:
>
> > wfreq1(my.lexicon,'the')
> Error in eval(wfreqs[word], envir = lexicon@e) :
> object 'wfreqs' not found
>
> Any advice would be much appreciated!
>
> Best & many thanks in advance,
>
> Roger
>
> --
>
> Roger Levy Email: rlevy at ucsd.edu
> Assistant Professor Phone: 858-534-7219
> Department of Linguistics Fax: 858-534-4789
> UC San Diego Web: http://ling.ucsd.edu/~rlevy
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>