[R] Hashing and environments

Sun Nov 7 05:08:04 CET 2010

Wow, that is perfect: the hash package is exactly what I needed.  Thank you!

Roger

On Nov 6, 2010, at 4:09 PM, Kjetil Halvorsen wrote:

> Some of this can be automated using the CRAN package
> hash.
> 
> Kjetil
> 
> On Sat, Nov 6, 2010 at 10:43 PM, William Dunlap <wdunlap at tibco.com> wrote:
>> I would make an environment called wfreqsEnv
>> whose entry names are your words and whose entry
>> values are the information about the words.  I find
>> it convenient to use [[ to make it appear to be
>> a list (instead of using exists(), assign(), and get()).
>> E.g., the following enters 100,000 words sampled from a
>> list of 17,576 and records their id numbers and the
>> number of times each is found in the sample.
>> 
>>> wfreqsEnv <- new.env(hash=TRUE, parent = emptyenv())
>>> words <- do.call("paste", c(list(sep=""), expand.grid(LETTERS,
>> letters, letters)))
>> # length(words) == 17576
>>> set.seed(1)
>>> samp <- sample(seq_along(words), size=100000, replace=TRUE)
>>> system.time(for(i in samp) {
>> +    word <- words[i]
>> +    if (is.null(wfreqsEnv[[word]])) { # new entry
>> +        wfreqsEnv[[word]] <- list(Count=1, EntryNo=i)
>> +    } else { # update existing entry
>> +        wfreqsEnv[[word]]$Count <- wfreqsEnv[[word]]$Count + 1
>> +    }
>> +})
>>   user  system elapsed
>>   2.28    0.00    2.14
>> (The time, in seconds, is from an ancient Windows laptop, c. 2002.)
>> 
>> Here is a small check that we are getting what we expect:
>>> words[14736]
>> [1] "Tuv"
>>> wfreqsEnv[["Tuv"]]
>> $Count
>> [1] 8
>> 
>> $EntryNo
>> [1] 14736
>> 
>>> sum(samp==14736)
>> [1] 8
>> 
>> If we do this with a non-hashed environment we get the same
>> answers but the elapsed time is now 34.81 seconds instead of
>> 2.14.  If you make wfreqsEnv a list instead of an environment
>> then that time is 74.12 seconds (and the answers are the same).
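The comparison above can be reproduced on a small scale with the sketch below. It is only an illustration under stated assumptions: count_words() and the four-word vocabulary are hypothetical names made up for this example, not part of the code quoted above. The hash argument changes only how the environment looks up names, so both variants should produce identical counts; no timings are claimed here.

```r
# count_words() is a hypothetical helper for illustration only.
# It tallies tokens in an environment, one variable per distinct word.
count_words <- function(tokens, hash = TRUE) {
    env <- new.env(hash = hash, parent = emptyenv())
    for (tok in tokens) {
        old <- env[[tok]]                      # NULL when the word is not stored yet
        env[[tok]] <- if (is.null(old)) 1L else old + 1L
    }
    env
}

set.seed(1)
vocab  <- c("the", "person", "lexicon", "hash")   # made-up vocabulary
tokens <- sample(vocab, size = 1000, replace = TRUE)

hashed   <- count_words(tokens, hash = TRUE)
unhashed <- count_words(tokens, hash = FALSE)

# Same check as in the quoted message: compare against a direct count.
hashed[["the"]] == sum(tokens == "the")
```

To time the two variants, wrap each count_words() call in system.time() and use a much larger vocabulary; with only four distinct words the difference will not be visible.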
>> 
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>> 
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org
>>> [mailto:r-help-bounces at r-project.org] On Behalf Of Levy, Roger
>>> Sent: Saturday, November 06, 2010 1:39 PM
>>> To: r-help at r-project.org
>>> Subject: [R] Hashing and environments
>>> 
>>> Hi,
>>> 
>>> I'm trying to write a general-purpose "lexicon" class and
>>> associated methods for storing and accessing information
>>> about large numbers of specific words (e.g., their
>>> frequencies in different genres).  Crucial to making such a
>>> class practically useful is to get hashing working correctly
>>> so that information about specific words can be accessed
>>> quickly.  But I've never really understood very well how
>>> hashing works, so I'm having trouble.
>>> 
>>> Here is an example of what I've done so far:
>>> 
>>> ***
>>> 
>>> setClass("Lexicon",representation(e="environment"))
>>> setMethod("initialize","Lexicon",function(.Object,wfreqs) {
>>>       .Object@e <- new.env(hash=TRUE,parent=emptyenv())
>>>       assign("wfreqs",wfreqs,envir=.Object@e)
>>>       return(.Object)
>>>       })
>>> 
>>> ## function to access word frequencies
>>> wfreq <- function(lexicon,word) {
>>>       return(get("wfreqs",envir=lexicon@e)[word])
>>> }
>>> 
>>> ## example of use
>>> my.lexicon <- new("Lexicon",wfreqs=c("the"=2,"person"=1))
>>> wfreq(my.lexicon,"the")
>>> 
>>> ***
>>> 
>>> However, testing indicates that the way I have set this up
>>> does not achieve the intended benefits of having the
>>> environment hashed:
>>> 
>>> ***
>>> 
>>> sample.wfreqs <- trunc(runif(1e5,max=100))
>>> names(sample.wfreqs) <- as.character(1:length(sample.wfreqs))
>>> lex <- new("Lexicon",wfreqs=sample.wfreqs)
>>> words.to.lookup <- trunc(runif(100,min=1,max=1e5))
>>> ## look up the words directly from the sample.wfreqs vector
>>> system.time({
>>>       for(i in words.to.lookup)
>>>               sample.wfreqs[as.character(i)]
>>>       },gcFirst=TRUE)
>>> ## look up the words through the wfreq() function; time
>>> ## is approximately the same
>>> system.time({
>>>       for(i in words.to.lookup)
>>>               wfreq(lex,as.character(i))
>>>       },gcFirst=TRUE)
>>> 
>>> ***
>>> 
>>> I'm guessing that the problem is that the indexing of the
>>> wfreqs vector in my wfreq() function is not happening inside
>>> the actual lexicon's environment.  However, I have not been
>>> able to figure out the proper call to get the lookup to
>>> happen inside the lexicon's environment.  I've tried
>>> 
>>> wfreq1 <- function(lexicon,word) {
>>>       return(eval(wfreqs[word],envir=lexicon@e))
>>> }
>>> 
>>> which I'd thought should work, but this gives me an error:
>>> 
>>>> wfreq1(my.lexicon,'the')
>>> Error in eval(wfreqs[word], envir = lexicon@e) :
>>>   object 'wfreqs' not found
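Two things seem to be going on here. First, eval() evaluates its first argument like any other function argument, so wfreqs[word] is looked up in the frame of wfreq1() before eval() ever sees the environment; that is where "object 'wfreqs' not found" comes from. Even with quote(), an environment whose parent is emptyenv() cannot find `[` or word. Second, hashing applies to the names of the variables stored in the environment, and the original class stores only one variable, "wfreqs", so the word lookup itself is still an ordinary named-vector lookup. A minimal sketch of one way around both problems, assuming it is acceptable to store each word as its own variable in the environment (the missing-word value NA_real_ is my own choice, not something from the original code):

```r
library(methods)

setClass("Lexicon", representation(e = "environment"))

# Store every word as its own variable, so the environment's hash table
# is what resolves the word.
setMethod("initialize", "Lexicon", function(.Object, wfreqs) {
    .Object@e <- new.env(hash = TRUE, parent = emptyenv())
    words <- names(wfreqs)
    for (i in seq_along(wfreqs)) {
        assign(words[i], wfreqs[[i]], envir = .Object@e)
    }
    .Object
})

# Look a word up directly in the environment; NA_real_ for unknown words
# is an assumption made for this sketch.
wfreq <- function(lexicon, word) {
    e <- lexicon@e
    if (exists(word, envir = e, inherits = FALSE)) {
        get(word, envir = e, inherits = FALSE)
    } else {
        NA_real_
    }
}

my.lexicon <- new("Lexicon", wfreqs = c("the" = 2, "person" = 1))
wfreq(my.lexicon, "the")
```

Passing inherits = FALSE keeps exists() and get() from searching any enclosing environments, which matters little with parent = emptyenv() but makes the intent explicit.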
>>> 
>>> Any advice would be much appreciated!
>>> 
>>> Best & many thanks in advance,
>>> 
>>> Roger
>>> 
>>> --
>>> 
>>> Roger Levy                      Email: rlevy at ucsd.edu
>>> Assistant Professor             Phone: 858-534-7219
>>> Department of Linguistics       Fax:   858-534-4789
>>> UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> 
>> 
