[R] Hashing and environments
Levy, Roger
rlevy at ucsd.edu
Sun Nov 7 05:08:04 CET 2010
Wow, that is perfect: the hash package is exactly what I needed. Thank you!
Roger
On Nov 6, 2010, at 4:09 PM, Kjetil Halvorsen wrote:
> Some of this can be automated using the CRAN package
> hash.
>
> Kjetil
>
> On Sat, Nov 6, 2010 at 10:43 PM, William Dunlap <wdunlap at tibco.com> wrote:
>> I would make an environment called wfreqsEnv
>> whose entry names are your words and whose entry
>> values are the information about the words. I find
>> it convenient to use [[ to make it appear to be
>> a list (instead of using exists(), assign(), and get()).
>> E.g., the following enters 100,000 words sampled from a
>> list of 17,576 and records their ID numbers and the
>> number of times each is found in the sample.
>>
>>> wfreqsEnv <- new.env(hash=TRUE, parent = emptyenv())
>>> words <- do.call("paste", c(list(sep=""), expand.grid(LETTERS, letters, letters)))
>> # length(words) == 17576
>>> set.seed(1)
>>> samp <- sample(seq_along(words), size=100000, replace=TRUE)
>>> system.time(for(i in samp) {
>> +     word <- words[i]
>> +     if (is.null(wfreqsEnv[[word]])) { # new entry
>> +         wfreqsEnv[[word]] <- list(Count=1, EntryNo=i)
>> +     } else { # update existing entry
>> +         wfreqsEnv[[word]]$Count <- wfreqsEnv[[word]]$Count + 1
>> +     }
>> + })
>>    user  system elapsed
>>    2.28    0.00    2.14
>> (The time, in seconds, is from an ancient Windows laptop, c. 2002.)
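For comparison, here is a minimal sketch of the same insert-or-update step written both ways: with exists(), assign(), and get(), and with the [[ form. The environment name env and the second example word "Abc" are illustrative assumptions and are not part of the timing run above.

```r
env <- new.env(hash = TRUE, parent = emptyenv())

# Function style: exists(), assign(), get()
if (!exists("Tuv", envir = env, inherits = FALSE)) {
  assign("Tuv", list(Count = 1, EntryNo = 14736), envir = env)
} else {
  entry <- get("Tuv", envir = env)
  entry$Count <- entry$Count + 1
  assign("Tuv", entry, envir = env)
}

# List style: [[ returns NULL for a name that is not in the environment
if (is.null(env[["Abc"]])) {
  env[["Abc"]] <- list(Count = 1, EntryNo = 1379)
} else {
  env[["Abc"]]$Count <- env[["Abc"]]$Count + 1
}

# Both entries are ordinary objects in the same environment
sort(ls(env))
```

Either style reaches the same hashed entries; the [[ form is simply shorter to write.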
>>
>> Here is a small check that we are getting what we expect:
>>> words[14736]
>> [1] "Tuv"
>>> wfreqsEnv[["Tuv"]]
>> $Count
>> [1] 8
>>
>> $EntryNo
>> [1] 14736
>>
>>> sum(samp==14736)
>> [1] 8
>>
>> If we do this with a non-hashed environment we get the same
>> answers, but the elapsed time is now 34.81 seconds instead of
>> 2.14. If you make wfreqsEnv a list instead of an environment,
>> then the time is 74.12 seconds (and the answers are the same).
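A scaled-down sketch of that three-way comparison, for anyone who wants to repeat it: the helper tally() and the smaller vocabulary and sample size are assumptions chosen so that it runs quickly. Timings will differ by machine and R version, so none are shown here.

```r
# Hypothetical helper: count tokens into `store`, which may be an
# environment or a list; both accept [[ and [[<- with a character name.
tally <- function(store, tokens) {
  for (w in tokens) {
    old <- store[[w]]                      # NULL if the word is new
    store[[w]] <- if (is.null(old)) 1L else old + 1L
  }
  store
}

set.seed(1)
vocab  <- do.call("paste", c(list(sep = ""), expand.grid(LETTERS, letters)))
tokens <- sample(vocab, size = 5000, replace = TRUE)

t_hashed   <- system.time(
  hashed   <- tally(new.env(hash = TRUE,  parent = emptyenv()), tokens))
t_unhashed <- system.time(
  unhashed <- tally(new.env(hash = FALSE, parent = emptyenv()), tokens))
t_list     <- system.time(
  aslist   <- tally(list(), tokens))

# The three containers should hold identical counts
same <- vapply(names(aslist), function(w) {
  identical(hashed[[w]], aslist[[w]]) && identical(unhashed[[w]], aslist[[w]])
}, logical(1))
all(same)

# Elapsed times, in seconds
rbind(hashed = t_hashed, unhashed = t_unhashed, list = t_list)[, "elapsed"]
```

At this size the differences may be small; increasing the vocabulary and the sample toward the sizes used above should make the gap easier to see.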
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org
>>> [mailto:r-help-bounces at r-project.org] On Behalf Of Levy, Roger
>>> Sent: Saturday, November 06, 2010 1:39 PM
>>> To: r-help at r-project.org
>>> Subject: [R] Hashing and environments
>>>
>>> Hi,
>>>
>>> I'm trying to write a general-purpose "lexicon" class and
>>> associated methods for storing and accessing information
>>> about large numbers of specific words (e.g., their
>>> frequencies in different genres). Crucial to making such a
>>> class practically useful is to get hashing working correctly
>>> so that information about specific words can be accessed
>>> quickly. But I've never really understood very well how
>>> hashing works, so I'm having trouble.
>>>
>>> Here is an example of what I've done so far:
>>>
>>> ***
>>>
>>> setClass("Lexicon",representation(e="environment"))
>>> setMethod("initialize","Lexicon",function(.Object,wfreqs) {
>>>   .Object@e <- new.env(hash=TRUE,parent=emptyenv())
>>>   assign("wfreqs",wfreqs,envir=.Object@e)
>>>   return(.Object)
>>> })
>>>
>>> ## function to access word frequencies
>>> wfreq <- function(lexicon,word) {
>>>   return(get("wfreqs",envir=lexicon@e)[word])
>>> }
>>>
>>> ## example of use
>>> my.lexicon <- new("Lexicon",wfreqs=c("the"=2,"person"=1))
>>> wfreq(my.lexicon,"the")
>>>
>>> ***
>>>
>>> However, testing indicates that the way I have set this up
>>> does not achieve the intended benefits of having the
>>> environment hashed:
>>>
>>> ***
>>>
>>> sample.wfreqs <- trunc(runif(1e5,max=100))
>>> names(sample.wfreqs) <- as.character(1:length(sample.wfreqs))
>>> lex <- new("Lexicon",wfreqs=sample.wfreqs)
>>> words.to.lookup <- trunc(runif(100,min=1,max=1e5))
>>> ## look up the words directly from the sample.wfreqs vector
>>> system.time({
>>> for(i in words.to.lookup)
>>> sample.wfreqs[as.character(i)]
>>> },gcFirst=TRUE)
>>> ## look up the words through the wfreq() function; time is approximately the same
>>> system.time({
>>> for(i in words.to.lookup)
>>> wfreq(lex,as.character(i))
>>> },gcFirst=TRUE)
>>>
>>> ***
>>>
>>> I'm guessing that the problem is that the indexing of the
>>> wfreqs vector in my wfreq() function is not happening inside
>>> the actual lexicon's environment. However, I have not been
>>> able to figure out the proper call to get the lookup to
>>> happen inside the lexicon's environment. I've tried
>>>
>>> wfreq1 <- function(lexicon,word) {
>>>   return(eval(wfreqs[word],envir=lexicon@e))
>>> }
>>>
>>> which I'd thought should work, but this gives me an error:
>>>
>>>> wfreq1(my.lexicon,'the')
>>> Error in eval(wfreqs[word], envir = lexicon@e) :
>>> object 'wfreqs' not found
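A likely explanation for this error is that R evaluates the first argument of eval() in the calling frame, where there is no object called wfreqs, before eval() ever consults envir. The sketch below shows lookups that do work on a stand-in environment e (an assumption standing in for the slot of the Lexicon object), and then stores one entry per word, since hashing applies to the names in the environment and not to the names of a vector stored inside it.

```r
# Stand-in for the Lexicon's environment: one object, `wfreqs`
e <- new.env(hash = TRUE, parent = emptyenv())
assign("wfreqs", c(the = 2, person = 1), envir = e)
word <- "the"

# Quote the symbol so it is looked up in `e`, then index the result in the
# calling frame. Evaluating the whole expression wfreqs[word] inside `e`
# would also fail here: the parent of `e` is emptyenv(), so neither `[`
# nor `word` can be found from it.
a <- eval(quote(wfreqs), envir = e)[word]

# Equivalent lookups without eval()
b  <- get("wfreqs", envir = e)[word]
c3 <- e[["wfreqs"]][word]

# All of the above still index a plain named vector. To use the hash
# table, store each word as its own entry in the environment.
lex   <- new.env(hash = TRUE, parent = emptyenv())
freqs <- e[["wfreqs"]]
for (w in names(freqs)) lex[[w]] <- freqs[[w]]
lex[[word]]
```

With one entry per word, lex[[word]] is a hashed lookup on the environment itself instead of a search through the names of a vector.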
>>>
>>> Any advice would be much appreciated!
>>>
>>> Best & many thanks in advance,
>>>
>>> Roger
>>>
>>> --
>>>
>>> Roger Levy Email: rlevy at ucsd.edu
>>> Assistant Professor Phone: 858-534-7219
>>> Department of Linguistics Fax: 858-534-4789
>>> UC San Diego Web: http://ling.ucsd.edu/~rlevy
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>