[R] How can this code be improved?

Richard R. Liu richard.liu at pueo-owl.ch
Fri Nov 13 01:18:16 CET 2009


Jim and Dennis,

Thanks for your suggestions.  Almost 24 hours later, the script has
finished a bit more than half the reports.  Free RAM varies between
1.2GB and a few MB.  I hesitate to interrupt it to implement the
improvements you have suggested, in case they do not cut the
execution time by at least an order of magnitude; however, I will
definitely implement and test both your improvements and mine.

Regards,
Richard

On Nov 13, 2009, at 0:53 , jim holtman wrote:

> Run the script on a small subset of the data and use Rprof to profile
> the code.  This will give you an idea of where time is being spent and
> where to focus for improvement.  I would suggest that you do not
> convert the output of 'table(t)' to a data frame.  You can just
> extract the 'names' to get the words.  You might be spending some of
> the time accessing the information in the data frame, which is
> really not necessary for your code.
>
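The profiling step suggested above could look roughly like the
following.  This is only a sketch: the toy 'tokens' and 'stopwords'
objects, the helper process.sentence(), and the subset size are made up
for illustration and are not from the original script.  Only Rprof()
and summaryRprof() are standard R.

```r
# Hypothetical toy data standing in for the real 'tokens' and 'stopwords'.
set.seed(1)
vocab <- c("the", "a", "of", "Report,", "value.", "patient", "(test)")
stopwords <- c("the", "a", "of")
tokens <- replicate(20,
                    replicate(50, sample(vocab, 30, replace = TRUE),
                              simplify = FALSE),
                    simplify = FALSE)

# The per-sentence work of the original loop, wrapped in a (hypothetical)
# function so that it shows up by name in the profile.
process.sentence <- function(s) {
  w <- tolower(s)
  w <- sub("^[[:punct:]]*", "", w)
  w <- sub("[[:punct:]]*$", "", w)
  as.data.frame(table(w))
}

small <- tokens[1:5]                     # profile a small subset only
prof.file <- tempfile(fileext = ".out")
Rprof(prof.file)                         # start the sampling profiler
out <- lapply(small, function(doc) lapply(doc, process.sentence))
Rprof(NULL)                              # stop profiling

# summaryRprof() stops with an error if the run was too short to record
# any samples, so the call is guarded in this toy example.
prof <- tryCatch(summaryRprof(prof.file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

The 'by.self' table lists the functions in which the most time was
spent; with real data the subset should be large enough to run for at
least a few seconds so that the profiler collects a useful number of
samples.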
> On Thu, Nov 12, 2009 at 2:12 AM, Richard R. Liu <richard.liu at pueo-owl.ch> wrote:
>> I am running the following code on a MacBook Pro 17" Unibody early
>> 2009 with 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in
>> 64-bit mode.
>>
>> freq.stopwords <- numeric(0)
>> freq.nonstopwords <- numeric(0)
>> token.tables <- list(0)
>> i.ss <- c(0)
>> cat("Beginning at ", date(), ".\n")
>> for (i.d in 1:length(tokens)) {
>>        tt <- list(0)
>>        for (i.s in 1:length(tokens[[i.d]])) {
>>                t <- tolower(tokens[[i.d]][[i.s]])
>>                t <- sub("^[[:punct:]]*", "", t)
>>                t <- sub("[[:punct:]]*$", "", t)
>>                t <- as.data.frame(table(t))
>>                i.m <- match(t$t, stopwords)
>>                i.m.is.na <- is.na(i.m)
>>                i.ss <- i.ss + 1
>>                freq.stopwords[i.ss] <- sum(t$Freq * !i.m.is.na)
>>                freq.nonstopwords[i.ss] <- sum(t$Freq * i.m.is.na)
>>                tt[[i.s]] <- data.frame(token = t$t, freq = t$Freq,
>>                                        matches.stopword = i.m)
>>        }
>>        token.tables[[i.d]] <- tt
>>        if (i.d %% 5 == 0) cat(i.d, "reports completed at ", date(), ".\n")
>> }
>> cat("Terminating at ", date(), ".\n")
>>
>> The objects in the innermost loop are:
>> * tokens:  a list of lists.  In the expression tokens[[i.d]][[i.s]],
>> the first index runs over 1697 reports, the second over the sentences
>> in the report, each of which consists of a vector of tokens, i.e., the
>> character strings between the white spaces in the sentence.  One of
>> the largest reports takes up 58MB on the hard disk.  Thus, the number
>> of sentences can be quite large, and some of the sentences are quite
>> long (measured in tokens as well as in characters).
>> * stopwords:  a vector of 571 words that occur very often in written
>> English.
>>
>> The code operates on sentences, converting each token in the sentence
>> to lowercase, removing punctuation at the beginning and end of the
>> token, tabulating the frequency of the unique tokens, and generating
>> an array that indicates which tokens correspond to stopwords.  It also
>> sums the frequencies of the stopwords and those of the non-stopwords.
>> The result is a list of lists of data frames.
>>
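The per-sentence steps described above can be sketched without the
data frame conversion, reading the words from names() of the table as
suggested earlier in the thread.  The example sentence and stopword
list are invented for illustration, and %in% is used here in place of
the original match() call, so this is a variant and not the posted
code.

```r
# Hypothetical example sentence and stopword list (not from the post).
sentence  <- c("The", "cat", "sat", "on", "the", "(old)", "mat.")
stopwords <- c("the", "on", "a")

w <- tolower(sentence)
w <- sub("^[[:punct:]]*", "", w)       # strip leading punctuation
w <- sub("[[:punct:]]*$", "", w)       # strip trailing punctuation

tab     <- table(w)                    # frequency of each unique token
is.stop <- names(tab) %in% stopwords   # TRUE where the token is a stopword

freq.stop    <- sum(tab[is.stop])      # total stopword occurrences
freq.nonstop <- sum(tab[!is.stop])     # total non-stopword occurrences
```

If the position of each token in the stopword list is needed, as in the
original 'matches.stopword' column, match(names(tab), stopwords) gives
it directly from the table names.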
>> I began running on Thursday Nov. 12, 2009 at 01:56:36.  As of
>> 7:52:00, 510 reports had been processed.  The Activity Monitor
>> indicates no memory bottleneck.  R is using 4.31 GB of real memory,
>> 7.23 GB of virtual memory, and 1.67 GB of real memory are free.
>>
>> I admit that I am an R newbie.  From my understanding of the "apply"
>> functions (e.g., lapply), I see no way to use them to simplify the
>> loops.  I would appreciate any suggestions about making the code more
>> "R-like" and, above all, much faster.
>>
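One possible way to express the two nested loops with the "apply"
family is sketched below.  It is an assumption about how the loops
could be restructured, not a tested replacement for the posted script:
the toy 'tokens' and 'stopwords' and the helper tabulate.sentence() are
hypothetical.  The two frequency vectors are computed in one pass each
instead of being grown one element at a time.

```r
# Hypothetical toy input: 2 reports, each a list of tokenized sentences.
tokens <- list(
  list(c("The", "cat", "sat."), c("A", "dog,", "a", "cat.")),
  list(c("It", "rained", "all", "day."))
)
stopwords <- c("the", "a", "it", "all")

# Tabulate one sentence; returns a data frame shaped like the one
# built inside the original loop.
tabulate.sentence <- function(s) {
  w <- tolower(s)
  w <- sub("^[[:punct:]]*", "", w)
  w <- sub("[[:punct:]]*$", "", w)
  tab <- table(w)
  data.frame(token = names(tab), freq = as.vector(tab),
             matches.stopword = match(names(tab), stopwords),
             stringsAsFactors = FALSE)
}

# Outer lapply over reports, inner lapply over sentences.
token.tables <- lapply(tokens, function(doc) lapply(doc, tabulate.sentence))

# Flatten one level to get one data frame per sentence, then sum the
# stopword and non-stopword frequencies for each sentence.
flat <- unlist(token.tables, recursive = FALSE)
freq.stopwords <- vapply(
  flat, function(d) sum(d$freq[!is.na(d$matches.stopword)]), numeric(1))
freq.nonstopwords <- vapply(
  flat, function(d) sum(d$freq[is.na(d$matches.stopword)]), numeric(1))
```

Whether this is actually faster on the real data would have to be
measured; it mainly removes the explicit index bookkeeping and the
repeated extension of the result vectors.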
>> Regards,
>> Richard
>
>
>
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
