[R] How can this code be improved?
Richard R. Liu
richard.liu at pueo-owl.ch
Fri Nov 13 16:26:19 CET 2009
Jim, Dennis,
Once again, thanks for all your suggestions. After developing a more R-like
version of the script, I terminated the running one after 976 of the 1697
reports had been processed. At that point it had been running for approx.
33.5 hours! Here is the new version:
library(filehash)

db <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_TXT", type = "RDS")
dbLoad(db)
dba <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_Aux", type = "RDS")
dbLoad(dba)

tokens    <- sentences.all.tokenized
stopwords <- stopwords.pubmed

# Convert to lowercase, remove beginning and end punctuation, tabulate
my.func <- function(sent, stop, ...) {
    list(
        freq.table = (temp.table <- table(
            sub("[[:punct:]]*$", "",
                sub("^[[:punct:]]*", "", tolower(sent))))),
        stopword.matches = (temp.matches <- match(names(temp.table), stop)),
        stopword.summary = array(tapply(temp.table, !is.na(temp.matches), sum),
                                 dim = 2,
                                 dimnames = list(c("no.non.stopwords",
                                                   "no.stopwords")))
    )
}

cat("Beginning at ", date(), ".\n", sep = "")
token.tables <-
    lapply(1:length(tokens),
           function(i.d, doc, stop, func, ...) {
               if ((i.d - 1) %% 10 == 0)
                   cat((i.d - 1), " report(s) completed at ", date(), ".\n", sep = "")
               lapply(1:length(doc[[i.d]]),
                      function(i.s, sent, stop, func, ...) {
                          func(sent[[i.s]], stop, ...)
                      },
                      sent = doc[[i.d]], stop = stop, func = func, ...)
           },
           doc = tokens, stop = stopwords, func = my.func)
cat("Terminating at ", date(), ".\n", sep = "")
This version reaches the same point in approx. 1 hour 9 minutes, a little
under 70 minutes!
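Before tuning further, profiling a small subset with Rprof, as Jim suggests
below, would look roughly like this (tokens.small and the output file name
are only illustrative):

tokens.small <- tokens[1:10]
Rprof("token-profile.out")
trial <- lapply(tokens.small, function(doc)
    lapply(doc, my.func, stop = stopwords))
Rprof(NULL)
summaryRprof("token-profile.out")$by.self

The by.self table shows which functions consume the most time themselves,
which is where any further improvement should focus.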
What I am noticing now is a severe shortage of real memory. Activity Monitor
shows about 20 MB of real memory free; R, running in 64-bit mode, is using
6.75 GB of real and 10 GB of virtual memory. I see lots of disk activity,
undoubtedly swapping between real and virtual memory, while CPU activity is
very low. I suppose I could run the script twice, each time on half the
tokens. That would give me two lists, which I would then have to combine into
a single one, e.g., as sketched below.
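A sketch of that approach (the object and file names are only illustrative):
run each half in its own R session, save the result, and combine in a fresh
session. c() on two lists simply concatenates them, so the report order is
preserved.

half <- ceiling(length(tokens) / 2)
# Session 1: run the lapply() above over tokens[1:half] and save the result:
#     save(token.tables.1, file = "token_tables_1.RData")
# Session 2: likewise over tokens[(half + 1):length(tokens)], saved as token.tables.2
load("token_tables_1.RData")
load("token_tables_2.RData")
token.tables <- c(token.tables.1, token.tables.2)  # concatenate the two lists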
Regards,
Richard
On Thu, 12 Nov 2009 18:53:34 -0500, jim holtman wrote
> Run the script on a small subset of the data and use Rprof to profile
> the code. This will give you an idea of where time is being spent
> and where to focus for improvement. I would suggest that you do not
> convert the output of 'table(t)' to a dataframe. You can just
> extract the 'names' to get the words. You might be spending some of
> the time accessing the information in the dataframe, which is
> really not necessary for your code.
>
> On Thu, Nov 12, 2009 at 2:12 AM, Richard R. Liu <richard.liu at pueo-
> owl.ch> wrote:
> > I am running the following code on a MacBook Pro 17" Unibody early 2009 with
> > 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in 64-bit mode.
> >
> > freq.stopwords <- numeric(0)
> > freq.nonstopwords <- numeric(0)
> > token.tables <- list(0)
> > i.ss <- c(0)
> > cat("Beginning at ", date(), ".\n")
> > for (i.d in 1:length(tokens)) {
> >     tt <- list(0)
> >     for (i.s in 1:length(tokens[[i.d]])) {
> >         t <- tolower(tokens[[i.d]][[i.s]])
> >         t <- sub("^[[:punct:]]*", "", t)
> >         t <- sub("[[:punct:]]*$", "", t)
> >         t <- as.data.frame(table(t))
> >         i.m <- match(t$t, stopwords)
> >         i.m.is.na <- is.na(i.m)
> >         i.ss <- i.ss + 1
> >         freq.stopwords[i.ss] <- sum(t$Freq * !i.m.is.na)
> >         freq.nonstopwords[i.ss] <- sum(t$Freq * i.m.is.na)
> >         tt[[i.s]] <- data.frame(token = t$t, freq = t$Freq,
> >                                 matches.stopword = i.m)
> >     }
> >     token.tables[[i.d]] <- tt
> >     if (i.d %% 5 == 0) cat(i.d, "reports completed at ", date(), ".\n")
> > }
> > cat("Terminating at ", date(), ".\n")
> >
> > The objects in the innermost loop are:
> > * tokens: a list of lists. In the expression tokens[[i.d]][[i.s]], the
> > first index runs over 1697 reports, the second over the sentences in the
> > report, each of which consists of a vector of tokens, i.e., the character
> > strings between the white spaces in the sentence. One of the largest
> > reports takes up 58 MB on the hard disk. Thus, the number of sentences can be
> > quite large, and some of the sentences are quite long (measured in tokens as
> > well as in characters).
> > * stopwords: a vector of 571 words that occur very often in written
> > English.
> >
> > The code operates on sentences, converting each token in the sentence to
> > lowercase, removing punctuation at the beginning and end of the token,
> > tabulating the frequency of the unique tokens, and generating an array that
> > indicates which tokens correspond to stopwords. It also sums the
> > frequencies of the stopwords and those of the non-stopwords. The result is
> > a list of lists of dataframes.
> >
> > I began running on Thursday, Nov. 12, 2009 at 01:56:36. As of 7:52:00,
> > 510 reports had been processed. Activity Monitor indicates no memory
> > bottleneck: R is using 4.31 GB of real memory and 7.23 GB of virtual memory,
> > and 1.67 GB of real memory are free.
> >
> > I admit that I am an R newbie. From my understanding of the "apply"
> > functions (e.g., lapply), I see no way to use them to simplify the loops. I
> > would appreciate any suggestions about making the code more "R-like" and,
> > above all, much faster.
> >
> > Regards,
> > Richard
> >
> >
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
--
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland
Tel.: +41 61 331 10 47
Email: richard.liu at pueo-owl.ch