[R] How can this code be improved?
Richard R. Liu
richard.liu at pueo-owl.ch
Fri Nov 13 16:26:19 CET 2009
Jim, Dennis,
Once again, thanks for all your suggestions. After developing a more R-like
version of the script, I terminated the running one after 976 of the 1697
reports had been processed. At that point it had been running for approx.
33.5 hours! Here is the new version:
library(filehash)

db <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_TXT", type = "RDS")
dbLoad(db)
dba <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_Aux", type = "RDS")
dbLoad(dba)

tokens    <- sentences.all.tokenized
stopwords <- stopwords.pubmed

# Convert to lowercase, remove beginning and end punctuation, tabulate
my.func <- function(sent, stop, ...) {
    list(
        freq.table = (temp.table <- table(
            sub("[[:punct:]]*$", "",
                sub("^[[:punct:]]*", "", tolower(sent))))),
        stopword.matches = (temp.matches <- match(names(temp.table), stop)),
        stopword.summary = array(tapply(temp.table, !is.na(temp.matches), sum),
                                 dim = 2,
                                 dimnames = list(c("no.non.stopwords",
                                                   "no.stopwords")))
    )
}

cat("Beginning at ", date(), ".\n", sep = "")
token.tables <-
    lapply(1:length(tokens),
           function(i.d, doc, stop, func, ...) {
               if ((i.d - 1) %% 10 == 0)
                   cat((i.d - 1), " report(s) completed at ", date(), ".\n", sep = "")
               lapply(1:length(doc[[i.d]]),
                      function(i.s, sent, stop, func, ...) {
                          func(sent[[i.s]], stop, ...)
                      },
                      sent = doc[[i.d]], stop = stop, func = func, ...)
           },
           doc = tokens, stop = stopwords, func = my.func)
cat("Terminating at ", date(), ".\n", sep = "")
This version reaches the same point in approx. 1 hour 9 minutes, a little
under 70 minutes!
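Before tuning further, profiling a small subset with Rprof, as Jim suggests
below, would look roughly like this (tokens.small and the output file name
are only illustrative):

tokens.small <- tokens[1:10]
Rprof("token-profile.out")
trial <- lapply(tokens.small, function(doc)
    lapply(doc, my.func, stop = stopwords))
Rprof(NULL)
summaryRprof("token-profile.out")$by.self

The by.self table shows which functions consume the most time themselves,
which is where any further improvement should focus.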
What I am noticing now is a severe shortage of real memory. Activity Monitor
shows about 20 MB of real memory free; R, running in 64-bit mode, is using
6.75 GB of real and 10 GB of virtual memory. I see lots of disk activity,
undoubtedly swapping between real and virtual memory, while CPU activity is
very low. I suppose I could run the script twice, each time on half the
tokens. That would give me two lists, which I would then have to combine into
a single one, e.g., as sketched below.
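A sketch of that approach (the object and file names are only illustrative):
run each half in its own R session, save the result, and combine in a fresh
session. c() on two lists simply concatenates them, so the report order is
preserved.

half <- ceiling(length(tokens) / 2)
# Session 1: run the lapply() above over tokens[1:half] and save the result:
#     save(token.tables.1, file = "token_tables_1.RData")
# Session 2: likewise over tokens[(half + 1):length(tokens)], saved as token.tables.2
load("token_tables_1.RData")
load("token_tables_2.RData")
token.tables <- c(token.tables.1, token.tables.2)  # concatenate the two lists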
Regards,
Richard
On Thu, 12 Nov 2009 18:53:34 -0500, jim holtman wrote
> Run the script on a small subset of the data and use Rprof to profile
> the code. This will give you an idea of where time is being spent
> and where to focus for improvement. I would suggest that you do not
> convert the output of 'table(t)' to a dataframe. You can just
> extract the 'names' to get the words. You might be spending some of
> the time accessing the information in the dataframe, which is
> really not necessary for your code.
>
> On Thu, Nov 12, 2009 at 2:12 AM, Richard R. Liu <richard.liu at pueo-
> owl.ch> wrote:
> > I am running the following code on a MacBook Pro 17" Unibody early 2009 with
> > 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in 64-bit mode.
> >
> > freq.stopwords <- numeric(0)
> > freq.nonstopwords <- numeric(0)
> > token.tables <- list(0)
> > i.ss <- c(0)
> > cat("Beginning at ", date(), ".\n")
> > for (i.d in 1:length(tokens)) {
> >     tt <- list(0)
> >     for (i.s in 1:length(tokens[[i.d]])) {
> >         t <- tolower(tokens[[i.d]][[i.s]])
> >         t <- sub("^[[:punct:]]*", "", t)
> >         t <- sub("[[:punct:]]*$", "", t)
> >         t <- as.data.frame(table(t))
> >         i.m <- match(t$t, stopwords)
> >         i.m.is.na <- is.na(i.m)
> >         i.ss <- i.ss + 1
> >         freq.stopwords[i.ss] <- sum(t$Freq * !i.m.is.na)
> >         freq.nonstopwords[i.ss] <- sum(t$Freq * i.m.is.na)
> >         tt[[i.s]] <- data.frame(token = t$t, freq = t$Freq,
> >                                 matches.stopword = i.m)
> >     }
> >     token.tables[[i.d]] <- tt
> >     if (i.d %% 5 == 0) cat(i.d, "reports completed at ", date(), ".\n")
> > }
> > cat("Terminating at ", date(), ".\n")
> >
> > The objects in the innermost loop are:
> > * tokens: a list of lists. In the expression tokens[[i.d]][[i.s]], the
> > first index runs over 1697 reports, the second over the sentences in the
> > report, each of which consists of a vector of tokens, i.e., the character
> > strings between the white spaces in the sentence. One of the largest
> > reports takes up 58 MB on the hard disk. Thus, the number of sentences can be
> > quite large, and some of the sentences are quite long (measured in tokens as
> > well as in characters).
> > * stopwords: a vector of 571 words that occur very often in written
> > English.
> >
> > The code operates on sentences, converting each token in the sentence to
> > lowercase, removing punctuation at the beginning and end of the token,
> > tabulating the frequency of the unique tokens, and generating an array that
> > indicates which tokens correspond to stopwords. It also sums the
> > frequencies of the stopwords and those of the non-stopwords. The result is
> > a list of lists of dataframes.
> >
> > I began running on Thursday, Nov. 12, 2009 at 01:56:36. As of 7:52:00,
> > 510 reports had been processed. Activity Monitor indicates no memory
> > bottleneck: R is using 4.31 GB of real memory and 7.23 GB of virtual memory,
> > and 1.67 GB of real memory are free.
> >
> > I admit that I am an R newbie. From my understanding of the "apply"
> > functions (e.g., lapply), I see no way to use them to simplify the loops. I
> > would appreciate any suggestions about making the code more "R-like" and,
> > above all, much faster.
> >
> > Regards,
> > Richard
> >
> >
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
--
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland
Tel.: +41 61 331 10 47
Email: richard.liu at pueo-owl.ch