[R] Memory usage in R grows considerably while calculating word frequencies

arun smartpink111 at yahoo.com
Tue Sep 25 04:59:35 CEST 2012


Hi,

In the previous email, I forgot to add unlist().
The test file has four paragraphs, with these word counts per paragraph:
sapply(strsplit(txt1, " "), length)
#[1] 4850 9072 6400 2071
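
(A quick illustration with a made-up two-line input, just to show why
unlist() matters here: strsplit() and regmatches() return a list with one
element per line, while table() wants a single character vector.)
x <- c("the cat sat", "the dog ran")   # hypothetical input
strsplit(x, " ")                       # list of length 2
unlist(strsplit(x, " "))               # one vector of 6 words
table(unlist(strsplit(x, " ")))        # counts across both lines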


#Your code:
system.time({
  txt1 <- tolower(scan("text_file", "character", sep = "\n"))
  pattern <- "(\\b[A-Za-z]+\\b)"
  match <- gregexpr(pattern, txt1)
  words.txt <- regmatches(txt1, match)
  words.txt <- unlist(words.txt)
  words.txt <- table(words.txt, dnn = "words")
  words.txt <- sort(words.txt, decreasing = TRUE)
  words.txt <- paste(names(words.txt), words.txt, sep = "\t")
  cat("Word\tFREQ", words.txt, file = "frequencies", sep = "\n")
})

#Read 4 items
#   user  system elapsed 
# 11.781   0.004  11.799 


#Modified code:
system.time({
  txt1 <- tolower(scan("text_file", "character", sep = "\n"))
  words.txt <- sort(table(unlist(strsplit(txt1, "\\s"))), decreasing = TRUE)
  words.txt <- paste(names(words.txt), words.txt, sep = "\t")
  cat("Word\tFREQ", words.txt, file = "frequencies", sep = "\n")
})
#Read 4 items
#   user  system elapsed 
#  0.036   0.008   0.043 
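
If the file itself is too big to read in at once (the 16 GB case), one
option is to read it through a connection in blocks of lines and keep a
running count, so the whole text never has to sit in memory. This is only
an untested sketch; "text_file" and the block size are placeholders:

con <- file("text_file", "r")
counts <- integer(0)                     # running named count vector
repeat {
  chunk <- readLines(con, n = 10000)     # read 10,000 lines at a time
  if (length(chunk) == 0) break
  w <- unlist(strsplit(tolower(chunk), "\\s+"))
  w <- w[nzchar(w)]                      # drop empty strings
  tab <- table(w)
  prev <- counts[names(tab)]             # previous counts (NA if unseen)
  prev[is.na(prev)] <- 0L
  counts[names(tab)] <- prev + as.integer(tab)
}
close(con)
counts <- sort(counts, decreasing = TRUE)
cat("Word\tFREQ", paste(names(counts), counts, sep = "\t"),
    file = "frequencies", sep = "\n")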


A.K.




----- Original Message -----
From: mcelis <mcelis at lightminersystems.com>
To: r-help at r-project.org
Cc: 
Sent: Monday, September 24, 2012 7:29 PM
Subject: [R] Memory usage in R grows considerably while calculating word frequencies

I am working with some large text files (up to 16 GBytes). I am interested
in extracting the words and counting how many times each word appears in
the text. I have written a very simple R program by following some
suggestions and examples I found online.

If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory
when executing the program on a 64-bit system running CentOS 6.3. Why is R
using so much memory? Is there a better way to do this that will minimize
memory usage?

I am very new to R, so I would appreciate some tips on how to improve my
program or a better way to do it.

R program:
# Read in the entire file and convert all words in text to lower case
words.txt<-tolower(scan("text_file","character",sep="\n"))

# Extract words
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,words.txt)
words.txt <- regmatches(words.txt,match)

# Create a vector from the list of words
words.txt<-unlist(words.txt)

# Calculate word frequencies
words.txt<-table(words.txt,dnn="words")

# Sort by frequency, not alphabetically
words.txt<-sort(words.txt,decreasing=TRUE)

# Put into some readable form, "Name of word" and "Number of times it occurs"
words.txt<-paste(names(words.txt),words.txt,sep="\t")

# Results to a file
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")



