[R] Memory usage in R grows considerably while calculating word frequencies

Milan Bouchet-Valat nalimilan at club.fr
Tue Sep 25 14:08:58 CEST 2012


On Monday, September 24, 2012 at 16:29 -0700, mcelis wrote:
> I am working with some large text files (up to 16 GBytes). I am interested
> in extracting the words and counting how many times each word appears in
> the text. I have written a very simple R program by following some
> suggestions and examples I found online.
> 
> If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory
> when executing the program on a 64-bit system running CentOS 6.3. Why is R
> using so much memory? Is there a better way to do this that will minimize
> memory usage?
> 
> I am very new to R, so I would appreciate some tips on how to improve my
> program or a better way to do it.
First, I think you should have a look at the tm package by Ingo
Feinerer. It will help you import the texts, optionally run
processing steps on them, and then extract the words and create a
document-term matrix counting their frequencies. No need to reinvent the
wheel.
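
For instance, something along these lines (an untested sketch; I'm
assuming one document per line of the input file, and row_sums() comes
from the slam package that tm builds on):

library(tm)
library(slam)   # tm stores its matrices in slam's sparse format

# One pseudo-document per line of the file
corpus <- Corpus(VectorSource(readLines("text_file")))

# Lowercasing and punctuation removal are standard control options
tdm <- TermDocumentMatrix(corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE))

# Sum the counts over documents without creating a dense matrix
freq <- sort(row_sums(tdm), decreasing = TRUE)

# Same output format as your script
cat("Word\tFREQ", paste(names(freq), freq, sep = "\t"),
    file = "frequencies", sep = "\n")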

Second, there's nothing wrong with using RAM as long as it's available.
If other programs need it, the Linux kernel will reclaim it. There's a
problem only if R's memory use does not decrease at that point. Use gc()
to check whether the RAM allocated to R is really in use. Note that your
script holds several large intermediate objects at the same time (the
text returned by scan(), the match positions from gregexpr(), the list
of words from regmatches(), and the unlisted vector), so a peak several
times the size of the input file is not surprising. But tm should
improve the efficiency of the computations.
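
For example, something like:

gc()               # the "used" column shows memory actually in use,
                   # "max used" the peak since the session started
rm(words.txt)      # drop the big object from your script...
gc(reset = TRUE)   # ...collect it, and reset the peak counters

If you would rather keep a hand-rolled version, you can also bound
memory by streaming the file in chunks, so that only the running table
of counts stays in RAM. A rough, untested sketch (the chunk size of
10000 lines is arbitrary):

con <- file("text_file", open = "r")
counts <- integer(0)                  # named vector of running totals

repeat {
    lines <- readLines(con, n = 10000)
    if (length(lines) == 0) break
    low <- tolower(lines)
    words <- unlist(regmatches(low, gregexpr("[a-z]+", low)))
    tab <- table(words)
    old <- counts[names(tab)]         # NA for words not seen before
    old[is.na(old)] <- 0L
    counts[names(tab)] <- old + as.vector(tab)
}
close(con)

counts <- sort(counts, decreasing = TRUE)
cat("Word\tFREQ", paste(names(counts), counts, sep = "\t"),
    file = "frequencies", sep = "\n")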


My two cents

> R program:
> # Read in the entire file and convert all words in the text to lower case
> words.txt <- tolower(scan("text_file", what = "character", sep = "\n"))
> 
> # Extract words
> pattern <- "(\\b[A-Za-z]+\\b)"
> match <- gregexpr(pattern, words.txt)
> words.txt <- regmatches(words.txt, match)
> 
> # Create a vector from the list of words
> words.txt <- unlist(words.txt)
> 
> # Calculate word frequencies
> words.txt <- table(words.txt, dnn = "words")
> 
> # Sort by frequency, not alphabetically
> words.txt <- sort(words.txt, decreasing = TRUE)
> 
> # Put into some readable form: "Name of word" and "Number of times it occurs"
> words.txt <- paste(names(words.txt), words.txt, sep = "\t")
> 
> # Results to a file
> cat("Word\tFREQ", words.txt, file = "frequencies", sep = "\n")



