[R] Memory usage in R grows considerably while calculating word frequencies
arun
smartpink111 at yahoo.com
Tue Sep 25 20:28:17 CEST 2012
Dear Martin,
Thanks for testing the code. You are right; I have modified the code.
Testing it on a sample text:
txt1 <- "Romney A.K. different, (= than other people. Is it?"
OP's code:
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern, txt1)
words.txt <- regmatches(txt1, match)
words.txt <- unlist(words.txt)
words.txt <- table(words.txt, dnn = "words")
words.txt <- sort(words.txt, decreasing = TRUE)
words.txt
# words
#      A different     Is     it      K  other people Romney   than
#      1         1      1      1      1      1      1      1      1
My code:
words.txt1 <- sort(table(gsub("\\W", "", unlist(strsplit(tolower(txt1), "\\s")))[
    grepl("\\b\\w+\\b", gsub("\\W", "", unlist(strsplit(tolower(txt1), "\\s"))))]))
words.txt1
#     ak different     is     it  other people romney   than
#      1         1      1      1      1      1      1      1
Here, as you can see, the OP's code splits A.K. into two words ("A" and "K"), while my code joins them into "ak". I didn't fix that, because the main concern is minimizing memory usage.
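If one wanted to keep dotted abbreviations together, a small change to the OP's pattern would do it. This variant (my own suggestion, not tested beyond the sample; note it needs perl = TRUE and returns "A.K" without the trailing period) allows letters joined by periods:
pattern2 <- "\\b[A-Za-z]+(?:\\.[A-Za-z]+)*\\b"   # letters, optionally joined by "."
match2 <- gregexpr(pattern2, txt1, perl = TRUE)
sort(table(unlist(regmatches(txt1, match2))), decreasing = TRUE)
# "A.K" now comes out as a single token instead of "A" and "K"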
I tested the new code again, this time on a larger text of four paragraphs, 22393 words in total:
sapply(strsplit(txt1, " "), length)
# [1] 4850 9072 6400 2071
sum(sapply(strsplit(txt1, " "), length))
# [1] 22393
# OP's code:
system.time({
    txt1 <- tolower(scan("text_file", "character", sep = "\n"))
    pattern <- "(\\b[A-Za-z]+\\b)"
    match <- gregexpr(pattern, txt1)
    words.txt <- regmatches(txt1, match)
    words.txt <- unlist(words.txt)
    words.txt <- table(words.txt, dnn = "words")
    words.txt <- sort(words.txt, decreasing = TRUE)
    words.txt <- paste(names(words.txt), words.txt, sep = "\t")
    cat("Word\tFREQ", words.txt, file = "frequencies", sep = "\n")
})
# Read 4 items
#    user  system elapsed
#  12.056   0.000  12.066
# My code:
system.time({
    txt1 <- tolower(scan("text_file", "character", sep = "\n"))
    words.txt <- sort(table(gsub("\\W", "", unlist(strsplit(tolower(txt1), "\\s")))[
        grepl("\\b\\w+\\b", gsub("\\W", "", unlist(strsplit(tolower(txt1), "\\s"))))]),
        decreasing = TRUE)
    words.txt <- paste(names(words.txt), words.txt, sep = "\t")
    cat("Word\tFREQ", words.txt, file = "frequencies", sep = "\n")
})
# Read 4 items
#    user  system elapsed
#   0.148   0.000   0.150
There is a clear improvement in speed, and the output looks similar. The code could probably still be improved.
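One further idea, beyond what is shown above (a sketch only, untested on large files; the chunk size and the tapply() merge are my own choices): read the file in blocks and keep only a running count, so that memory stays bounded even for files of several GBytes:
con <- file("text_file", open = "r")
total <- integer(0)                      # running, named word counts
repeat {
    lines <- scan(con, "character", sep = "\n", n = 10000, quiet = TRUE)
    if (length(lines) == 0) break        # end of file reached
    w <- gsub("\\W", "", unlist(strsplit(tolower(lines), "\\s")))
    w <- w[nzchar(w)]                    # drop tokens emptied by gsub()
    tab <- table(w)
    # merge this chunk's counts into the running total, by word
    total <- tapply(c(as.integer(total), as.integer(tab)),
                    c(names(total), names(tab)), sum)
}
close(con)
words.txt <- sort(total, decreasing = TRUE)
cat("Word\tFREQ", paste(names(words.txt), words.txt, sep = "\t"),
    file = "frequencies", sep = "\n")
This never holds more than one chunk of raw text in memory at a time, at the cost of rebuilding the count vector for every chunk.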
A.K.
----- Original Message -----
From: Martin Maechler <maechler at stat.math.ethz.ch>
To: arun <smartpink111 at yahoo.com>
Cc: mcelis <mcelis at lightminersystems.com>; R help <r-help at r-project.org>
Sent: Tuesday, September 25, 2012 9:07 AM
Subject: Re: [R] Memory usage in R grows considerably while calculating word frequencies
>>>>> arun <smartpink111 at yahoo.com>
>>>>> on Mon, 24 Sep 2012 19:59:35 -0700 writes:
> HI,
> In the previous email, I forgot to add unlist().
> With four paragraphs,
> sapply(strsplit(txt1," "),length)
> #[1] 4850 9072 6400 2071
> #Your code:
> system.time({
> txt1<-tolower(scan("text_file","character",sep="\n"))
> pattern <- "(\\b[A-Za-z]+\\b)"
> match <- gregexpr(pattern,txt1)
> words.txt <- regmatches(txt1,match)
> words.txt<-unlist(words.txt)
> words.txt<-table(words.txt,dnn="words")
> words.txt<-sort(words.txt,decreasing=TRUE)
> words.txt<-paste(names(words.txt),words.txt,sep="\t")
> cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
> })
> #Read 4 items
> # user system elapsed
> # 11.781 0.004 11.799
> #Modified code:
> system.time({
> txt1<-tolower(scan("text_file","character",sep="\n"))
> words.txt<-sort(table(unlist(strsplit(tolower(txt1),"\\s"))),decreasing=TRUE)
> words.txt<-paste(names(words.txt),words.txt,sep="\t")
> cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
> })
> #Read 4 items
> #user system elapsed
> # 0.036 0.008 0.043
> A.K.
Well, dear A.K., your definition of "word" is really different,
and in my view clearly much too simplistic, compared to what the
OP (= original poster) asked for.
E.g., from the above paragraph, your method will wrongly count
tokens such as "A.K.," "different," or "(=" as words.
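A quick illustration (constructed here from the words in question):
unlist(strsplit(tolower("the OP (= original-poster) asked for something different, A.K."), "\\s"))
# [1] "the"              "op"               "(="
# [4] "original-poster)" "asked"            "for"
# [7] "something"        "different,"       "a.k."
Splitting on whitespace alone keeps all the punctuation attached to the tokens.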
Martin Maechler, ETH Zurich
> ----- Original Message -----
> From: mcelis <mcelis at lightminersystems.com>
> To: r-help at r-project.org
> Cc:
> Sent: Monday, September 24, 2012 7:29 PM
> Subject: [R] Memory usage in R grows considerably while calculating word frequencies
> I am working with some large text files (up to 16 GBytes). I am interested
> in extracting the words and counting each time each word appears in the
> text. I have written a very simple R program by following some suggestions
> and examples I found online.
> If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory
> when executing the program on
> a 64-bit system running CentOS 6.3. Why is R using so much memory? Is there
> a better way to do this that will
> minimize memory usage.
> I am very new to R, so I would appreciate some tips on how to improve my
> program or a better way to do it.
> R program:
> # Read in the entire file and convert all words in text to lower case
> words.txt<-tolower(scan("text_file","character",sep="\n"))
> # Extract words
> pattern <- "(\\b[A-Za-z]+\\b)"
> match <- gregexpr(pattern,words.txt)
> words.txt <- regmatches(words.txt,match)
> # Create a vector from the list of words
> words.txt<-unlist(words.txt)
> # Calculate word frequencies
> words.txt<-table(words.txt,dnn="words")
> # Sort by frequency, not alphabetically
> words.txt<-sort(words.txt,decreasing=TRUE)
> # Put into some readable form, "Name of word" and "Number of times it
> occurs"
> words.txt<-paste(names(words.txt),words.txt,sep="\t")
> # Results to a file
> cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.