[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
Ventseslav Kozarev, MPP
vinceeval at gmail.com
Tue Apr 9 09:10:26 CEST 2013
Hi,
I bumped into a serious issue while trying to analyse some texts in the
Bulgarian language (with the tm package). I import a tab-separated CSV
file, which holds a total of 22 variables, most of which are character
(not factor) columns, using the read.delim function:
data <- read.delim("bigcompanies_ascii.csv",
                   header = TRUE,
                   quote = "'",
                   sep = "\t",
                   as.is = TRUE,
                   encoding = "CP1251",
                   fileEncoding = "CP1251")
(I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
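For what it is worth, a quick check of what encoding R reports for the
imported strings, using base R's Encoding function (data$variable stands
in for one of the text columns, as used further below):

# Per-element declared encoding: typically "unknown", "UTF-8" or "latin1"
Encoding(head(data$variable))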
I have my list of stop words written in a separate text file, one word
per line, which I read into R using the scan function:
stoplist <- scan(file = "stoplist_ascii.txt",
                 what = "character",
                 strip.white = TRUE,
                 blank.lines.skip = TRUE,
                 fileEncoding = "CP1251",
                 encoding = "CP1251")
(also tried with UTF-8 here on a correspondingly encoded file)
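The same check on the stop list itself (head and Encoding are base R):

# First few stop words and their declared encodings
head(stoplist)
Encoding(head(stoplist))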
At the moment I am only testing with a corpus based on the contents of
just one variable, and I construct the corpus from a VectorSource. When
I run inspect, all seems fine and I can see the text properly, with
Unicode characters present:
data.corpus <- Corpus(VectorSource(data$variable, encoding = "UTF-8"),
                      readerControl = list(language = "bulgarian"))
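(The inspect call is simply the following; the [1:3] subset just keeps
the output short:)

# Print the first few documents to verify the Cyrillic text survived import
inspect(data.corpus[1:3])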
However, no matter which encoding I select (UTF-8 or CP1251, the code
page typically used for Bulgarian texts), I cannot get the stop words
removed from my corpus. The issue is present on both Linux and Windows,
and across the computers I use R on, so I don't think it is related to a
bad configuration. Removal of punctuation, whitespace, and numbers works
flawlessly, but the inability to remove stop words prevents me from
analysing the texts any further.
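For completeness, the removal step I mean is tm's standard removeWords
transformation, applied with the stoplist vector read in above:

# The failing step: strip the stop words from every document in the corpus
data.corpus <- tm_map(data.corpus, removeWords, stoplist)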
Has anybody had experience with languages other than English, for which
there is no predefined stop list available through the stopwords
function? I would highly appreciate any tips and advice!
Thanks in advance,
Vince