[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
Milan Bouchet-Valat
nalimilan at club.fr
Tue Apr 9 21:55:42 CEST 2013
On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
> Hi,
>
> I ran into a serious issue while trying to analyse some texts in
> Bulgarian (with the tm package). I import a tab-separated csv
> file, which holds a total of 22 variables, most of which are text cells
> (not factors), using the read.delim function:
>
> data <- read.delim("bigcompanies_ascii.csv",
>                    header = TRUE,
>                    quote = "'",
>                    sep = "\t",
>                    as.is = TRUE,
>                    encoding = 'CP1251',
>                    fileEncoding = 'CP1251')
>
> (I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
>
> I have my list of stop words written in a separate text file, one word
> per line, which I read into R using the scan function:
>
> stoplist <- scan(file = 'stoplist_ascii.txt',
>                  what = 'character',
>                  strip.white = TRUE,
>                  blank.lines.skip = TRUE,
>                  fileEncoding = 'CP1251',
>                  encoding = 'CP1251')
>
> (also tried with UTF-8 here on a correspondingly encoded file)
>
> I am currently testing with a corpus based on the contents of just one
> variable, constructed from a VectorSource. When I run inspect, all
> seems fine and I can see the text properly, with Unicode characters
> present:
>
> data.corpus <- Corpus(VectorSource(data$variable, encoding = 'UTF-8'),
>                       readerControl = list(language = 'bulgarian'))
>
> However, no matter which encoding I select, UTF-8 or CP1251 (the
> typical code page for Bulgarian texts), I cannot remove the stop words
> from my corpus. The issue is present on both Linux and Windows, across
> all the computers I use R on, and I don't think it is related to bad
> configuration. Removal of punctuation, white space, and numbers is
> flawless, but the inability to remove stop words prevents me from
> analysing the texts any further.
>
> Has anybody had experience with languages other than English for
> which there is no predefined stop list available through the stopwords
> function? I would greatly appreciate any tips and advice!
Well, at least show us the code that you use to remove stopwords... Can
you provide a reproducible example with a toy corpus?
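
For illustration, a reproducible example along these lines might look like the sketch below. The two toy sentences and the four stop words are made up and stand in for data$variable and the stop list file; they are not taken from the original data. The base-R part only checks that the stop words match whole tokens at all, which helps separate an encoding problem from a problem in tm. The tm part assumes the usual tm_map(corpus, removeWords, words) call, since the actual removal code was not posted.

```r
# Toy data (hypothetical; stands in for data$variable and the stop list file).
toy_text <- c("това е тест на текста",
              "фирмата е в София и в Пловдив")
toy_stoplist <- c("е", "на", "в", "и")

# Base-R sanity check: split on spaces and drop tokens found in the stop list.
# If this step already fails, the mismatch is in the encodings, not in tm.
tokens <- strsplit(toy_text, " ", fixed = TRUE)
kept <- lapply(tokens, function(tok) tok[!(tok %in% toy_stoplist)])
base_result <- vapply(kept, paste, character(1), collapse = " ")
print(base_result)

# The tm step under test (assumed call; skipped if tm is not installed).
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::Corpus(tm::VectorSource(toy_text))
  cleaned <- tm::tm_map(corpus, tm::removeWords, toy_stoplist)
  tm::inspect(cleaned)
}

# Locale and package versions, which matter for encoding problems.
print(Sys.getlocale("LC_CTYPE"))
print(sessionInfo())
```

Posting the output of both parts together with the locale and session information makes it possible to see whether the stop words survive only in the tm step or already in the plain string comparison.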
> Thanks in advance,
> Vince