[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
nalimilan at club.fr
Tue Apr 9 21:55:42 CEST 2013
On Tuesday, 09 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
> I bumped into a serious issue while trying to analyse some texts in
> Bulgarian (with the tm package). I import a tab-separated CSV
> file, which holds a total of 22 variables, most of which are text columns
> (not factors), using the read.delim function:
> (I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
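The original code did not survive in the archive; a minimal sketch of what such an import call might look like, where the file name, column layout, and Cyrillic sample content are all invented for illustration:

```r
# Hypothetical import step: write a small tab-separated file and read it
# back with read.delim(); the columns and their contents are made-up examples.
path <- tempfile(fileext = ".csv")
writeLines(c("id\ttext",
             "1\tтова е примерен текст"),
           path, useBytes = TRUE)
dat <- read.delim(path, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
```

With `fileEncoding = "UTF-8"` the connection is re-encoded on read, which is the usual way to make read.delim robust to the file's encoding.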
> I have my list of stop words written in a separate text file, one word
> per line, which I read into R using the scan function:
> (also tried with UTF-8 here on a correspondingly encoded file)
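Again, the actual call is missing from the archive; a sketch, using an invented file of three common Bulgarian stop words:

```r
# Hypothetical stopword file, one word per line (the words are examples).
sw_path <- tempfile(fileext = ".txt")
writeLines(c("и", "на", "за"), sw_path, useBytes = TRUE)

# scan() with what = character() reads one token per whitespace-separated
# field, so a one-word-per-line file yields a character vector of words.
stopwords_bg <- scan(sw_path, what = character(),
                     encoding = "UTF-8", quiet = TRUE)
```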
> I currently only test with a corpus based on the contents of just one
> variable, and I construct the corpus from a VectorSource. When I run
> inspect, all seems fine and I can see the text properly, with unicode
> characters present:
> However, no matter which encoding I select, whether UTF-8 or
> CP1251 (the typical code page for Bulgarian texts), I cannot
> remove the stop words from my corpus. The issue occurs on both
> Linux and Windows, and across the computers I run R on, so I don't
> think it is caused by a bad configuration. Removal of punctuation, white
> space, and numbers works flawlessly, but the inability to remove stop
> words prevents me from analysing the texts any further.
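For what it's worth, two frequent causes of this exact symptom are (a) the regular expression engine not treating Cyrillic letters as word characters, so a word-boundary pattern silently never matches, and (b) the corpus and the stop word vector ending up in different encodings. A base-R sketch with invented sample text; this mimics the whole-word gsub approach that recent tm versions use in removeWords, but is not tm's code verbatim:

```r
# Invented Bulgarian sample text and stop words, for illustration only.
txt <- "това е примерен текст на български"
sw  <- c("е", "на")

# Whole-word removal with Unicode-aware word boundaries: without the
# (*UCP) flag, PCRE's \b only recognises ASCII word characters, so the
# pattern silently fails to match Cyrillic words.
pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(sw, collapse = "|"))
cleaned <- gsub(pattern, "", txt, perl = TRUE)

# An encoding mismatch has the same silent-failure symptom; normalising
# both sides with enc2utf8() before matching rules that out.
sw_utf8  <- enc2utf8(sw)
txt_utf8 <- enc2utf8(txt)
```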
> Has anybody had experience with languages other than English for
> which no predefined stop list is available through the stopwords
> function? I would highly appreciate any tips and advice!
Well, at least show us the code that you use to remove stopwords... Can
you provide a reproducible example with a toy corpus?
> Thanks in advance,
> R-help at r-project.org mailing list
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.