[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text

Wed Apr 10 09:50:34 CEST 2013

Hi,

Thanks for taking the time. Here is a more reproducible example of the 
entire process:

# Creating a vector source - toy text in the Bulgarian language
bg<-c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.',
      'Утре ще бъде още по-хубав ден.')

# Converting strings from the vector source to UTF-8. Without this step
# in my setup, I don't see Cyrillic letters, even if I set the default
# code page to CP1251.
bg<-iconv(bg,to='UTF-8')

# Load the tm library
library(tm)

# Create the corpus from the vector source
corp<-Corpus(VectorSource(bg),readerControl=list(language='bulgarian'))

# Create a custom stop list based on the example vector source
# Converting to UTF-8
stoplist<-c('е','и','в','който','всички','да','бъдат','навън','ще','бъде','още')
stoplist<-iconv(stoplist,to='UTF-8')

# Preprocessing
corp<-tm_map(corp,removePunctuation)
corp<-tm_map(corp,removeNumbers)
corp<-tm_map(corp,tolower)
corp<-tm_map(corp,removeWords,stoplist)

# End of code here
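
One way to check whether the text and the stop list really end up in the same encoding (the concern behind the iconv calls above) is to inspect the encoding marks and do a round trip. This is only a sketch in base R; the sample strings are made up for illustration. Note that iconv() without a `from` argument converts from the session's native encoding, so its result depends on the locale.

```r
# "ден" written with Unicode escapes so the example does not depend on
# the encoding of the script file, plus a plain ASCII string.
x <- c("\u0434\u0435\u043d", "day")

Encoding(x)              # encoding mark of each element ("unknown" for ASCII)

x_utf8 <- enc2utf8(x)    # convert to UTF-8, honouring any existing mark
validUTF8(x_utf8)        # TRUE for every element that is valid UTF-8

# Round trip through CP1251: if nothing is lost, we get the same strings back.
x_cp   <- iconv(x_utf8, from = "UTF-8",  to = "CP1251")
x_back <- iconv(x_cp,   from = "CP1251", to = "UTF-8")
identical(x_back, x_utf8)
```

If the stop list and the corpus text both pass such a check, a mismatch in encoding is less likely to be the reason the words are not removed.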

Now, if I run inspect(corp), I still see all the stop words intact 
inside the corpus. I can't figure out why. I tried experimenting with 
file encodings, with and without explicit statements of encoding, and it 
never works. As far as I can tell, my code is not wrong, and the 
function stopwords('language') returns a character vector, so just 
replacing it with a different character vector should do the trick. Alas, 
no list of stop words for the Bulgarian language is available as part of 
the tm package (not surprisingly).
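
As a point of comparison, stop words can also be dropped by exact token matching in base R, without any regular-expression word boundaries. The sketch below is not tm's removeWords; `remove_stopwords` is a hypothetical helper, and the sample text assumes punctuation has already been removed and the text lower-cased. If this works where removeWords does not, the way word boundaries are matched for non-ASCII letters would be one suspect, though that is only an assumption here.

```r
# Hypothetical helper (not part of tm): split each document on whitespace
# and keep only the tokens that are not in the stop list.
remove_stopwords <- function(x, stopwords) {
  vapply(strsplit(x, "[[:space:]]+"), function(tokens) {
    paste(tokens[!(tokens %in% stopwords)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Toy documents, already lower-cased and without punctuation.
docs <- c("днес е хубав и слънчев ден",
          "утре ще бъде още похубав ден")
stoplist <- c("е", "и", "ще", "бъде", "още")

cleaned <- remove_stopwords(docs, stoplist)
print(cleaned)
```

Matching whole tokens with %in% compares strings rather than patterns, so it does not depend on how the regex engine classifies Cyrillic letters.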

In the above example, I also tried to read in the list of stop words 
from a file using the scan function, per the example in my original 
message. It also fails to remove stop words, without any warnings or 
error messages.
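
For reading a stop list from a CP1251 file, one option is to read the raw lines and convert them explicitly with iconv(), so the result does not depend on the session locale. This is a sketch only: the temporary file stands in for a real stop-list file, and the three words are just sample data.

```r
# Sample stop words written with Unicode escapes: и, да, ще
words <- c("\u0438", "\u0434\u0430", "\u0449\u0435")

# Create a CP1251-encoded file to read from (stand-in for a real stop list).
path <- tempfile(fileext = ".txt")
writeLines(iconv(words, from = "UTF-8", to = "CP1251"), path, useBytes = TRUE)

# Read the bytes as they are, then convert CP1251 -> UTF-8 explicitly.
raw_lines <- readLines(path, warn = FALSE)
stoplist  <- iconv(raw_lines, from = "CP1251", to = "UTF-8")

# Drop surrounding white space and empty lines.
stoplist <- trimws(stoplist)
stoplist <- stoplist[nzchar(stoplist)]

identical(stoplist, words)   # do we get the original words back?
unlink(path)
```

Checking the result against a few words typed directly in the session is a quick way to confirm that the file was decoded as intended.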

An alternative I tried was to convert to a term-document matrix, and 
then loop through the words inside and remove those that are also on the 
stop list. That only partially works, for two reasons. First, the TDM is 
actually a list, and I am not sure what code I need if I delete words 
but do not update the underlying indices. Second, some of the words 
still don't get removed even though they are in the list. But that 
is another issue altogether...
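
Instead of looping and deleting entries one by one, rows can be dropped in a single subsetting step, which keeps the row names and counts consistent. The sketch below uses a plain base R matrix as a stand-in for a term-document matrix; whether tm's own TermDocumentMatrix object can be subset in exactly the same way is an assumption that would need checking against the tm documentation.

```r
# Toy documents, already lower-cased and without punctuation.
docs <- c("днес е хубав и слънчев ден",
          "утре ще бъде още похубав ден")
stoplist <- c("е", "и", "ще", "бъде", "още")

# Build a small term-by-document count matrix by hand.
tokens <- strsplit(docs, "[[:space:]]+")
terms  <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(tok) as.integer(table(factor(tok, levels = terms))),
              integer(length(terms)))
dimnames(tdm) <- list(Terms = terms, Docs = paste0("doc", seq_along(docs)))

# Drop all stop-word rows at once; no manual index bookkeeping is needed.
keep <- !(rownames(tdm) %in% stoplist)
tdm_clean <- tdm[keep, , drop = FALSE]
print(tdm_clean)
```

Using drop = FALSE keeps the result a matrix even if only one term is left.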

Thanks for your attention and for your help!
Vince

On 9.4.2013 at 22:55, Milan Bouchet-Valat wrote:
> On Tuesday, 09 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
>> Hi,
>>
>> I bumped into a serious issue while trying to analyse some texts in
>> the Bulgarian language (with the tm package). I import a tab-separated csv
>> file, which holds a total of 22 variables, most of which are text cells
>> (not factors), using the read.delim function:
>>
>> data<-read.delim("bigcompanies_ascii.csv",
>>                   header=TRUE,
>>                   quote="'",
>>                   sep="\t",
>>                   as.is=TRUE,
>>                   encoding='CP1251',
>>                   fileEncoding='CP1251')
>>
>> (I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
>>
>> I have my list of stop words written in a separate text file, one word
>> per line, which I read into R using the scan function:
>>
>> stoplist<-scan(file='stoplist_ascii.txt',
>>                  what='character',
>>                  strip.white=TRUE,
>>                  blank.lines.skip=TRUE,
>>                  fileEncoding='CP1251',
>>                  encoding='CP1251')
>>
>> (also tried with UTF-8 here on a correspondingly encoded file)
>>
>> I currently only test with a corpus based on the contents of just one
>> variable, and I construct the corpus from a VectorSource. When I run
>> inspect, all seems fine and I can see the text properly, with unicode
>> characters present:
>>
>> data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
>>                      readerControl=list(language='bulgarian'))
>>
>> However, no matter which encoding I select (UTF-8, or CP1251, which is
>> the typical code page for Bulgarian texts), I cannot remove the stop
>> words from my corpus. The issue is present in both
>> Linux and Windows, and across the computers I use R on, and I don't
>> think it is related to bad configuration. Removal of punctuation, white
>> space, and numbers is flawless, but the inability to remove stop words
>> prevents me from further analysing the texts.
>>
>> Has anybody had experience with languages other than English for
>> which there is no predefined stop list available through the stopwords
>> function? I would greatly appreciate any tips and advice!
> Well, at least show us the code that you use to remove stopwords... Can
> you provide a reproducible example with a toy corpus?
>
>> Thanks in advance,
>> Vince