[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text

Wed Apr 10 10:42:26 CEST 2013

Thank you so much! You made it look (almost) so easy. I greatly 
appreciate it!

On 10.4.2013 at 11:29, Milan Bouchet-Valat wrote:
> On Wednesday, 10 April 2013 at 10:50 +0300, Ventseslav Kozarev, MPP
> wrote:
>> Hi,
>>
>> Thanks for taking the time. Here is a more reproducible example of the
>> entire process:
>>
>> # Creating a vector source - stupid text in the Bulgarian language
>> bg<-c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.',
>>       'Утре ще бъде още по-хубав ден.')
>>
>> # Converting strings from the vector source to UTF-8. Without this step
>> # in my setup, I don't see Cyrillic letters, even if I set the default
>> # code page to CP1251.
>> bg<-iconv(bg,to='UTF-8')
>>
>> # Load the tm library
>> library(tm)
>>
>> # Create the corpus from the vector source
>> corp<-Corpus(VectorSource(bg),readerControl=list(language='bulgarian'))
>>
>> # Create a custom stop list based on the example vector source
>> # Converting to UTF-8
>> stoplist<-c('е','и','в','който','всички','да','бъдат','навън','ще','бъде','още')
>> stoplist<-iconv(stoplist,to='UTF-8')
>>
>> # Preprocessing
>> corp<-tm_map(corp,removePunctuation)
>> corp<-tm_map(corp,removeNumbers)
>> corp<-tm_map(corp,tolower)
>> corp<-tm_map(corp,removeWords,stoplist)
>>
>> # End of code here
>>
>> Now, if I run inspect(corp), I still see all the stop words intact
>> inside the corpus. I can't figure out why. I tried experimenting with
>> file encodings, with and without explicit statements of encoding, and it
>> never works. As far as I can tell, my code is not wrong, and the
>> function stopwords('language') returns a character vector, so just
>> replacing it by a different character vector should do the trick. Alas,
>> no list of stop words for the Bulgarian language is available as part of the
>> tm package (not surprisingly).
>>
>> In the above example, I also tried to read in the list of stop words
>> from a file using the scan function, per the example in my original
>> message. It also fails to remove stop words, without any warnings or
>> error messages.
>>
>> An alternative I tried was to convert to a term-document matrix, and
>> then loop through the words inside and remove those that are also on the
>> stop list. That only partially works for two reasons. The TDM is
>> actually a list, and I am not sure what code I need to use if I delete
>> words, but do not update the underlying indices. And second, some of the
>> words still don't get removed even though they are in the list. But that
>> is another issue altogether...
>>
>> Thanks for your attention and for your help!
>> Vince
> Thanks for the reproducible example. Indeed, it does not work here
> either (Linux with UTF-8 locale). The problem seems to be in the call to
> gsub() in removeWords: the pattern "\\b" does not match anything when
> perl=TRUE. With perl=FALSE, it works.
>
> gsub("днес", "", "днес е хубав")
> # [1] " е хубав"
> gsub("днес", "", "днес е хубав", perl=TRUE)
> # [1] " е хубав"
> gsub("\\bднес\\b", "", "днес е хубав")
> # [1] " е хубав"
> gsub("\\bднес\\b", "", "днес е хубав", perl=TRUE)
> # [1] "днес е хубав"
>
> It looks like some non-ASCII characters like é or € are supported, but
> not others like œ or the Cyrillic characters you provided.
>
> For a temporary solution, you can define this function to replace the
> one provided by tm:
> removeWords.PlainTextDocument <- function (x, words)
>      gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "", x)
>
> I have CCed tm's developer, Ingo Feinerer, to see if he has an idea to
> fix the problem in tm; but this looks like a bug in R (or in perl
> regexps).
>
>
> Regards
>
>> On 9.4.2013 at 22:55, Milan Bouchet-Valat wrote:
>>> On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
>>>> Hi,
>>>>
>>>> I ran into a serious issue while trying to analyse some texts in the
>>>> Bulgarian language (with the tm package). I import a tab-separated csv
>>>> file, which holds a total of 22 variables, most of which are text cells
>>>> (not factors), using the read.delim function:
>>>>
>>>> data<-read.delim("bigcompanies_ascii.csv",
>>>>                    header=TRUE,
>>>>                    quote="'",
>>>>                    sep="\t",
>>>>                    as.is=TRUE,
>>>>                    encoding='CP1251',
>>>>                    fileEncoding='CP1251')
>>>>
>>>> (I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
>>>>
>>>> I have my list of stop words written in a separate text file, one word
>>>> per line, which I read into R using the scan function:
>>>>
>>>> stoplist<-scan(file='stoplist_ascii.txt',
>>>>                   what='character',
>>>>                   strip.white=TRUE,
>>>>                   blank.lines.skip=TRUE,
>>>>                   fileEncoding='CP1251',
>>>>                   encoding='CP1251')
>>>>
>>>> (also tried with UTF-8 here on a correspondingly encoded file)
>>>>
>>>> I currently only test with a corpus based on the contents of just one
>>>> variable, and I construct the corpus from a VectorSource. When I run
>>>> inspect, all seems fine and I can see the text properly, with unicode
>>>> characters present:
>>>>
>>>> data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
>>>>                       readerControl=list(language='bulgarian'))
>>>>
>>>> However, no matter which encoding I select (UTF-8 or CP1251, the
>>>> typical code page for Bulgarian texts), I cannot remove the stop words
>>>> from my corpus. The issue occurs on both Linux and Windows, on every
>>>> computer I use R on, so I don't think it is caused by a bad
>>>> configuration. Removal of punctuation, white space, and numbers works
>>>> flawlessly, but the inability to remove stop words prevents me from
>>>> analysing the texts further.
>>>>
>>>> Has anybody worked with languages other than English for which no
>>>> predefined stop list is available through the stopwords function? I
>>>> would greatly appreciate any tips and advice!
>>> Well, at least show us the code that you use to remove stopwords... Can
>>> you provide a reproducible example with a toy corpus?
>>>
>>>> Thanks in advance,
>>>> Vince
>>>>
>>>
>
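
For readers who run into the same problem, here is a minimal sketch of two
possible workarounds in base R. Neither is tm's own code, and the function
names remove_words_ucp and remove_words_tokens are hypothetical. The first
assumes that R's PCRE build supports Unicode properties (pcre_config() should
report this) and that the input is UTF-8; under those assumptions, prefixing
the pattern with (*UCP) should let "\\b" treat Cyrillic letters as word
characters even with perl=TRUE. The second avoids regular-expression word
boundaries altogether: it splits on whitespace and drops exact matches, and it
assumes punctuation has already been removed and the text lowercased.

```r
# Hypothetical helper, not part of tm: request Unicode-aware \b via (*UCP).
# Assumes pcre_config()["Unicode properties"] is TRUE and x is UTF-8.
# Stop words are pasted into the pattern as-is, so they must not contain
# regular-expression metacharacters.
remove_words_ucp <- function(x, words) {
  pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Hypothetical helper, not part of tm: no regex word boundaries at all.
# Splits each string on whitespace and keeps only the tokens that are
# not in the stop list, then joins the survivors with single spaces.
remove_words_tokens <- function(x, words) {
  tokens <- strsplit(x, "[[:space:]]+")
  vapply(tokens, function(tok) {
    paste(tok[nzchar(tok) & !(tok %in% words)], collapse = " ")
  }, character(1))
}

# Toy input modelled on the example in this thread (already lowercased,
# punctuation already removed).
bg <- c("днес е хубав и слънчев ден", "утре ще бъде още по-хубав ден")
stoplist <- c("е", "и", "ще", "бъде", "още")
cleaned <- remove_words_tokens(bg, stoplist)
```

The token-based variant also collapses runs of whitespace, which the gsub()
variants do not; whether that matters depends on the later processing steps.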