[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
Ventseslav Kozarev, MPP
vinceeval at gmail.com
Wed Apr 10 10:42:26 CEST 2013
Thank you so much! You made it look (almost) so easy. I greatly
On 10.4.2013 at 11:29, Milan Bouchet-Valat wrote:
> On Wednesday, 10 April 2013 at 10:50 +0300, Ventseslav Kozarev, MPP wrote:
>> Thanks for taking the time. Here is a more reproducible example of the
>> entire process:
>> # Creating a vector source - stupid text in the Bulgarian language
>> bg <- c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.',
>>         'Утре ще бъде още по-хубав ден.')
>> # Converting strings from the vector source to UTF-8. Without this step
>> # in my setup, I don't see Cyrillic letters, even if I set the default
>> # code page to CP1251.
>> # Load the tm library
>> # Create the corpus from the vector source
>> # Create a custom stop list based on the example vector source
>> # Converting to UTF-8
>> # Preprocessing
>> # End of code here
>> Now, if I run inspect(corp), I still see all the stop words intact
>> inside the corpus. I can't figure out why. I tried experimenting with
>> file encodings, with and without explicit statements of encoding, and it
>> never works. As far as I can tell, my code is not wrong, and the
>> function stopwords('language') returns a character vector, so just
>> replacing it by a different character vector should do the trick. Alas,
>> no list of stop words for Bulgarian language is available as part of the
>> tm package (not surprisingly).
>> In the above example, I also tried to read in the list of stop words
>> from a file using the scan function, per the example in my original
>> message. It also fails to remove stop words, without any warnings or
>> error messages.
>> An alternative I tried was to convert to a term-document matrix, and
>> then loop through the words inside and remove those that are also on the
>> stop list. That only partially works for two reasons. The TDM is
>> actually a list, and I am not sure what code I need to use if I delete
>> words, but do not update the underlying indices. And second, some of the
>> words still don't get removed even though they are in the list. But that
>> is another issue altogether...
>> Thanks for your attention and for your help!
> Thanks for the reproducible example. Indeed, it does not work here
> either (Linux with UTF-8 locale). The problem seems to be in the call to
> gsub() in removeWords: the pattern "\\b" does not match anything when
> perl=TRUE. With perl=FALSE, it works.
> gsub("днес", "", "днес е хубав")
> #  " е хубав"
> gsub("днес", "", "днес е хубав", perl=TRUE)
> #  " е хубав"
> gsub("\\bднес\\b", "", "днес е хубав")
> #  " е хубав"
> gsub("\\bднес\\b", "", "днес е хубав", perl=TRUE)
> #  "днес е хубав"
> It looks like some non-ASCII characters like é or € are supported, but
> not others like œ or the Cyrillic characters you provided.
> For a temporary solution, you can define this function to replace the
> one provided by tm:
> removeWords.PlainTextDocument <- function(x, words)
>     gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "", x)
> I have CCed tm's developer, Ingo Feinerer, to see if he has an idea to
> fix the problem in tm; but this looks like a bug in R (or in perl
>> On 9.4.2013 at 22:55, Milan Bouchet-Valat wrote:
>>> On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
>>>> I bumped into a serious issue while trying to analyse some texts in
>>>> Bulgarian language (with the tm package). I import a tab-separated csv
>>>> file, which holds a total of 22 variables, most of which are text cells
>>>> (not factors), using the read.delim function:
>>>> (I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
>>>> I have my list of stop words written in a separate text file, one word
>>>> per line, which I read into R using the scan function:
>>>> (also tried with UTF-8 here on a correspondingly encoded file)
>>>> I currently only test with a corpus based on the contents of just one
>>>> variable, and I construct the corpus from a VectorSource. When I run
>>>> inspect, all seems fine and I can see the text properly, with unicode
>>>> characters present:
>>>> However, no matter what I do - like which encoding I select - UTF-8 or
>>>> CP1251, which is the typical code page for Bulgarian texts, I cannot get
>>>> to remove the stop words from my corpus. The issue is present in both
>>>> Linux and Windows, and across the computers I use R on, and I don't
>>>> think it is related to bad configuration. Removal of punctuation, white
>>>> space, and numbers is flawless, but the inability to remove stop words
>>>> prevents me from further analysing the texts.
>>>> Has somebody had experience with languages other than English, and for
>>>> which there is no predefined stop list available through the stopwords
>>>> function? I will highly appreciate any tips and advice!
>>> Well, at least show us the code that you use to remove stopwords... Can
>>> you provide a reproducible example with a toy corpus?
>>>> Thanks in advance,
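The thread mentions, as an alternative to the regex-based removeWords, walking through the words and dropping those that appear on the stop list. A minimal base-R sketch of that idea follows. It is not the tm implementation: the helper name remove_stopwords_tokens and the short stop list stop_bg are hypothetical, and the sample text is assumed to be already lower-cased and stripped of punctuation (steps the thread reports as working). Because it compares whole tokens instead of using "\\b", it avoids the word-boundary problem described above.

```r
# Hypothetical helper, not part of tm: drop stop words by exact token match.
remove_stopwords_tokens <- function(x, words) {
  # Split each document on single spaces; fixed = TRUE bypasses the regex engine
  tokens <- strsplit(x, " ", fixed = TRUE)
  # Keep tokens that are non-empty and not on the stop list, then re-join them
  vapply(tokens,
         function(tok) paste(tok[nzchar(tok) & !(tok %in% words)], collapse = " "),
         character(1))
}

# Sample text from the thread, assumed already lower-cased and without punctuation
bg <- c("днес е хубав и слънчев ден в който всички искат да бъдат навън",
        "утре ще бъде още по-хубав ден")

# Illustrative stop list only; a real Bulgarian list would be much longer
stop_bg <- c("е", "и", "в", "да", "ще", "още")

cleaned <- remove_stopwords_tokens(bg, stop_bg)
print(cleaned)
```

The result is a plain character vector, so it can be passed to VectorSource() to build the corpus after the stop words are gone. Tokens are matched exactly, so "по-хубав" is kept even if "по" were on the stop list.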