[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
Milan Bouchet-Valat
nalimilan at club.fr
Wed Apr 10 10:29:27 CEST 2013
On Wednesday, 10 April 2013 at 10:50 +0300, Ventseslav Kozarev, MPP
wrote:
> Hi,
>
> Thanks for taking the time. Here is a more reproducible example of the
> entire process:
>
> # Creating a vector source - stupid text in the Bulgarian language
> bg<-c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.',
>       'Утре ще бъде още по-хубав ден.')
>
> # Converting strings from the vector source to UTF-8. Without this step
> # in my setup, I don't see Cyrillic letters, even if I set the default
> # code page to CP1251.
> bg<-iconv(bg,to='UTF-8')
>
> # Load the tm library
> library(tm)
>
> # Create the corpus from the vector source
> corp<-Corpus(VectorSource(bg),readerControl=list(language='bulgarian'))
>
> # Create a custom stop list based on the example vector source
> # Converting to UTF-8
> stoplist<-c('е','и','в','който','всички','да','бъдат','навън','ще','бъде','още')
> stoplist<-iconv(stoplist,to='UTF-8')
>
> # Preprocessing
> corp<-tm_map(corp,removePunctuation)
> corp<-tm_map(corp,removeNumbers)
> corp<-tm_map(corp,tolower)
> corp<-tm_map(corp,removeWords,stoplist)
>
> # End of code here
>
> Now, if I run inspect(corp), I still see all the stop words intact
> inside the corpus. I can't figure out why. I tried experimenting with
> file encodings, with and without explicit statements of encoding, and it
> never works. As far as I can tell, my code is not wrong, and the
> function stopwords('language') returns a character vector, so just
> replacing it by a different character vector should do the trick. Alas,
> no list of stop words for Bulgarian language is available as part of the
> tm package (not surprisingly).
>
> In the above example, I also tried to read in the list of stop words
> from a file using the scan function, per the example in my original
> message. It also fails to remove stop words, without any warnings or
> error messages.
>
> An alternative I tried was to convert to a term-document matrix, and
> then loop through the words inside and remove those that are also on the
> stop list. That only partially works for two reasons. The TDM is
> actually a list, and I am not sure what code I need to use if I delete
> words, but do not update the underlying indices. And second, some of the
> words still don't get removed even though they are in the list. But that
> is another issue altogether...
>
> Thanks for your attention and for your help!
> Vince
Thanks for the reproducible example. Indeed, it does not work here
either (Linux with UTF-8 locale). The problem seems to be in the call to
gsub() in removeWords: with perl=TRUE, the word-boundary pattern "\\b"
does not match around these words. With perl=FALSE, it works:
gsub("днес", "", "днес е хубав")
# [1] " е хубав"
gsub("днес", "", "днес е хубав", perl=TRUE)
# [1] " е хубав"
gsub("\\bднес\\b", "", "днес е хубав")
# [1] " е хубав"
gsub("\\bднес\\b", "", "днес е хубав", perl=TRUE)
# [1] "днес е хубав"
It looks like some non-ASCII characters like é or € are supported, but
not others like œ or the Cyrillic characters you provided.
For a temporary solution, you can define this function to replace the
one provided by tm:
removeWords.PlainTextDocument <- function (x, words)
    gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "", x)
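To see what this replacement does without loading tm, here is a minimal base R sketch. It assumes the strings are UTF-8 in a UTF-8 locale (as on my machine), and it uses a shortened, already lower-cased version of your toy data; the variable names are only for illustration.

```r
# Toy data adapted from the example above (assumed UTF-8, already lower case).
bg <- c("днес е хубав и слънчев ден", "утре ще бъде още по-хубав ден")
stoplist <- c("е", "и", "ще", "бъде", "още")

# Same construction as in the replacement function: all stop words joined
# into one alternation, wrapped in word boundaries.
pattern <- sprintf("\\b(%s)\\b", paste(stoplist, collapse = "|"))
print(pattern)

# Default regex engine (perl = FALSE), as in the workaround.
cleaned <- gsub(pattern, "", bg)
print(cleaned)
```

The removed words leave extra spaces behind, so a later tm_map(corp, stripWhitespace) step is still useful.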
I have CCed tm's developer, Ingo Feinerer, to see if he has an idea to
fix the problem in tm; but this looks like a bug in R (or in perl
regexps).
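One possible explanation, going by the ?regex help page: with perl=TRUE, \w (and therefore \b) appears to consider only ASCII characters by default, and the pattern can be prefixed with (*UCP) to request Unicode-aware classes. The sketch below is an untested idea along those lines, not what tm does; remove_words_ucp is a hypothetical helper name, and it assumes R's PCRE was built with Unicode property support.

```r
# Hypothetical helper (not part of tm): same idea as removeWords, but
# keeping perl = TRUE and asking PCRE for Unicode-aware \w and \b via (*UCP).
remove_words_ucp <- function(x, words) {
  pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Inputs are converted to (and marked as) UTF-8 so that gsub() runs PCRE
# in UTF-8 mode.
text <- enc2utf8("днес е хубав")
print(remove_words_ucp(text, enc2utf8("днес")))
```

If this behaves as the help page suggests, it would keep the perl=TRUE engine while still matching word boundaries around Cyrillic words.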
Regards
> On 9.4.2013 at 22:55, Milan Bouchet-Valat wrote:
> > On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
> >> Hi,
> >>
> >> I bumped into a serious issue while trying to analyse some texts in
> >> Bulgarian language (with the tm package). I import a tab-separated csv
> >> file, which holds a total of 22 variables, most of which are text cells
> >> (not factors), using the read.delim function:
> >>
> >> data<-read.delim("bigcompanies_ascii.csv",
> >> header=TRUE,
> >> quote="'",
> >> sep="\t",
> >> as.is=TRUE,
> >> encoding='CP1251',
> >> fileEncoding='CP1251')
> >>
> >> (I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
> >>
> >> I have my list of stop words written in a separate text file, one word
> >> per line, which I read into R using the scan function:
> >>
> >> stoplist<-scan(file='stoplist_ascii.txt',
> >> what='character',
> >> strip.white=TRUE,
> >> blank.lines.skip=TRUE,
> >> fileEncoding='CP1251',
> >> encoding='CP1251')
> >>
> >> (also tried with UTF-8 here on a correspondingly encoded file)
> >>
> >> I currently only test with a corpus based on the contents of just one
> >> variable, and I construct the corpus from a VectorSource. When I run
> >> inspect, all seems fine and I can see the text properly, with unicode
> >> characters present:
> >>
> >> data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
> >> readerControl=list(language='bulgarian'))
> >>
> >> However, no matter what I do - like which encoding I select - UTF-8 or
> >> CP1251, which is the typical code page for Bulgarian texts, I cannot get
> >> to remove the stop words from my corpus. The issue is present in both
> >> Linux and Windows, and across the computers I use R on, and I don't
> >> think it is related to bad configuration. Removal of punctuation, white
> >> space, and numbers is flawless, but the inability to remove stop words
> >> prevents me from further analysing the texts.
> >>
> >> Has somebody had experience with languages other than English, and for
> >> which there is no predefined stop list available through the stopwords
> >> function? I will highly appreciate any tips and advice!
> > Well, at least show us the code that you use to remove stopwords... Can
> > you provide a reproducible example with a toy corpus?
> >
> >> Thanks in advance,
> >> Vince