[R-SIG-Mac] StringIndexOutOfBoundsException in RWeka

Richard R. Liu richard.liu at pueo-owl.ch
Tue Jan 12 16:50:16 CET 2010

I am running R version 2.10.1 Patched (2010-01-07 r50940) in 64-bit mode under Mac OS X 10.5.8 
on a MacBook Pro with 8GB RAM.

I am encountering the following error in RWeka:

Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize", .jcast(tokenizer,  : 
  java.lang.StringIndexOutOfBoundsException: String index out of range: 1

Here is the code that is causing the problem:

> library(rJava)
> (.jinit(parameters = "-Xmx3000m"))
> library(RWeka)
> wctrl <- Weka_control(min = 1, max = 4) 
> lseg.4gram <- lapply(lseg, NGramTokenizer, control = wctrl)

lseg is a list of 965193 sentences, each of which consists of one or more segments.  For example, 
lseg[[1]] is

[1] "calculation of results xxxx activity is defined as the increase in radioactivity "
[2] "in dpm"                                                                           
[3] "in the pellet "                                                                   
[4] "xxx"                                                                              
[5] ""                                                                                 
[6] "caused by the addition of xx xxxx"                                                

The lapply call should build 1-, 2-, 3- and 4-grams for each sentence segment.  Is there any way to 
solve or circumvent the error?  In Java Preferences on the Mac, I have ordered the Java versions for 
applications as Java SE 6 64-bit, then J2SE 5.0 64-bit, ahead of the 32-bit versions.
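
One possible workaround, offered only as a sketch: the post does not establish what triggers the 
exception, so this assumes (unconfirmed) that empty or whitespace-only segments such as 
lseg[[1]][5] are involved.  The helpers clean_segments and safe_tokenize are hypothetical names, 
not part of RWeka.  They trim and drop empty segments, then catch any remaining tokenizer error 
for a single sentence so that one failure does not abort the whole lapply run.

```r
# Hypothetical helper: trim whitespace and drop NA or empty segments.
clean_segments <- function(segments) {
  segments <- gsub("^[[:space:]]+|[[:space:]]+$", "", segments)
  segments[!is.na(segments) & nzchar(segments)]
}

# Hypothetical wrapper: tokenize one sentence; on error, warn and return NA
# instead of stopping. `tokenizer` is any function taking a character vector.
safe_tokenize <- function(segments, tokenizer, ...) {
  segments <- clean_segments(segments)
  if (length(segments) == 0L) return(character(0))
  tryCatch(
    tokenizer(segments, ...),
    error = function(e) {
      warning("tokenizer failed: ", conditionMessage(e))
      NA_character_
    }
  )
}

# Intended use with the objects from the post (requires RWeka, not run here):
# lseg.4gram <- lapply(lseg, safe_tokenize,
#                      tokenizer = NGramTokenizer, control = wctrl)
```

Sentences that still fail come back as NA, which makes it possible to inspect the offending 
input afterwards with something like lseg[sapply(lseg.4gram, function(x) any(is.na(x)))].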

(Side remark:  I'm surprised that it produces n-grams only for the first and last segments of the 
first sentence.  Admittedly, the other segments have fewer than four words, but that should not 
stop it from building the shorter n-grams!)
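
To make the expectation in the side remark concrete, here is a minimal base-R sketch of the 
behaviour described: a segment with fewer than four words still yields all the shorter n-grams.  
The function ngrams and its parameters min_n and max_n are hypothetical and do not reflect how 
RWeka or Weka implements NGramTokenizer; the order of the results may differ from RWeka's.

```r
# Hypothetical base-R n-gram builder; not RWeka's implementation.
ngrams <- function(segment, min_n = 1L, max_n = 4L) {
  # Split on whitespace and discard empty pieces (e.g. from leading spaces).
  words <- strsplit(segment, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in seq(min_n, max_n)) {
    # Stop once n exceeds the number of words available.
    if (n > length(words)) break
    starts <- seq_len(length(words) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1L)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

ngrams("in dpm")   # "in" "dpm" "in dpm"
ngrams("")         # character(0)
```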

Richard R. Liu
Dittingerstr. 33
CH-4053 Basel

Tel.:  +41 61 331 10 47
Email:  richard.liu at pueo-owl.ch
