[R-SIG-Mac] StringIndexOutOfBoundsException in RWeka

Kurt Hornik Kurt.Hornik at wu.ac.at
Tue Jan 12 19:20:45 CET 2010


>>>>> Richard R Liu writes:

> I have narrowed the problem down to this:
>> NGramTokenizer("-", control = wctrl)
> Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize", .jcast(tokenizer,  : 
>   java.lang.StringIndexOutOfBoundsException: String index out of range: 1

> Indeed, the 21226th sentence contains a segment composed of a single
> hyphen.  I am using the default delimiters of the WEKA control.  The
> hyphen is thus not a delimiter.  A segment consisting of two
> consecutive hyphens ("--") does not cause the exception.

Thanks.  This seems to be a bug in Weka itself, so there is not really a
lot I can do: perhaps you can report the problem to the upstream
maintainers?
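
In the meantime, one possible workaround is to drop segments that consist
of a single hyphen before they reach the tokenizer.  A minimal, untested
sketch, assuming lseg is a list of character vectors as in your example;
drop_lone_hyphens is a hypothetical helper name, not part of RWeka, and
lseg_demo is a made-up stand-in for your data:

```r
## Hypothetical helper (not part of RWeka): remove every segment that is
## exactly a single hyphen, the input reported to trigger the exception.
drop_lone_hyphens <- function(segments) {
    segments[!(segments %in% "-")]
}

## Small stand-in for lseg, for illustration only.
lseg_demo <- list(c("in the pellet ", "-", "xxx"),
                  c("--", ""))
lseg_clean <- lapply(lseg_demo, drop_lone_hyphens)

## With RWeka loaded and wctrl defined as in your code, one would then try:
## lseg.4gram <- lapply(lseg_clean, NGramTokenizer, control = wctrl)
```

This only sidesteps the single-hyphen case you identified; other inputs
might still trigger the same problem in Weka.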

Best
-k

> Regards,
> Richard

> On Tue, 12 Jan 2010 16:50:16 +0100, Richard R. Liu wrote
>> I am running R version 2.10.1 Patched (2010-01-07 r50940) in 64-bit 
>> mode under Mac OS X 10.5.8 on a MacBook Pro with 8GB RAM.
>> 
>> I am encountering the following error in RWeka:
>> 
>> Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize", .jcast(tokenizer,  :
>>   java.lang.StringIndexOutOfBoundsException: String index out of range: 1
>> 
>> Here is the code that is causing the problem:
>> 
>> > library(rJava)
>> > (.jinit(parameters = "-Xmx3000m"))
>> > library(RWeka)
>> > wctrl <- Weka_control(min = 1, max = 4) 
>> > lseg.4gram <- lapply(lseg, NGramTokenizer, control = wctrl)
>> 
>> lseg is a list of 965193 sentences, each of which consists of one or 
>> more segments.  For example, lseg[[1]] is
>> 
>> [[1]]
>> [1] "calculation of results xxxx activity is defined as the increase in radioactivity "
>> [2] "in dpm"
>> [3] "in the pellet "
>> [4] "xxx"
>> [5] ""
>> [6] "caused by the addition of xx xxxx"
>> 
>> lapply should build 1-, 2-, 3- and 4-grams of each sentence segment. 
>> Is there any way to solve or circumvent the error?  In Java 
>> Preferences on the Mac I have specified for applications Java SE 6 
>> 64-bit, then J2SE 5.0 64-bit, before other 32-bit versions.
>> 
>> (Side remark:  I'm surprised that it only does this for the first 
>> and last segments of the first sentence.  Admittedly, the other 
>> segments have fewer than 4 grams, but that should not stop it from 
>> building n-grams consisting of fewer grams!)
>> 
>> Thanks,
>> Richard
>> ------
>> Richard R. Liu
>> Dittingerstr. 33
>> CH-4053 Basel
>> Switzerland
>> 
>> Tel.:  +41 61 331 10 47
>> Email:  richard.liu at pueo-owl.ch




