[R-SIG-Mac] StringIndexOutOfBoundsException in RWeka
Richard R. Liu
richard.liu at pueo-owl.ch
Tue Jan 12 18:22:25 CET 2010
I have narrowed the problem down to this:
> NGramTokenizer("-", control = wctrl)
Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize", .jcast(tokenizer, :
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
Indeed, the 21226th sentence contains a segment composed of a single hyphen. I am using the
default delimiters of the WEKA control. The hyphen is thus not a delimiter. A segment consisting of
two consecutive hyphens ("--") does not cause the exception.
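
One way to circumvent the exception, and to find every offending segment in one pass, might be to wrap each tokenizer call in tryCatch so that a failure is recorded instead of aborting the whole lapply. The following is only a sketch in base R: safe_tokenize and fake_tokenizer are hypothetical names made up for illustration, and the stand-in tokenizer merely imitates the reported failure so that the example runs without Java or RWeka.

```r
# Hypothetical helper: apply `tokenizer` to each segment of one sentence,
# returning NULL for any segment on which the tokenizer signals an error.
safe_tokenize <- function(segments, tokenizer, ...) {
  lapply(segments, function(seg) {
    tryCatch(tokenizer(seg, ...), error = function(e) NULL)
  })
}

# Stand-in tokenizer (an assumption, not RWeka code): it fails on a lone
# hyphen, as reported above, and otherwise splits on spaces.
fake_tokenizer <- function(x, ...) {
  if (identical(x, "-")) stop("String index out of range: 1")
  strsplit(x, " +")[[1]]
}

res <- safe_tokenize(c("in dpm", "-", "--"), fake_tokenizer)

# Positions of the segments that could not be tokenized
failed <- which(vapply(res, is.null, logical(1)))
```

With RWeka loaded, the wrapper would presumably be used as lapply(lseg, safe_tokenize, tokenizer = NGramTokenizer, control = wctrl). This assumes that the Java exception reaches R as an ordinary error condition. Note also that the result then holds one list element per segment rather than one character vector per sentence.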
On Tue, 12 Jan 2010 16:50:16 +0100, Richard R. Liu wrote
> I am running R version 2.10.1 Patched (2010-01-07 r50940) in 64-bit
> mode under Mac OS X 10.5.8 on a MacBook Pro with 8GB RAM.
> I am encountering the following error in RWeka:
> Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize",
> .jcast(tokenizer, : java.lang.StringIndexOutOfBoundsException:
> String index out of range: 1
> Here is the code that is causing the problem:
> > library(rJava)
> > (.jinit(parameters = "-Xmx3000m"))
> > library(RWeka)
> > wctrl <- Weka_control(min = 1, max = 4)
> > lseg.4gram <- lapply(lseg, NGramTokenizer, control = wctrl)
> lseg is a list of 965193 sentences, each of which consists of one or
> more segments. For example, lseg[] is
>  "calculation of results xxxx activity is defined as the increase
> in radioactivity "  "in dpm"
>  "in the pellet "
>  "xxx"
>  ""
>  "caused by the addition of xx xxxx"
> lapply should build 1-, 2-, 3- and 4-grams of each sentence segment.
> Is there any way to solve or circumvent the error? In Java
> Preferences on the Mac I have specified for applications Java SE 6
> 64-bit, then J2SE 5.0 64-bit, before other 32-bit versions.
> (Side remark: I'm surprised that it only does this for the first
> and last segments of the first sentence. Admittedly, the other
> segments have fewer than 4 grams, but that should not stop it from
> building n-grams consisting of fewer grams!)
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Tel.: +41 61 331 10 47
> Email: richard.liu at pueo-owl.ch
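
The behaviour expected in the quoted message (1- to 4-grams for each segment, with shorter segments still yielding the n-grams they can support) can be sketched in base R without Java. word_ngrams is a hypothetical helper written for illustration. It is not the WEKA implementation, it splits on whitespace only, and the order of its output need not match that of NGramTokenizer.

```r
# Hypothetical base-R helper: all word n-grams of sizes min_n..max_n
# (assumes min_n <= max_n) from a single segment.
word_ngrams <- function(segment, min_n = 1, max_n = 4) {
  words <- strsplit(trimws(segment), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in seq(min_n, max_n)) {
    # Stop once the segment has fewer words than the requested n-gram size
    if (n > length(words)) break
    starts <- seq_len(length(words) - n + 1)
    out <- c(out, vapply(starts,
                         function(i) paste(words[i:(i + n - 1)], collapse = " "),
                         character(1)))
  }
  out
}

word_ngrams("in the pellet ")  # three 1-grams, two 2-grams, one 3-gram
word_ngrams("")                # empty segment: no n-grams
word_ngrams("-")               # a lone hyphen is returned as a single 1-gram
```

A helper of this kind could also serve as a fallback for the segments on which the Java tokenizer fails.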