[R-SIG-Mac] StringIndexOutOfBoundsException in RWeka

Richard R. Liu richard.liu at pueo-owl.ch
Tue Jan 12 18:14:05 CET 2010


I have narrowed the problem down to this:

NGramTokenizer("-", control = Weka_control(min = 1, max = 4))

The string "-" actually occurs as the fourth segment of the 21,226th sentence.  I find the error strange,
since I am using the default delimiters ' \r\n\t.,;:'"()?!', which do not contain a hyphen.
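
Until the cause is clear, one way to circumvent the error might be to wrap the tokenizer call so that a single failing segment does not abort the whole lapply run. The following is only a sketch in plain R: safe_tokenize and toy_tokenizer are hypothetical names, and toy_tokenizer is a stand-in that fails on "-" so the example runs without RWeka. With RWeka loaded, NGramTokenizer and control = wctrl would be passed instead, assuming the Java exception reaches R as an ordinary error condition.

```r
# Hypothetical helper: call `tokenizer` on one segment and, if it throws,
# return an empty character vector flagged as failed instead of stopping.
safe_tokenize <- function(x, tokenizer, ...) {
  tryCatch(
    tokenizer(x, ...),
    error = function(e) {
      structure(character(0), failed = TRUE, message = conditionMessage(e))
    }
  )
}

# Stand-in tokenizer for demonstration only (NOT RWeka): it fails on "-"
# to mimic the reported exception and otherwise splits on single spaces.
toy_tokenizer <- function(x, ...) {
  if (identical(x, "-")) stop("String index out of range: 1")
  strsplit(x, " ", fixed = TRUE)[[1]]
}

segs <- c("in dpm", "-", "in the pellet ")
res <- lapply(segs, safe_tokenize, tokenizer = toy_tokenizer)

# Logical index of the segments whose tokenization failed.
failed <- vapply(res, function(r) isTRUE(attr(r, "failed")), logical(1))
segs[failed]
```

The failed index also shows which inputs trigger the exception, which could help to check whether segments other than "-" are affected.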

Regards,
Richard

On Tue, 12 Jan 2010 16:50:16 +0100, Richard R. Liu wrote:
> I am running R version 2.10.1 Patched (2010-01-07 r50940) in 64-bit 
> mode under Mac OS X 10.5.8 on a MacBook Pro with 8GB RAM.
> 
> I am encountering the following error in RWeka:
> 
> Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize", .jcast(tokenizer,  :
>   java.lang.StringIndexOutOfBoundsException: String index out of range: 1
> 
> Here is the code that is causing the problem:
> 
> > library(rJava)
> > (.jinit(parameters = "-Xmx3000m"))
> > library(RWeka)
> > wctrl <- Weka_control(min = 1, max = 4) 
> > lseg.4gram <- lapply(lseg, NGramTokenizer, control = wctrl)
> 
> lseg is a list of 965193 sentences, each of which consists of one or 
> more segments.  For example, lseg[[1]] is
> 
> [[1]]
> [1] "calculation of results xxxx activity is defined as the increase in radioactivity "
> [2] "in dpm"
> [3] "in the pellet "
> [4] "xxx"
> [5] ""
> [6] "caused by the addition of xx xxxx"
> 
> lapply should build 1-, 2-, 3- and 4-grams of each sentence segment.
> Is there any way to solve or circumvent the error?  In Java
> Preferences on the Mac I have specified for applications Java SE 6
> 64-bit, then J2SE 5.0 64-bit, before other 32-bit versions.
> 
> (Side remark:  I'm surprised that it only builds n-grams for the first
> and last segments of the first sentence.  Admittedly, the other
> segments have fewer than 4 words, but that should not stop it from
> building the shorter n-grams!)
> 
> Thanks,
> Richard
> ------
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Switzerland
> 
> Tel.:  +41 61 331 10 47
> Email:  richard.liu at pueo-owl.ch




