[R-SIG-Mac] StringIndexOutOfBoundsException in RWeka
Kurt Hornik
Kurt.Hornik at wu.ac.at
Tue Jan 12 19:20:45 CET 2010
>>>>> Richard R Liu writes:
> I have narrowed the problem down to this:
>> NGramTokenizer("-", control = wctrl)
> Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize", .jcast(tokenizer, :
> java.lang.StringIndexOutOfBoundsException: String index out of range: 1
> Indeed, the 21226th sentence contains a segment composed of a single
> hyphen. I am using the default delimiters of the WEKA control. The
> hyphen is thus not a delimiter. A segment consisting of two
> consecutive hyphens ("--") does not cause the exception.
Thanks. This seems to be a bug in Weka itself, so there is not really a
lot I can do: perhaps you can report the problem to the upstream
maintainers?
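
In the meantime, one possible workaround is to keep the offending input
away from the tokenizer. The following is only a sketch under stated
assumptions: drop_lone_hyphens() and tokenize_safely() are hypothetical
helpers, not part of RWeka, and the only input known from this thread to
trigger the exception is a segment consisting of a single hyphen. Treating
a hyphen surrounded by whitespace the same way is an assumption.

```r
# Hypothetical helper, not part of RWeka: drop segments that consist of a
# single hyphen (optionally surrounded by whitespace; that part is an
# assumption). A double hyphen ("--") is kept, since it was reported not
# to cause the exception.
drop_lone_hyphens <- function(segments) {
  segments[!grepl("^[[:space:]]*-[[:space:]]*$", segments)]
}

# Hypothetical wrapper: call `tokenizer` on each segment separately and
# return character(0) for a segment on which the call signals an error,
# instead of aborting the whole run.
tokenize_safely <- function(segments, tokenizer, ...) {
  lapply(segments, function(s) {
    tryCatch(tokenizer(s, ...), error = function(e) character(0))
  })
}

# Intended use with the objects from the original post (needs RWeka):
# lseg.clean <- lapply(lseg, drop_lone_hyphens)
# lseg.4gram <- lapply(lseg.clean, NGramTokenizer, control = wctrl)
```

Filtering first is cheaper than catching the error for each of the
965193 sentences; the tryCatch() wrapper is the fallback if other inputs
turn out to trigger the same exception.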
Best
-k
> Regards,
> Richard
> On Tue, 12 Jan 2010 16:50:16 +0100, Richard R. Liu wrote
>> I am running R version 2.10.1 Patched (2010-01-07 r50940) in 64-bit
>> mode under Mac OS X 10.5.8 on a MacBook Pro with 8GB RAM.
>>
>> I am encountering the following error in RWeka:
>>
>> Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize",
>> .jcast(tokenizer, : java.lang.StringIndexOutOfBoundsException:
>> String index out of range: 1
>>
>> Here is the code that is causing the problem:
>>
>> > library(rJava)
>> > (.jinit(parameters = "-Xmx3000m"))
>> > library(RWeka)
>> > wctrl <- Weka_control(min = 1, max = 4)
>> > lseg.4gram <- lapply(lseg, NGramTokenizer, control = wctrl)
>>
>> lseg is a list of 965193 sentences, each of which consists of one or
>> more segments. For example, lseg[[1]] is
>>
>> [[1]]
>> [1] "calculation of results xxxx activity is defined as the increase in radioactivity "
>> [2] "in dpm"
>> [3] "in the pellet "
>> [4] "xxx"
>> [5] ""
>> [6] "caused by the addition of xx xxxx"
>>
>> lapply should build 1-, 2-, 3- and 4-grams of each sentence segment.
>> Is there any way to solve or circumvent the error? In Java
>> Preferences on the Mac I have specified for applications Java SE 6
>> 64-bit, then J2SE 5.0 64-bit, before other 32-bit versions.
>>
>> (Side remark: I'm surprised that it only does this for the first
>> and last segments of the first sentence. Admittedly, the other
>> segments have fewer than 4 grams, but that should not stop it from
>> building n-grams consisting of fewer grams!)
>>
>> Thanks,
>> Richard
>> ------
>> Richard R. Liu
>> Dittingerstr. 33
>> CH-4053 Basel
>> Switzerland
>>
>> Tel.: +41 61 331 10 47
>> Email: richard.liu at pueo-owl.ch
> --
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Switzerland
> Tel.: +41 61 331 10 47
> Email: richard.liu at pueo-owl.ch