[R] R 2.10.0: Error in gsub/calloc
Richard R. Liu
richard.liu at pueo-owl.ch
Fri Nov 6 07:43:05 CET 2009
Bert,
Thanks for the tip. Yes, strsplit works, and works fast! For me,
whitespace tokenization means splitting at the whitespace, so the
"^" and the outermost square brackets can be omitted.
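For the record, a minimal sketch of the corrected call (the sample strings are illustrative, not from the original data):

```r
# Splitting at runs of whitespace returns the tokens themselves,
# rather than the whitespace between them.
sentences <- c("A real sentence here.", "xx xdfg; *&^%kk ")
tokens <- strsplit(sentences, "[[:space:]]+")
tokens[[1]]
# [1] "A"        "real"     "sentence" "here."
```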
Regards ... from Basel to South San Francisco,
Richard
On Nov 3, 2009, at 22:03 , Bert Gunter wrote:
> Try:
>
> tokens <- strsplit(d,"[^[:space:]]+")
>
> This splits each "sentence" in your vector into a vector of the groups
> of whitespace characters that you can then play with as you described.
> (The result is a list of such vectors -- see ?strsplit.)
>
> ## example:
>
>> x <- "xx xdfg; *&^%kk "
>
>> strsplit(x,"[^[:blank:]]+")
> [[1]]
> [1] "" " " " " " "
>
>
> You might have to use perl = TRUE and "\\w+", depending on your locale
> and what "[:space:]" matches there.
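A hedged illustration of that perl = TRUE variant (the string is taken from the example above):

```r
# With perl = TRUE and "\\w+", splitting occurs at runs of word
# characters, so every returned piece consists only of non-word
# characters (whitespace and punctuation).
x <- "xx xdfg; *&^%kk "
pieces <- strsplit(x, "\\w+", perl = TRUE)[[1]]
all(!grepl("\\w", pieces, perl = TRUE))
# [1] TRUE
```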
>
> If this works, it should be way faster than strapply() and should not
> have any memory allocation issues either.
>
> HTH.
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Richard R. Liu
> Sent: Tuesday, November 03, 2009 11:32 AM
> To: Uwe Ligges
> Cc: r-help at r-project.org
> Subject: Re: [R] R 2.10.0: Error in gsub/calloc
>
> I apologize for not being clear. d is a character vector of length
> 158908. Each element of the vector has been designated by sentDetect
> (package: openNLP) as a sentence. Some of these are real sentences;
> others are merely groups of meaningless characters separated by white
> space. strapply is a function in the package gsubfn. To each element
> of its first argument it applies the regular expression (second
> argument); every match is then passed to the designated function
> (third argument, in my case missing, hence the identity function).
> Thus, with strapply I am simply performing a white-space tokenization
> of each sentence, in the hope of being able to distinguish true
> sentences from false ones on the basis of mean token length, maximum
> token length, or the like.
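[A sketch of the screening idea described above, using strsplit rather than strapply; the toy vector and the cutoff of 2 are illustrative assumptions, not from the original post:]

```r
# Tokenize each "sentence" at whitespace, then compute per-sentence
# token-length statistics to separate real sentences from noise.
d <- c("This is a genuine sentence.", "x y z q w")  # toy stand-in
tokens   <- strsplit(d, "[[:space:]]+")
mean_len <- sapply(tokens, function(t) mean(nchar(t)))
max_len  <- sapply(tokens, function(t) max(nchar(t)))
# Elements whose tokens are uniformly very short are likely noise;
# the cutoff below is an arbitrary placeholder.
likely_real <- mean_len > 2
likely_real
# [1]  TRUE FALSE
```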
>
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Switzerland
>
> Tel.: +41 61 331 10 47
> Email: richard.liu at pueo-owl.ch
>
>
> On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:
>
>>
>>
>> richard.liu at pueo-owl.ch wrote:
>>> I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think
>>> this is a Mac-specific problem. I have a very large (158,908 possible
>>> sentences, ca. 58 MB) plain text document d which I am trying to
>>> tokenize: t <- strapply(d, "\\w+", perl = T). I am encountering the
>>> following error:
>>
>>
>> What is strapply() and what is d?
>>
>> Uwe Ligges
>>
>>
>>
>>
>>> Error in base::gsub(pattern, rs, x, ...) :
>>> Calloc could not allocate (-1398215180 of 1) memory
>>> This happens regardless of whether I run in 32- or 64-bit mode. The
>>> machine has 8 GB of RAM, so I can hardly believe that RAM is the
>>> problem.
>>> Thanks,
>>> Richard
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>