[R] R 2.10.0: Error in gsub/calloc

Gabor Grothendieck ggrothendieck at gmail.com
Fri Nov 6 16:09:45 CET 2009


I will have a look at it this weekend if you can give me sufficient
info to reproduce it. I noticed there was an attachment on one of your
emails and it seems to be some sort of binary file with no
accompanying description.

On Fri, Nov 6, 2009 at 10:01 AM, Richard R. Liu <richard.liu at pueo-owl.ch> wrote:
> Gabor,
>
> What about the error message that I got with strapply?  That seemed to be the
> same kind of problem (i.e., integer overflow of index) as with gsub.
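
A quick arithmetic check on the overflow idea, as a sketch only. It assumes (this is not confirmed anywhere in the thread) that the allocation size was held in a signed 32-bit integer and wrapped around exactly once before reaching Calloc. Under that assumption the size originally requested can be recovered from the negative number in the error message:

```r
# Assumption: the size was computed in a signed 32-bit integer and
# wrapped around exactly once before being passed to Calloc.
reported <- -1398215180           # value from the error message
intended <- reported + 2^32       # undo one wraparound (double arithmetic)

intended                          # 2896752116
intended / 2^30                   # roughly 2.7 (GiB)
intended > .Machine$integer.max   # TRUE: does not fit in a 32-bit signed integer
```

If that assumption holds, the request was for about 2.7 GiB in one piece, which would explain why the error appears in both 32- and 64-bit mode despite 8 GB of RAM.
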
>
> Regards,
> Richard
>
> On Fri, 6 Nov 2009 08:00:06 -0500, Gabor Grothendieck wrote
>> Note that strapply without perl = TRUE runs an order of magnitude
>> faster than with perl = TRUE and accepts nearly the same set of regular
>> expressions anyway, since its default is Tcl regular expressions.
>> strsplit should still be fastest where it applies, since splitting is
>> its only purpose.
>>
>> On Fri, Nov 6, 2009 at 1:43 AM, Richard R. Liu <richard.liu at pueo-owl.ch> wrote:
>> > Bert,
>> >
>> > Thanks for the tip.  Yes, strsplit works, and works fast!  For me,
>> > white-space tokenization means splitting at the white spaces, so the "^" and
>> > the outermost square brackets should/can be omitted.
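
The difference between the two patterns can be sketched on the example string from Bert's message below, using only base R. Splitting on runs of non-whitespace returns the gaps, while splitting on runs of whitespace returns the tokens:

```r
x <- "xx  xdfg; *&^%kk    "   # example string from Bert's message

# Pattern as originally posted: the separators are runs of NON-whitespace,
# so what is left over is the whitespace between them.
gaps <- strsplit(x, "[^[:space:]]+")[[1]]

# Pattern with "^" and the outer brackets dropped: the separators are runs
# of whitespace, so what is left over is the tokens.
tokens <- strsplit(x, "[[:space:]]+")[[1]]

gaps
tokens
```

Note that a string with leading white space yields an empty first token with the second pattern, so such elements may need trimming or filtering first.
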
>> >
>> > Regards ... from Basel to South San Francisco,
>> > Richard
>> >
>> > On Nov 3, 2009, at 22:03 , Bert Gunter wrote:
>> >
>> >> Try:
>> >>
>> >> tokens <- strsplit(d,"[^[:space:]]+")
>> >>
>> >> This splits each "sentence" in your vector into a vector of groups of
>> >> whitespace characters that you can then play with as you described, I
>> >> think.  (The result is a list of such vectors -- see strsplit()).
>> >>
>> >> ## example:
>> >>
>> >>> x <- "xx  xdfg; *&^%kk    "
>> >>
>> >>> strsplit(x,"[^[:blank:]]+")
>> >>
>> >> [[1]]
>> >> [1] ""     "  "   " "    "    "
>> >>
>> >>
>> >> You might have to use perl = TRUE and "\\w+" depending on your locale and
>> >> what "[:space:]" does there.
>> >>
>> >> If this works, it should be way faster than strapply() and should not have
>> >> any memory allocation issues either.
>> >>
>> >> HTH.
>> >>
>> >> Bert Gunter
>> >> Genentech Nonclinical Biostatistics
>> >>
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Richard R. Liu
>> >> Sent: Tuesday, November 03, 2009 11:32 AM
>> >> To: Uwe Ligges
>> >> Cc: r-help at r-project.org
>> >> Subject: Re: [R] R 2.10.0: Error in gsub/calloc
>> >>
>> >> I apologize for not being clear.  d is a character vector of length
>> >> 158908.  Each element in the vector has been designated by sentDetect
>> >> (package: openNLP) as a sentence.  Some of these are really
>> >> sentences.  Others are merely groups of meaningless characters
>> >> separated by white space.  strapply is a function in the package
>> >> gsubfn.  It applies the regular expression (second argument) to each
>> >> element of the first argument.  Every match is then passed to the
>> >> designated function (third argument, in my case missing, hence the
>> >> identity function).  Thus, with strapply I am simply performing a
>> >> white-space tokenization of each sentence.  I am doing this in the
>> >> hope of being able to distinguish true sentences from false ones on
>> >> the basis of mean length of token, maximum length of token, or similar.
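
The token-length idea described above might look like the following in base R. This is a sketch only: the function name token_stats and the two sample strings are hypothetical and stand in for elements of d, and no threshold for separating true sentences from false ones is suggested here.

```r
# Hypothetical helper: per-element token count, mean and maximum token length.
token_stats <- function(s) {
  tokens <- strsplit(s, "[[:space:]]+")
  # Drop the empty token produced when an element starts with white space
  tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])
  data.frame(
    n_tokens = lengths(tokens),
    mean_len = vapply(tokens, function(tk) mean(nchar(tk)), numeric(1)),
    max_len  = vapply(tokens, function(tk) max(nchar(tk)), numeric(1))
  )
}

# Hypothetical sample data, not taken from the real document
d_sample <- c("This is a real sentence.", "x9 ;; q   zz #")
stats <- token_stats(d_sample)
stats
```

Elements that contain only white space have no tokens, so their mean and maximum are not meaningful and would need to be handled separately.
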
>> >>
>> >> Richard R. Liu
>> >> Dittingerstr. 33
>> >> CH-4053 Basel
>> >> Switzerland
>> >>
>> >> Tel.:  +41 61 331 10 47
>> >> Email:  richard.liu at pueo-owl.ch
>> >>
>> >>
>> >> On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:
>> >>
>> >>>
>> >>>
>> >>> richard.liu at pueo-owl.ch wrote:
>> >>>>
>> >>>> I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think
>> >>>> this is a Mac-specific problem.
>> >>>> I have a very large (158,908 possible sentences, ca. 58 MB) plain
>> >>>> text document d which I am trying to tokenize:
>> >>>> t <- strapply(d, "\\w+", perl = T).  I am encountering the
>> >>>> following error:
>> >>>
>> >>>
>> >>> What is strapply() and what is d?
>> >>>
>> >>> Uwe Ligges
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>> Error in base::gsub(pattern, rs, x, ...) :
>> >>>> Calloc could not allocate (-1398215180 of 1) memory
>> >>>> This happens regardless of whether I run in 32- or 64-bit mode.  The
>> >>>> machine has 8 GB of RAM, so
>> >>>> I can hardly believe that RAM is a problem.
>> >>>> Thanks,
>> >>>> Richard
>> >>>> ______________________________________________
>> >>>> R-help at r-project.org mailing list
>> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>> PLEASE do read the posting guide
>> >>
>> >> http://www.R-project.org/posting-guide.html
>> >>>>
>> >>>> and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >
>> >
>> >
>> >
>
>
> --
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Switzerland
>
> Tel.:  +41 61 331 10 47
> Email:  richard.liu at pueo-owl.ch
>
>



