[R] R 2.10.0: Error in gsub/calloc

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Nov 4 07:46:15 CET 2009


This seems to be simply integer overflow in a calculation.
Changed in R-patched to use doubles.

The issue I patched for Kenneth Roy Cabrera was for perl = FALSE only.
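
As a rough illustration of the kind of overflow described above (a sketch only, not the code in R's gsub internals): the multiplier k below is a hypothetical stand-in, since the actual size expression is not shown in this thread. In C a signed 32-bit product that exceeds the integer maximum typically wraps to a negative value, which is what the Calloc message reports; R-level integer arithmetic returns NA instead, while the same product in doubles is exact.

```r
# Illustration only: the real size calculation lives in R's C sources.
int_max <- .Machine$integer.max        # largest 32-bit signed integer
n <- 731248L                           # string length from the report below
k <- 3000L                             # HYPOTHETICAL per-character factor

# Integer product overflows; R yields NA (with a warning) rather than wrapping.
as_int <- suppressWarnings(n * k)

# The same product carried out in double precision is represented exactly.
as_dbl <- as.numeric(n) * k

# What a wrapping 32-bit signed product would look like in C (negative).
wrapped <- as_dbl - 2^32

is.na(as_int)        # the integer calculation failed
as_dbl > int_max     # the true size does not fit in a 32-bit int
wrapped < 0          # a wrapped value would be a negative allocation size
```

Carrying the size in a double, as the patch is said to do, keeps the value exact well beyond the 32-bit integer range.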

On Tue, 3 Nov 2009, William Dunlap wrote:

> Here is a more self-contained way to reproduce the problem in 2.10.0
> using the prebuilt Windows executable.  Putting a trace on gsub in
> the call to strapply showed that it died in the first call to gsub
> when the replacement included "\\1" and the string was about 900000
> characters long (and included 150000 "words").  It looks like it
> dies if the string is >= 731248 characters.
>
>> d<-substring(paste(collapse=" ", sapply(1:150000,function(i)"abcde")), 1, 731248)
>> nchar(d)
> [1] 731248
>> substring(d, nchar(d)-10)
> [1] " abcde abcd"
>> p<-gsub("([[:alpha:]]+)", "\\1", d, perl=FALSE)
> Error in gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
>  Calloc could not allocate (-2146542248 of 1) memory
> In addition: Warning messages:
> 1: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
>  Reached total allocation of 1535Mb: see help(memory.size)
> 2: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
>  Reached total allocation of 1535Mb: see help(memory.size)
>> p<-gsub("([[:alpha:]]+)", "\\1", d, perl=TRUE)
> Error in gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
>  Calloc could not allocate (-2146542248 of 1) memory
> In addition: Warning messages:
> 1: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
>  Reached total allocation of 1535Mb: see help(memory.size)
> 2: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
>  Reached total allocation of 1535Mb: see help(memory.size)
>
> Make d one character shorter and it succeeds with either
> perl=TRUE or perl=FALSE.
>
>> version
>               _
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status
> major          2
> minor          10.0
> year           2009
> month          10
> day            26
> svn rev        50208
> language       R
> version.string R version 2.10.0 (2009-10-26)
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] tcltk_2.10.0
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Richard R. Liu
>> Sent: Tuesday, November 03, 2009 3:00 PM
>> To: Kenneth Roy Cabrera Torres
>> Cc: r-help at r-project.org; Uwe Ligges
>> Subject: Re: [R] R 2.10.0: Error in gsub/calloc
>>
>> Kenneth,
>>
>> Thanks for the hint.  I downloaded and installed the latest patch, but
>> to no avail.  I can reproduce the error on a single sentence, the
>> longest in the document.  It contains 743,393 characters.  It isn't a
>> true sentence, but since it is more than three standard deviations
>> longer than the mean sentence length, I might be able to use the mean
>> and the standard deviation as a way of weeding out the really evident
>> "non-sentences" before I take into account the characteristics of the
>> tokens.
>>
>> Regards,
>> Richard
>>
>> On Nov 3, 2009, at 20:44 , Kenneth Roy Cabrera Torres wrote:
>>
>>> Try the patch version...
>>> Maybe it is the same problem I had with a large
>>> database when using gsub()
>>>
>>> HTH
>>>
>>> On Tue, 03-11-2009 at 20:31 +0100, Richard R. Liu wrote:
>>>> I apologize for not being clear.  d is a character vector of length
>>>> 158908.  Each element in the vector has been designated by sentDetect
>>>> (package: openNLP) as a sentence.  Some of these are really
>>>> sentences.  Others are merely groups of meaningless characters
>>>> separated by white space.  strapply is a function in the package
>>>> gsubfn.  It applies the regular expression (second argument) to each
>>>> element of the first argument.  Every match is then sent to the
>>>> designated function (third argument, in my case missing, hence the
>>>> identity function).  Thus, with strapply I am simply performing a
>>>> white-space tokenization of each sentence.  I am doing this in the
>>>> hope of being able to distinguish true sentences from false ones on
>>>> the basis of mean length of token, maximum length of token, or
>>>> similar.
>>>>
>>>> Richard R. Liu
>>>> Dittingerstr. 33
>>>> CH-4053 Basel
>>>> Switzerland
>>>>
>>>> Tel.:  +41 61 331 10 47
>>>> Email:  richard.liu at pueo-owl.ch
>>>>
>>>>
>>>> On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:
>>>>
>>>>>
>>>>>
>>>>> richard.liu at pueo-owl.ch wrote:
>>>>>> I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think
>>>>>> this is a Mac-specific problem.
>>>>>> I have a very large (158,908 possible sentences, ca. 58 MB) plain
>>>>>> text document d which I am trying to tokenize:
>>>>>> t <- strapply(d, "\\w+", perl = T).  I am encountering the
>>>>>> following error:
>>>>>
>>>>>
>>>>> What is strapply() and what is d?
>>>>>
>>>>> Uwe Ligges
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Error in base::gsub(pattern, rs, x, ...) :
>>>>>> Calloc could not allocate (-1398215180 of 1) memory
>>>>>> This happens regardless of whether I run in 32- or 64-bit mode.
>>>>>> The machine has 8 GB of RAM, so I can hardly believe that RAM is
>>>>>> a problem.
>>>>>> Thanks,
>>>>>> Richard
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>>
>>
>>
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

