[R] R 2.10.0: Error in gsub/calloc
William Dunlap
wdunlap at tibco.com
Wed Nov 4 05:21:31 CET 2009
Here is a more self-contained way to reproduce the problem in 2.10.0
using the prebuilt Windows executable. Putting a trace on gsub in
the call to strapply showed that it died in the first call to gsub
when the replacement included "\\1" and the string was about 900000
characters long (and included 150000 "words"). It looks like it
dies if the string is >= 731248 characters.
> d<-substring(paste(collapse=" ", sapply(1:150000,function(i)"abcde")), 1, 731248)
> nchar(d)
[1] 731248
> substring(d, nchar(d)-10)
[1] " abcde abcd"
> p<-gsub("([[:alpha:]]+)", "\\1", d, perl=FALSE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
Reached total allocation of 1535Mb: see help(memory.size)
> p<-gsub("([[:alpha:]]+)", "\\1", d, perl=TRUE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
Reached total allocation of 1535Mb: see help(memory.size)
Make d one character shorter and it succeeds with either
perl=TRUE or perl=FALSE.
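A note on the number in the error message, offered as a guess rather than a diagnosis: a negative size like -2146542248 is what a byte count just over 2^31 looks like once it has been stored in a signed 32-bit integer. The sketch below only reproduces that arithmetic in doubles. wrap_int32 is a made-up helper, and I have not checked where in the gsub code such a count would be computed.

```r
# Hypothetical helper: the value a count n takes when stored in a
# signed 32-bit integer (two's complement wraparound).
# Doubles are used so the arithmetic itself cannot overflow.
wrap_int32 <- function(n) {
  m <- n %% 2^32
  ifelse(m >= 2^31, m - 2^32, m)
}

requested <- 2^32 - 2146542248     # 2148425048 bytes, just over 2^31
wrapped <- wrap_int32(requested)
stopifnot(wrapped == -2146542248)  # the size printed in the Calloc error
```

By the same arithmetic, the -1398215180 in the original report would correspond to a request of 2896752116 bytes.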
> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          10.0
year           2009
month          10
day            26
svn rev        50208
language       R
version.string R version 2.10.0 (2009-10-26)
> sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tcltk_2.10.0
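For the tokenization described in the quoted message, one possible way to sidestep gsub with a back-reference is to split on non-word characters with strsplit and summarise the token lengths directly. This is only a sketch: token_stats is a hypothetical helper, not part of gsubfn, and I have not verified that strsplit avoids the allocation failure on 2.10.0 for strings of this size.

```r
# Hypothetical helper (not from gsubfn): split each element on runs of
# non-word characters and summarise the lengths of the resulting tokens.
token_stats <- function(x) {
  tokens <- strsplit(x, "[^[:alnum:]_]+")
  # strsplit leaves an empty first token when x starts with a separator
  tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])
  data.frame(
    n_tokens = vapply(tokens, length, integer(1)),
    mean_len = vapply(tokens, function(tk) {
      if (length(tk)) mean(nchar(tk)) else NA_real_
    }, numeric(1)),
    max_len = vapply(tokens, function(tk) {
      if (length(tk)) max(nchar(tk)) else NA_integer_
    }, integer(1))
  )
}

# One row per input element
token_stats(c("abcde abcde abcd", "  x yz "))
```

The mean_len and max_len columns are the sort of per-sentence summaries that could be used to separate real sentences from runs of meaningless characters.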
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Richard R. Liu
> Sent: Tuesday, November 03, 2009 3:00 PM
> To: Kenneth Roy Cabrera Torres
> Cc: r-help at r-project.org; Uwe Ligges
> Subject: Re: [R] R 2.10.0: Error in gsub/calloc
>
> Kenneth,
>
> Thanks for the hint. I downloaded and installed the latest patch, but
> to no avail. I can reproduce the error on a single sentence, the
> longest in the document. It contains 743,393 characters. It isn't a
> true sentence, but since it is more than three standard deviations
> longer than the mean sentence length, I might be able to use the mean
> and the standard deviation as a way of weeding out the really evident
> "non-sentences" before I take into account the characteristics of
> the tokens.
>
> Regards,
> Richard
>
> On Nov 3, 2009, at 20:44 , Kenneth Roy Cabrera Torres wrote:
>
> > Try the patched version...
> > Maybe it is the same problem I had with a large
> > database when using gsub()
> >
> > HTH
> >
> > On Tue, 03-11-2009 at 20:31 +0100, Richard R. Liu wrote:
> >> I apologize for not being clear. d is a character vector of length
> >> 158908. Each element in the vector has been designated by sentDetect
> >> (package: openNLP) as a sentence. Some of these are really
> >> sentences. Others are merely groups of meaningless characters
> >> separated by white space. strapply is a function in the package
> >> gsubfn. It applies the regular expression (second argument) to each
> >> element of the first argument. Every match is then sent to the
> >> designated function (third argument, in my case missing, hence the
> >> identity function). Thus, with strapply I am simply performing a
> >> white-space tokenization of each sentence. I am doing this in the
> >> hope of being able to distinguish true sentences from false ones on
> >> the basis of mean length of token, maximum length of token, or
> >> similar.
> >>
> >> Richard R. Liu
> >> Dittingerstr. 33
> >> CH-4053 Basel
> >> Switzerland
> >>
> >> Tel.: +41 61 331 10 47
> >> Email: richard.liu at pueo-owl.ch
> >>
> >>
> >> On Nov 3, 2009, at 18:30 , Uwe Ligges wrote:
> >>
> >>>
> >>>
> >>> richard.liu at pueo-owl.ch wrote:
> >>>> I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think
> >>>> this is a Mac-specific problem.
> >>>> I have a very large (158,908 possible sentences, ca. 58 MB) plain
> >>>> text document d which I am trying to tokenize:
> >>>> t <- strapply(d, "\\w+", perl = T). I am encountering the
> >>>> following error:
> >>>
> >>>
> >>> What is strapply() and what is d?
> >>>
> >>> Uwe Ligges
> >>>
> >>>
> >>>
> >>>
> >>>> Error in base::gsub(pattern, rs, x, ...) :
> >>>> Calloc could not allocate (-1398215180 of 1) memory
> >>>> This happens regardless of whether I run in 32- or 64-bit mode.
> >>>> The machine has 8 GB of RAM, so I can hardly believe that RAM is
> >>>> a problem.
> >>>> Thanks,
> >>>> Richard
> >>>> ______________________________________________
> >>>> R-help at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >>
>
>