[Rd] Memory issues in "aggregate" (PR#5829)

Prof Brian D Ripley ripley at stats.ox.ac.uk
Tue Dec 16 09:03:21 MET 2003


I have read through this and been unable to find a *bug* report in it.
Please read the section on BUGS in the R FAQ and explain where the bug here
is -- as far as I can see it is that a user has managed to write code that
exceeds the memory capacity of his computer.
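
A rough sketch of how that can happen, assuming aggregate() tabulates over
every combination of the grouping factors' levels (as tapply() does) rather
than only the combinations present in the data. The data frame below is made
up; only the column names come from the report, and the level counts are
hypothetical.

```r
# Hypothetical stand-in for profiles.csv: 1000 rows, three grouping factors.
set.seed(1)
n <- 1000
cnumbers <- paste("C", 1:50, sep = "")
ips      <- paste("10.0.0.", 1:200, sep = "")
dates    <- as.character(as.Date("2003-07-01") + 0:29)

profiles <- data.frame(
  CNumber   = factor(sample(cnumbers, n, replace = TRUE), levels = cnumbers),
  IP        = factor(sample(ips,      n, replace = TRUE), levels = ips),
  localdate = factor(sample(dates,    n, replace = TRUE), levels = dates)
)

groups <- profiles[c("CNumber", "IP", "localdate")]

# Cells in the full cross-classification: the product of the level counts.
full.cells <- prod(sapply(groups, nlevels))

# Combinations that actually occur: never more than the number of rows.
observed.cells <- nrow(unique(groups))

cat("full cross-classification:", full.cells, "cells\n")
cat("combinations present:     ", observed.cells, "\n")
```

Adding a third grouping factor multiplies the cell count by its number of
levels, whereas the number of occupied cells is bounded by the row count.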

On Tue, 16 Dec 2003 znmeb at aracnet.com wrote:

> Full_Name: Ed Borasky
> Version: 1.8.1
> OS: Windows XP Professional
> Submission from: (NULL) (208.252.96.195)
>
>
> R 1.8.1 seems to be running into a memory allocation problem in the "aggregate"
> function. I have a rather large dataset (14 columns by 223,000 rows -- almost 40
> megabytes) and a script that performs some processing on it. The system is a 768
> MB Pentium 4. Here's the console log:
>
> ---------------------------------------------------------------------------------
> R : Copyright 2003, The R Foundation for Statistical Computing
> Version 1.8.1  (2003-11-21), ISBN 3-900051-00-3
>
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R in publications.
>
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for a HTML browser interface to help.
> Type 'q()' to quit R.
>
> [Previously saved workspace restored]
>
> > source("script.R",echo=TRUE)
>
> > rm(list = ls())
>
> > cvar <- function(x) sd(x)/mean(x)
>
> > library(sm)
> Library `sm', version 2; Copyright (C) 1997, 2000 A.W.Bowman & A.Azzalini
> type help(sm) for summary information
>
> > callsperhour <- function(x) length(x)/12
>
> > profiles <- subset(read.csv("profiles.csv"), hourofday >=
>     7 & hourofday <= 19 & dayofweek >= 1 & dayofweek <= 5)
>
> > nrow(profiles)
> [1] 100520
>
> > attach(profiles)
>
> > pseudo.hist <- aggregate(duration, list(Delta), length)
>
> > colnames(pseudo.hist) <- c("Delta", "N")
>
> > detach(profiles)
>
> > gc()
>           used (Mb) gc trigger (Mb)
> Ncells  701188 18.8    2683553 71.7
> Vcells 1447712 11.1    8201413 62.6
>
> > memory.profile()
>     NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP
>          1       7228     244243       3949        495        773     113819
> SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP
>        207       1177     283663       4661          0          0         49
>    REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP
>      13383          9      24870          0          0       2598          2
>   BCODESXP  EXTPTRSXP WEAKREFSXP
>          0         93          0
>
> > memory.size(max = TRUE)
> [1] 224669696
>
> > memory.size(max = FALSE)
> [1] 81072656
>
> > attach(pseudo.hist)
>
> > pseudo.hist <- pseudo.hist[order(as.numeric(as.character(Delta))),
>     ]
>
> > write.table(pseudo.hist, file = "pseudo-hist.csv",
>     sep = ",", row.names = FALSE)
>
> > detach(pseudo.hist)
>
> > gc()
>           used (Mb) gc trigger (Mb)
> Ncells  701228 18.8    2146842 57.4
> Vcells 1447740 11.1    5248904 40.1
>
> > memory.profile()
>     NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP
>          1       7237     244261       3949        495        773     113819
> SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP
>        207       1177     283672       4661          0          0         49
>    REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP
>      13383          9      24870          0          0       2598          2
>   BCODESXP  EXTPTRSXP WEAKREFSXP
>          0         93          0
>
> > memory.size(max = TRUE)
> [1] 224669696
>
> > memory.size(max = FALSE)
> [1] 81072656
>
> > attach(profiles)
>
> > cphs.site <- aggregate(Timestamp, list(CNumber, localdate),
>     callsperhour)
>
> > colnames(cphs.site) <- c("CNumber", "localdate", "CallsPerHour")
>
> > detach(profiles)
>
> > gc()
>           used (Mb) gc trigger (Mb)
> Ncells  701695 18.8    2146842 57.4
> Vcells 1449346 11.1    5248904 40.1
>
> > memory.profile()
>     NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP
>          1       7240     244277       3949        495        773     113819
> SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP
>        207       1177     284109       4661          0          0         51
>    REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP
>      13384          9      24877          0          0       2599          2
>   BCODESXP  EXTPTRSXP WEAKREFSXP
>          0         93          0
>
> > memory.size(max = TRUE)
> [1] 224669696
>
> > memory.size(max = FALSE)
> [1] 82444104
>
> > attach(cphs.site)
>
> > cphs.site <- cphs.site[order(CNumber, localdate),
>     ]
>
> > write.table(cphs.site, file = "cphs-site.csv", sep = ",",
>     row.names = FALSE)
>
> > detach(cphs.site)
>
> > gc()
>           used (Mb) gc trigger (Mb)
> Ncells  701701 18.8    2146842 57.4
> Vcells 1449350 11.1    5248904 40.1
>
> > memory.profile()
>     NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP
>          1       7242     244279       3949        495        773     113819
> SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP
>        207       1177     284111       4661          0          0         51
>    REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP
>      13384          9      24877          0          0       2599          2
>   BCODESXP  EXTPTRSXP WEAKREFSXP
>          0         93          0
>
> > memory.size(max = TRUE)
> [1] 224669696
>
> > memory.size(max = FALSE)
> [1] 82444104
>
> > attach(profiles)
>
> > cphs <- aggregate(Timestamp, list(CNumber, IP, localdate),
>     callsperhour)
> Error in makeRestartList(...) : evaluation is nested too deeply: infinite recursion?
> >
> ------------------------------------------------------------------------------------
> "profiles.csv" is the 40 MB file. Here's the R code that generates the error:
> ------------------------------------------------------------------------------------
> # keep a log file
> #sink ("script.log")
>
> # clean house
> rm (list=ls())
>
> # definitions, libraries
> cvar<-function(x) sd(x)/mean(x); # coefficient of variation
> library(sm)
> callsperhour<-function(x) length(x)/12
>
> # load data
> profiles<-subset(read.csv("profiles.csv"),
> 	#as.character(localdate)<"2003-07-19"
> 	#&hourofday>=7
> 	hourofday>=7
> 	&hourofday<=19
> 	&dayofweek>=1
> 	&dayofweek<=5)
> #profiles<-subset(profiles,!(localdate=="2003-07-11"&CNumber=="C132185"))
> nrow(profiles)
>
> # compute pseudo-histogram
> attach(profiles)
> pseudo.hist<-aggregate(duration,list(Delta),length)
> colnames(pseudo.hist)<-c("Delta", "N")
> detach(profiles)
> gc()
> memory.profile()
> memory.size(max=TRUE)
> memory.size(max=FALSE)
>
> attach (pseudo.hist)
> pseudo.hist<-pseudo.hist[order(as.numeric(as.character(Delta))),]
> #print (pseudo.hist)
> write.table (pseudo.hist, file="pseudo-hist.csv", sep=",",
> 	row.names=FALSE)
> detach(pseudo.hist)
> gc()
> memory.profile()
> memory.size(max=TRUE)
> memory.size(max=FALSE)
>
> # compute calls per hour for each site/date combo
> attach(profiles)
> cphs.site<-aggregate(Timestamp,list(CNumber,localdate),callsperhour)
> colnames(cphs.site)<-c("CNumber","localdate","CallsPerHour")
> detach(profiles)
> gc()
> memory.profile()
> memory.size(max=TRUE)
> memory.size(max=FALSE)
>
> attach(cphs.site)
> cphs.site<-cphs.site[order(CNumber,localdate),]
> #print (cphs.site)
> write.table (cphs.site, file="cphs-site.csv", sep=",",
> 	row.names=FALSE)
> detach(cphs.site)
> gc()
> memory.profile()
> memory.size(max=TRUE)
> memory.size(max=FALSE)
>
> # compute calls per hour for each site/IP/date combo
> attach(profiles)
> cphs<-aggregate(Timestamp,list(CNumber,IP,localdate),callsperhour)
> colnames(cphs)<-c("CNumber","IP","localdate","CallsPerHour")
> detach(profiles)
> gc()
> memory.profile()
> memory.size(max=TRUE)
> memory.size(max=FALSE)
> ------------------------------------------------------------------------------------
> ... that's as far as it gets; it croaks in the "aggregate". Before I put all the
> "gc()" and other diagnostics in, it was croaking with a different error --
> cannot allocate a 15 MB vector.
>
> If you want, I'll zip up the datafile and see how big it is. I'm assuming this
> is something simple that I did wrong, though. I'm going to try dropping the
> extraneous columns before doing the "aggregate"; that might get the object sizes
> down significantly.
>
> ______________________________________________
> R-devel at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


