[Rd] Memory issues in "aggregate" (PR#5829)

znmeb at aracnet.com
Tue Dec 16 00:15:45 MET 2003


Full_Name: Ed Borasky
Version: 1.8.1
OS: Windows XP Professional
Submission from: (NULL) (208.252.96.195)


R 1.8.1 seems to run into a memory allocation problem in the "aggregate"
function. I have a rather large dataset (14 columns by 223,000 rows, almost 40
megabytes as a CSV file) and a script that performs some processing on it. The
machine is a Pentium 4 with 768 MB of RAM. Here's the console log:

---------------------------------------------------------------------------------
R : Copyright 2003, The R Foundation for Statistical Computing
Version 1.8.1  (2003-11-21), ISBN 3-900051-00-3

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

> source("script.R",echo=TRUE)

> rm(list = ls())

> cvar <- function(x) sd(x)/mean(x)

> library(sm)
Library `sm', version 2; Copyright (C) 1997, 2000 A.W.Bowman & A.Azzalini
type help(sm) for summary information

> callsperhour <- function(x) length(x)/12

> profiles <- subset(read.csv("profiles.csv"), hourofday >= 
    7 & hourofday <= 19 & dayofweek >= 1 & dayofweek <= 5)

> nrow(profiles)
[1] 100520

> attach(profiles)

> pseudo.hist <- aggregate(duration, list(Delta), length)

> colnames(pseudo.hist) <- c("Delta", "N")

> detach(profiles)

> gc()
          used (Mb) gc trigger (Mb)
Ncells  701188 18.8    2683553 71.7
Vcells 1447712 11.1    8201413 62.6

> memory.profile()
    NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP 
         1       7228     244243       3949        495        773     113819 
SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP 
       207       1177     283663       4661          0          0         49 
   REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP 
     13383          9      24870          0          0       2598          2 
  BCODESXP  EXTPTRSXP WEAKREFSXP 
         0         93          0 

> memory.size(max = TRUE)
[1] 224669696

> memory.size(max = FALSE)
[1] 81072656

> attach(pseudo.hist)

> pseudo.hist <- pseudo.hist[order(as.numeric(as.character(Delta))), 
    ]

> write.table(pseudo.hist, file = "pseudo-hist.csv", 
    sep = ",", row.names = FALSE)

> detach(pseudo.hist)

> gc()
          used (Mb) gc trigger (Mb)
Ncells  701228 18.8    2146842 57.4
Vcells 1447740 11.1    5248904 40.1

> memory.profile()
    NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP 
         1       7237     244261       3949        495        773     113819 
SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP 
       207       1177     283672       4661          0          0         49 
   REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP 
     13383          9      24870          0          0       2598          2 
  BCODESXP  EXTPTRSXP WEAKREFSXP 
         0         93          0 

> memory.size(max = TRUE)
[1] 224669696

> memory.size(max = FALSE)
[1] 81072656

> attach(profiles)

> cphs.site <- aggregate(Timestamp, list(CNumber, localdate), 
    callsperhour)

> colnames(cphs.site) <- c("CNumber", "localdate", "CallsPerHour")

> detach(profiles)

> gc()
          used (Mb) gc trigger (Mb)
Ncells  701695 18.8    2146842 57.4
Vcells 1449346 11.1    5248904 40.1

> memory.profile()
    NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP 
         1       7240     244277       3949        495        773     113819 
SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP 
       207       1177     284109       4661          0          0         51 
   REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP 
     13384          9      24877          0          0       2599          2 
  BCODESXP  EXTPTRSXP WEAKREFSXP 
         0         93          0 

> memory.size(max = TRUE)
[1] 224669696

> memory.size(max = FALSE)
[1] 82444104

> attach(cphs.site)

> cphs.site <- cphs.site[order(CNumber, localdate), 
    ]

> write.table(cphs.site, file = "cphs-site.csv", sep = ",", 
    row.names = FALSE)

> detach(cphs.site)

> gc()
          used (Mb) gc trigger (Mb)
Ncells  701701 18.8    2146842 57.4
Vcells 1449350 11.1    5248904 40.1

> memory.profile()
    NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP 
         1       7242     244279       3949        495        773     113819 
SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP 
       207       1177     284111       4661          0          0         51 
   REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP 
     13384          9      24877          0          0       2599          2 
  BCODESXP  EXTPTRSXP WEAKREFSXP 
         0         93          0 

> memory.size(max = TRUE)
[1] 224669696

> memory.size(max = FALSE)
[1] 82444104

> attach(profiles)

> cphs <- aggregate(Timestamp, list(CNumber, IP, localdate), 
    callsperhour)
Error in makeRestartList(...) : evaluation is nested too deeply: infinite
recursion?
> 
------------------------------------------------------------------------------------
"profiles.csv" is the 40 MB file. Here's the R code that generates the error:
------------------------------------------------------------------------------------
# keep a log file
#sink ("script.log")

# clean house
rm (list=ls())

# definitions, libraries
cvar<-function(x) sd(x)/mean(x); # coefficient of variation
library(sm)
callsperhour<-function(x) length(x)/12

# load data
profiles<-subset(read.csv("profiles.csv"),
	#as.character(localdate)<"2003-07-19"
	#&hourofday>=7
	hourofday>=7
	&hourofday<=19
	&dayofweek>=1
	&dayofweek<=5)
#profiles<-subset(profiles,!(localdate=="2003-07-11"&CNumber=="C132185"))
nrow(profiles)

# compute pseudo-histogram
attach(profiles)
pseudo.hist<-aggregate(duration,list(Delta),length)
colnames(pseudo.hist)<-c("Delta", "N")
detach(profiles)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)

attach (pseudo.hist)
pseudo.hist<-pseudo.hist[order(as.numeric(as.character(Delta))),]
#print (pseudo.hist)
write.table (pseudo.hist, file="pseudo-hist.csv", sep=",",
	row.names=FALSE)
detach(pseudo.hist)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)

# compute calls per hour for each site/date combo
attach(profiles)
cphs.site<-aggregate(Timestamp,list(CNumber,localdate),callsperhour)
colnames(cphs.site)<-c("CNumber","localdate","CallsPerHour")
detach(profiles)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)

attach(cphs.site)
cphs.site<-cphs.site[order(CNumber,localdate),]
#print (cphs.site)
write.table (cphs.site, file="cphs-site.csv", sep=",",
	row.names=FALSE)
detach(cphs.site)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)

# compute calls per hour for each site/IP/date combo
attach(profiles)
cphs<-aggregate(Timestamp,list(CNumber,IP,localdate),callsperhour)
colnames(cphs)<-c("CNumber","IP","localdate","CallsPerHour")
detach(profiles)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)
------------------------------------------------------------------------------------
That's as far as it gets: it croaks in the third "aggregate" call, the one that
groups by CNumber, IP and localdate. Before I put all the "gc()" calls and other
diagnostics in, it was croaking with a different error: cannot allocate a 15 MB
vector.

If you want, I'll zip up the datafile and see how big it is. I'm assuming this
is something simple that I did wrong, though. I'm going to try dropping the
extraneous columns before doing the "aggregate"; that might get the object sizes
down significantly.
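
A minimal sketch of the column-dropping idea, not a confirmed fix for the error
above. The data frame below is a small hypothetical stand-in for "profiles.csv":
only the column names Timestamp, CNumber, IP, localdate and duration come from
the script, while the values, the "extra1" column and the names "slim", "key"
and "counts" are assumptions made for illustration. It also shows table() as one
possible alternative way to count rows per group.

```r
# Hypothetical stand-in for profiles.csv (values are made up).
set.seed(1)
n <- 1000
profiles <- data.frame(
  Timestamp = seq_len(n),
  CNumber   = sample(c("C1", "C2", "C3"), n, replace = TRUE),
  IP        = sample(c("10.0.0.1", "10.0.0.2"), n, replace = TRUE),
  localdate = sample(c("2003-07-10", "2003-07-11"), n, replace = TRUE),
  duration  = runif(n),
  extra1    = rnorm(n),  # placeholder for the columns the aggregation never uses
  stringsAsFactors = FALSE
)

callsperhour <- function(x) length(x) / 12

# Keep only the columns the aggregation needs, so any copies made
# along the way are copies of a smaller object.
slim <- profiles[, c("Timestamp", "CNumber", "IP", "localdate")]

# Same aggregation as in the script, but on the slimmed data and
# without attach(): columns are referenced explicitly.
cphs <- aggregate(slim$Timestamp,
                  list(CNumber = slim$CNumber,
                       IP = slim$IP,
                       localdate = slim$localdate),
                  callsperhour)
colnames(cphs)[4] <- "CallsPerHour"

# Alternative (an assumption, not the original approach): since
# callsperhour only counts rows, tabulate a combined key instead.
key <- paste(slim$CNumber, slim$IP, slim$localdate, sep = "|")
counts <- table(key) / 12
```

Whether this avoids the "nested too deeply" error on the real data is untested
here; it only reduces the size of the object handed to "aggregate".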


