[Rd] Memory issues in "aggregate" (PR#5829)
znmeb at aracnet.com
Tue Dec 16 00:15:45 MET 2003
Full_Name: Ed Borasky
Version: 1.8.1
OS: Windows XP Professional
Submission from: (NULL) (208.252.96.195)
R 1.8.1 seems to be running into a memory allocation problem in the "aggregate"
function. I have a rather large dataset (14 columns by 223,000 rows, almost 40
megabytes as a CSV file) and a script that performs some processing on it. The
system is a Pentium 4 with 768 MB of RAM. Here's the console log:
---------------------------------------------------------------------------------
R : Copyright 2003, The R Foundation for Statistical Computing
Version 1.8.1 (2003-11-21), ISBN 3-900051-00-3
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.
[Previously saved workspace restored]
> source("script.R",echo=TRUE)
> rm(list = ls())
> cvar <- function(x) sd(x)/mean(x)
> library(sm)
Library `sm', version 2; Copyright (C) 1997, 2000 A.W.Bowman & A.Azzalini
type help(sm) for summary information
> callsperhour <- function(x) length(x)/12
> profiles <- subset(read.csv("profiles.csv"), hourofday >=
7 & hourofday <= 19 & dayofweek >= 1 & dayofweek <= 5)
> nrow(profiles)
[1] 100520
> attach(profiles)
> pseudo.hist <- aggregate(duration, list(Delta), length)
> colnames(pseudo.hist) <- c("Delta", "N")
> detach(profiles)
> gc()
          used (Mb) gc trigger (Mb)
Ncells  701188 18.8    2683553 71.7
Vcells 1447712 11.1    8201413 62.6
> memory.profile()
    NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP
         1       7228     244243       3949        495        773     113819
SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP
       207       1177     283663       4661          0          0         49
   REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP
     13383          9      24870          0          0       2598          2
  BCODESXP  EXTPTRSXP WEAKREFSXP
         0         93          0
> memory.size(max = TRUE)
[1] 224669696
> memory.size(max = FALSE)
[1] 81072656
> attach(pseudo.hist)
> pseudo.hist <- pseudo.hist[order(as.numeric(as.character(Delta))),
]
> write.table(pseudo.hist, file = "pseudo-hist.csv",
sep = ",", row.names = FALSE)
> detach(pseudo.hist)
> gc()
          used (Mb) gc trigger (Mb)
Ncells  701228 18.8    2146842 57.4
Vcells 1447740 11.1    5248904 40.1
> memory.profile()
    NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP
         1       7237     244261       3949        495        773     113819
SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP
       207       1177     283672       4661          0          0         49
   REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP
     13383          9      24870          0          0       2598          2
  BCODESXP  EXTPTRSXP WEAKREFSXP
         0         93          0
> memory.size(max = TRUE)
[1] 224669696
> memory.size(max = FALSE)
[1] 81072656
> attach(profiles)
> cphs.site <- aggregate(Timestamp, list(CNumber, localdate),
callsperhour)
> colnames(cphs.site) <- c("CNumber", "localdate", "CallsPerHour")
> detach(profiles)
> gc()
          used (Mb) gc trigger (Mb)
Ncells  701695 18.8    2146842 57.4
Vcells 1449346 11.1    5248904 40.1
> memory.profile()
    NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP
         1       7240     244277       3949        495        773     113819
SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP
       207       1177     284109       4661          0          0         51
   REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP
     13384          9      24877          0          0       2599          2
  BCODESXP  EXTPTRSXP WEAKREFSXP
         0         93          0
> memory.size(max = TRUE)
[1] 224669696
> memory.size(max = FALSE)
[1] 82444104
> attach(cphs.site)
> cphs.site <- cphs.site[order(CNumber, localdate),
]
> write.table(cphs.site, file = "cphs-site.csv", sep = ",",
row.names = FALSE)
> detach(cphs.site)
> gc()
          used (Mb) gc trigger (Mb)
Ncells  701701 18.8    2146842 57.4
Vcells 1449350 11.1    5248904 40.1
> memory.profile()
    NILSXP     SYMSXP    LISTSXP     CLOSXP     ENVSXP    PROMSXP    LANGSXP
         1       7242     244279       3949        495        773     113819
SPECIALSXP BUILTINSXP    CHARSXP     LGLSXP                           INTSXP
       207       1177     284111       4661          0          0         51
   REALSXP    CPLXSXP     STRSXP     DOTSXP     ANYSXP     VECSXP    EXPRSXP
     13384          9      24877          0          0       2599          2
  BCODESXP  EXTPTRSXP WEAKREFSXP
         0         93          0
> memory.size(max = TRUE)
[1] 224669696
> memory.size(max = FALSE)
[1] 82444104
> attach(profiles)
> cphs <- aggregate(Timestamp, list(CNumber, IP, localdate),
callsperhour)
Error in makeRestartList(...) : evaluation is nested too deeply: infinite
recursion?
>
------------------------------------------------------------------------------------
"profiles.csv" is the 40 MB file. Here's the R code that generates the error:
------------------------------------------------------------------------------------
# keep a log file
#sink ("script.log")
# clean house
rm (list=ls())
# definitions, libraries
cvar<-function(x) sd(x)/mean(x); # coefficient of variation
library(sm)
callsperhour<-function(x) length(x)/12
# load data
profiles<-subset(read.csv("profiles.csv"),
#as.character(localdate)<"2003-07-19"
#&hourofday>=7
hourofday>=7
&hourofday<=19
&dayofweek>=1
&dayofweek<=5)
#profiles<-subset(profiles,!(localdate=="2003-07-11"&CNumber=="C132185"))
nrow(profiles)
# compute pseudo-histogram
attach(profiles)
pseudo.hist<-aggregate(duration,list(Delta),length)
colnames(pseudo.hist)<-c("Delta", "N")
detach(profiles)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)
attach (pseudo.hist)
pseudo.hist<-pseudo.hist[order(as.numeric(as.character(Delta))),]
#print (pseudo.hist)
write.table (pseudo.hist, file="pseudo-hist.csv", sep=",",
row.names=FALSE)
detach(pseudo.hist)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)
# compute calls per hour for each site/date combo
attach(profiles)
cphs.site<-aggregate(Timestamp,list(CNumber,localdate),callsperhour)
colnames(cphs.site)<-c("CNumber","localdate","CallsPerHour")
detach(profiles)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)
attach(cphs.site)
cphs.site<-cphs.site[order(CNumber,localdate),]
#print (cphs.site)
write.table (cphs.site, file="cphs-site.csv", sep=",",
row.names=FALSE)
detach(cphs.site)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)
# compute calls per hour for each site/IP/date combo
attach(profiles)
cphs<-aggregate(Timestamp,list(CNumber,IP,localdate),callsperhour)
colnames(cphs)<-c("CNumber","IP","localdate","CallsPerHour")
detach(profiles)
gc()
memory.profile()
memory.size(max=TRUE)
memory.size(max=FALSE)
------------------------------------------------------------------------------------
... that's as far as it gets; it croaks in the third "aggregate", the one that
groups by CNumber, IP and localdate. Before I put all the "gc()" calls and other
diagnostics in, it was croaking with a different error -- cannot allocate a
15 MB vector.
If you want, I'll zip up the datafile and see how big it is. I'm assuming this
is something simple that I did wrong, though. I'm going to try dropping the
extraneous columns before doing the "aggregate"; that might get the object sizes
down significantly.
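
For reference, here is a minimal sketch of what dropping the extraneous columns before the third "aggregate" might look like. The data frame below is a small synthetic stand-in for "profiles.csv" (its values are made up; only the column names come from the script above), and "slim" is a name I'm introducing just for this illustration. I haven't checked whether this avoids the error on the real data.

```r
# Synthetic stand-in for profiles.csv; the values are invented for illustration.
set.seed(1)
n <- 1000
profiles <- data.frame(
  CNumber   = sample(c("C1", "C2", "C3"), n, replace = TRUE),
  IP        = sample(c("10.0.0.1", "10.0.0.2"), n, replace = TRUE),
  localdate = sample(c("2003-07-10", "2003-07-11"), n, replace = TRUE),
  Timestamp = seq_len(n),
  duration  = runif(n),
  Delta     = sample(1:20, n, replace = TRUE)
)

callsperhour <- function(x) length(x) / 12

# Keep only the columns the site/IP/date aggregate actually uses.
slim <- profiles[, c("CNumber", "IP", "localdate", "Timestamp")]

# Pass the columns explicitly rather than relying on attach()/detach().
cphs <- aggregate(slim$Timestamp,
                  by = list(CNumber = slim$CNumber,
                            IP = slim$IP,
                            localdate = slim$localdate),
                  FUN = callsperhour)
colnames(cphs)[4] <- "CallsPerHour"

# Same ordering idea as for cphs.site, extended to the IP column.
cphs <- cphs[order(cphs$CNumber, cphs$IP, cphs$localdate), ]
```

Naming the grouping columns in the "by" list means only the value column needs renaming afterwards, and referring to "slim$..." directly avoids having an attached copy of the full data frame on the search path while the aggregate runs.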