[R] Memory usage in read.csv()

nabble.30.miller_2555 at spamgourmet.com
Tue Jan 19 15:25:07 CET 2010


I'm sure this has gotten some attention before, but I have two CSV
files generated from vmstat and free that are roughly 6-8 MB (about
80,000 lines) each. When I try to load them with read.csv(), R
allocates all available memory (about 4.9 GB), which is over 300 times
the size of the raw data. Here are the scripts used to generate the
CSV files, as well as the R code:

Scripts (run for roughly a 24-hour period):
    vmstat -ant 1 | awk 'BEGIN { OFS = "," } $0 !~ /(proc|free)/ {
        print strftime("%F %T %Z"), $6, $7, $12, $13, $14, $15, $16, $17 }' >> ~/vmstat_20100118_133845.o
    free -ms 1 | awk 'BEGIN { OFS = "," } /Mem:/ {
        print strftime("%F %T %Z"), $2, $3, $4, $5, $6, $7 }' >> ~/memfree_20100118_140845.o
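
For anyone who wants to reproduce: a quick sanity check of the raw
input sizes from within R (just a sketch, using the same file paths as
above):

    infiles <- c("~/vmstat_20100118_133845.o", "~/memfree_20100118_140845.o")
    file.info(infiles)$size / 2^20                      # raw file size in MB
    sapply(infiles, function(f) length(readLines(f)))   # line counts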

R code:
    infile.vms <- "~/vmstat_20100118_133845.o"
    infile.mem <- "~/memfree_20100118_140845.o"
    vms.colnames <- c("time", "r", "b", "swpd", "free", "inact", "active",
                      "si", "so", "bi", "bo", "in", "cs", "us", "sy",
                      "id", "wa", "st")
    vms.colclass <- c("character", rep("integer", length(vms.colnames) - 1))
    mem.colnames <- c("time", "total", "used", "free", "shared", "buffers",
                      "cached")
    mem.colclass <- c("character", rep("integer", length(mem.colnames) - 1))
    vmsdf <- read.csv(infile.vms, header = FALSE,
                      colClasses = vms.colclass, col.names = vms.colnames)
    memdf <- read.csv(infile.mem, header = FALSE,
                      colClasses = mem.colclass, col.names = mem.colnames)
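
To separate what the resulting data frames actually hold from what R
allocated while parsing, a rough check (a minimal sketch using
object.size() and gc()):

    print(object.size(vmsdf), units = "Mb")   # in-memory size of the data frame
    print(object.size(memdf), units = "Mb")
    gc()   # the "max used" column shows peak allocation during the reads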

I am running R v2.10.0 on a 64-bit machine with Fedora 10 (Linux
kernel 2.6.27.41-170.2.117.fc10.x86_64) with 6 GB of memory. There are
no other significant programs running, and `rm()` followed by `gc()`
successfully frees the memory (followed by swap-ins as other programs
try to use previously cached information that was swapped to disk).
I've incorporated the memory-saving suggestions from the `read.csv()`
manual page, excluding the limit on the number of lines read (which
shouldn't really be necessary here, since we're only talking about
< 20 MB of raw data). Any suggestions, or is the read.csv() code known
to have memory leak/overcommit issues?
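
For completeness, the variant with the line limit suggested on the
manual page would look roughly like this (nrows = 90000 is just a
guess slightly above the true line count; comment.char = "" disables
comment scanning, another of the documented memory tips):

    vmsdf <- read.csv(infile.vms, header = FALSE,
                      colClasses = vms.colclass, col.names = vms.colnames,
                      nrows = 90000, comment.char = "")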

Thanks


