[R] Memory usage in read.csv()

nabble.30.miller_2555 at spamgourmet.com
Thu Jan 21 06:39:08 CET 2010


Hi Jim & Gabor -

   It was most likely a hardware issue (shortly after sending my
last e-mail, the computer promptly died). After buying a new system
and restoring, the script now runs fine. Thanks for your help!

On Tue, Jan 19, 2010 at 2:02 PM, jim holtman <jholtman at gmail.com> wrote:
> I read vmstat data in just fine without any problems.  Here is an
> example of how I do it:
>
> VMstat <- read.table('vmstat.txt', header=TRUE, as.is=TRUE)
>
> vmstat.txt looks like this:
>
> date time r b w swap free re mf pi po fr de sr intr syscalls cs user sys id
> 07/27/05 00:13:06 0 0 0 27755440 13051648 20 86 0 0 0 0 0 456 2918 1323 0 1 99
> 07/27/05 00:13:36 0 0 0 27755280 13051480 11 53 0 0 0 0 0 399 1722 1411 0 1 99
> 07/27/05 00:14:06 0 0 0 27753952 13051248 18 88 0 0 0 0 0 424 1259 1254 0 1 99
> 07/27/05 00:14:36 0 0 0 27755304 13051496 17 85 0 0 0 0 0 430 1029 1246 0 1 99
> 07/27/05 00:15:06 0 0 0 27755064 13051232 41 278 0 1 1 0 0 452 2047 1386 0 1 99
> 07/27/05 00:15:36 0 0 0 27753824 13040720 125 1039 0 0 0 0 0 664 4097 1901 3 2 95
> 07/27/05 00:16:06 0 0 0 27754472 13027000 15 91 0 0 0 0 0 432 1160 1273 0 1 99
> 07/27/05 00:16:36 0 0 0 27754568 13027104 17 85 0 0 0 0 0 416 1058 1271 0 1 99
>
> Have you tried a smaller portion of data?
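>
> A minimal way to do that (a sketch, assuming the same file layout as
> above; nrows= and skip= are standard read.table arguments):
>
>    ## quick sanity check: read only the first 1000 data rows
>    vmstat.head <- read.table('vmstat.txt', header = TRUE, as.is = TRUE,
>                              nrows = 1000)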
>
> Here is what it took to read in a file with 85K lines:
>
>> system.time(vmstat <- read.table('c:/vmstat.txt', header=TRUE))
>   user  system elapsed
>   2.01    0.01    2.03
>> str(vmstat)
> 'data.frame':   85680 obs. of  20 variables:
>  $ date    : Factor w/ 2 levels "07/27/05","07/28/05": 1 1 1 1 1 1 1 1 1 1 ...
>  $ time    : Factor w/ 2856 levels "00:00:26","00:00:56",..: 27 29 31 33 35 37 39 41 43 45 ...
>  $ r       : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ b       : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ w       : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ swap    : int  27755440 27755280 27753952 27755304 27755064 27753824 27754472 27754568 27754560 27754704 ...
>  $ free    : int  13051648 13051480 13051248 13051496 13051232 13040720 13027000 13027104 13027096 13027240 ...
>  $ re      : int  20 11 18 17 41 125 15 17 13 12 ...
>  $ mf      : int  86 53 88 85 278 1039 91 85 69 51 ...
>  $ pi      : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ po      : int  0 0 0 0 1 0 0 0 0 1 ...
>  $ fr      : int  0 0 0 0 1 0 0 0 0 1 ...
>  $ de      : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ sr      : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ intr    : int  456 399 424 430 452 664 432 416 425 432 ...
>  $ syscalls: int  2918 1722 1259 1029 2047 4097 1160 1058 1198 1727 ...
>  $ cs      : int  1323 1411 1254 1246 1386 1901 1273 1271 1268 1477 ...
>  $ user    : int  0 0 0 0 0 3 0 0 0 0 ...
>  $ sys     : int  1 1 1 1 1 2 1 1 1 1 ...
>  $ id      : int  99 99 99 99 99 95 99 99 99 99 ...
>>
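>
> For a rough sense of the in-memory footprint of the result (a sketch;
> object.size() lives in the utils package, attached by default):
>
>    print(object.size(vmstat), units = "Mb")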
>
>
> On Tue, Jan 19, 2010 at 9:25 AM, <nabble.30.miller_2555 at spamgourmet.com> wrote:
>>
>> I'm sure this has gotten some attention before, but I have two CSV
>> files generated from vmstat and free that are roughly 6-8 Mb (about
>> 80,000 lines) each. When I try to use read.csv(), R allocates all
>> available memory (about 4.9 Gb) when loading the files, which is over
>> 300 times the size of the raw data.  Here are the scripts used to
>> generate the CSV files as well as the R code:
>>
>> Scripts (run for roughly a 24-hour period):
>>    vmstat -ant 1 | awk '$0 !~ /(proc|free)/ {FS=" "; OFS=","; print strftime("%F %T %Z"),$6,$7,$12,$13,$14,$15,$16,$17;}' >> ~/vmstat_20100118_133845.o;
>>    free -ms 1 | awk '$0 ~ /Mem\:/ {FS=" "; OFS=","; print strftime("%F %T %Z"),$2,$3,$4,$5,$6,$7}' >> ~/memfree_20100118_140845.o;
>>
>> R code:
>>    infile.vms <- "~/vmstat_20100118_133845.o";
>>    infile.mem <- "~/memfree_20100118_140845.o";
>>    vms.colnames <- c("time","r","b","swpd","free","inact","active","si",
>>                      "so","bi","bo","in","cs","us","sy","id","wa","st");
>>    vms.colclass <- c("character",rep("integer",length(vms.colnames)-1));
>>    mem.colnames <- c("time","total","used","free","shared","buffers","cached");
>>    mem.colclass <- c("character",rep("integer",length(mem.colnames)-1));
>>    vmsdf <- read.csv(infile.vms, header=FALSE, colClasses=vms.colclass,
>>                      col.names=vms.colnames);
>>    memdf <- read.csv(infile.mem, header=FALSE, colClasses=mem.colclass,
>>                      col.names=mem.colnames);
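>>
>> As a quick diagnostic (a sketch; count.fields() flags ragged rows that
>> can inflate parsing, and nrows= caps the rows read so object.size()
>> can report the footprint of a known-size subset):
>>
>>    table(count.fields(infile.vms, sep=","));
>>    vms.test <- read.csv(infile.vms, header=FALSE, colClasses=vms.colclass,
>>                         col.names=vms.colnames, nrows=1000);
>>    print(object.size(vms.test), units="Mb");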
>>
>> I am running R v2.10.0 on a 64-bit machine running Fedora 10 (Linux
>> version 2.6.27.41-170.2.117.fc10.x86_64) with 6 Gb of memory. There
>> are no other significant programs running, and `rm()` followed by
>> `gc()` successfully frees the memory (followed by swap-ins as other
>> programs seek to use previously cached information swapped to disk).
>> I've incorporated the memory-saving suggestions from the `read.csv()`
>> manual page, excluding the limit on the lines read (which shouldn't
>> really be necessary here, since we're only talking about < 20 Mb of
>> raw data). Any suggestions, or is the read.csv() code known to have
>> memory leak/overcommit issues?
>>
>> Thanks
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>
>


