[R] R Memory Usage Concerns

Evan Klitzke evan at eklitzke.org
Tue Sep 15 05:11:23 CEST 2009


Hello all,

To start with, these measurements are on Linux with R 2.9.2 (64-bit
build) and Python 2.6 (also 64-bit).

I've been investigating R for some log file analysis that I've been
doing. I'm coming at this from the angle of a programmer who's
primarily worked in Python. As I've been playing around with R, I've
noticed that R seems to use a *lot* of memory, especially compared to
Python. Here's an example of what I'm talking about. I have a sample
data file whose characteristics are like this:

[evan at t500 ~]$ ls -lh 20090708.tab
-rw-rw-r-- 1 evan evan 63M 2009-07-08 20:56 20090708.tab

[evan at t500 ~]$ head 20090708.tab
spice 1247036405.04 0.0141088962555
spice 1247036405.01 0.046797990799
spice 1247036405.13 0.0137498378754
spice 1247036404.87 0.0594480037689
spice 1247036405.02 0.0170919895172
topic 1247036404.74 0.512196063995
user_details 1247036404.64 0.242133140564
spice 1247036405.23 0.0408620834351
biz_details 1247036405.04 0.40732884407
spice 1247036405.35 0.0501029491425

[evan at t500 ~]$ wc -l 20090708.tab
1797601 20090708.tab

So it's basically a CSV file (actually space-delimited) where every
line has three columns: a low-cardinality string, a double, and a
double. The file itself is 63M. Python can load all of the data from
the file really compactly (source for the script at the bottom of the
message):

[evan at t500 ~]$ python code/scratch/pymem.py
VIRT = 25230, RSS = 860
VIRT = 81142, RSS = 55825

So this shows that my Python process starts out at 860K of RSS memory
before doing any processing, and ends at 55M of RSS. That's pretty
good; it's actually smaller than the file itself, since a double can
be stored more compactly than its textual representation in the data
file.
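
To make that concrete, each timestamp in the file is about 13
characters of text, but only needs 8 bytes once it's parsed as a
double:

> nchar("1247036405.04")   # 13 bytes as text, versus 8 bytes as a double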

Since I'm new to R I didn't know how to read /proc and so forth, so
instead I launched an R repl and used ps to record the RSS memory
usage before and after running the following statement:

> tab <- read.table("~/20090708.tab")

The numbers I measured were:
VIRT = 176820, RSS = 26180   (just after starting the repl)
VIRT = 414284, RSS = 263708  (after executing the command)
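
For what it's worth, here's what I think I could run from inside R to
check the same thing; I'm just going by the docs here and haven't
verified that these correspond exactly to what ps reports:

> object.size(tab)   # approximate number of bytes used by the data frame
> gc()               # the "used"/"(Mb)" columns show R's total allocations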

This kind of concerns me. I can understand why R uses more memory at
startup, since it's launching a full repl, which my Python script
wasn't doing. But I would have expected the growth after loading the
data to be roughly in line with what Python showed. In fact, R ought
to be able to use less memory, since the first column is textual and
has low cardinality (I think 7 distinct values), so storing it as a
factor should be very memory-efficient.
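
Here's my back-of-the-envelope arithmetic for what I'd expect,
assuming the first column ends up as a factor (stored as 4-byte
integer codes) and the other two columns as 8-byte doubles:

> rows <- 1797601
> rows * (4 + 8 + 8) / 2^20   # roughly 34 MB of raw column data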

For the things that I want to use R for, I know I'll be processing
much larger datasets, and at the rate R is consuming memory it may not
be possible to fully load the data into memory. I'm concerned that it
may not be worth learning R if the data can be loaded into memory with
something like Python but not with R. Since I'm new to the language, I
may well be overlooking something. Can anyone answer for me:
 * What is R doing with all of that memory?
 * Is there something I did wrong? Is there a more memory-efficient
way to load this data? (One idea I had is sketched just after this list.)
 * Are there R modules that can store large datasets in a more
memory-efficient way? Can anyone relate their experiences with them?
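
The idea I mentioned in the second question, which I haven't actually
benchmarked, is telling read.table the column types up front so it
doesn't have to guess them, something like:

tab <- read.table("~/20090708.tab",
                  colClasses = c("factor", "numeric", "numeric"),
                  nrows = 1797601, comment.char = "")

I have no idea yet whether that changes the peak memory usage, though.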

For reference, here's the Python script I used to measure Python's memory usage:

import os

def show_mem():
	# Print the first two fields of /proc/<pid>/statm for this process:
	# total program size (VIRT) and resident set size (RSS).
	statm = open('/proc/%d/statm' % os.getpid()).read()
	print 'VIRT = %s, RSS = %s' % tuple(statm.split(' ')[:2])

def read_data(fname):
	# Parse each line into three parallel lists: servlet name (string),
	# timestamp (float), and elapsed time (float).
	servlets = []
	timestamps = []
	elapsed = []

	for line in open(fname, 'r'):
		s, t, e = line.strip().split(' ')
		servlets.append(s)
		timestamps.append(float(t))
		elapsed.append(float(e))

	# Report memory usage while all three lists are still alive.
	show_mem()

if __name__ == '__main__':
	show_mem()
	read_data('/home/evan/20090708.tab')


--
Evan Klitzke <evan at eklitzke.org> :wq



