[R] Huge data sets and RAM problems

kMan kchamberln at gmail.com
Thu Apr 22 05:30:59 CEST 2010


Perhaps records are being set to NULL somewhere (deleted, with the rest
shifted up). Or perhaps your system is susceptible to butterflies on the
other side of the world.

Your code may have 'worked' on a small section of data, but the data used
did not include all of the cases needed to fully test your code. So... test
your code!

scan(), used with 'nlines', 'skip', 'sep', and 'what', will cut your read
time by at least half while using less RAM to do it, handle most of your
post-processing, and give you something better to test your code with. Leave
out 'nlines', though, and you lose the time/memory benefits over
read.table(). 'skip' will get you "right to the point" just before where
things failed. That would be an interesting small segment of data to test
with.

WordPad can read your file (and then some). Eventually.

Sincerely,
KeithC.

-----Original Message-----
From: Stella Pachidi [mailto:stella.pachidi at gmail.com] 
Sent: Monday, April 19, 2010 2:07 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Huge data sets and RAM problems

Dear all,

This is the first time I am sending mail to the mailing list, so I hope I do
not make a mistake...

The last months I have been working on my MSc thesis project on performing
data mining techniques on user logs of a software-as-a-service application.
The main problem I am experiencing is how to process the huge amount of
data. More specifically:

I am using R 2.10.1 in a laptop with Windows 7 - 32bit system, 2GB RAM and
CPU Intel Core Duo 2GHz.

The user log data come from a Crystal report query (.rpt file), which I
transform with some Java code into a tab-separated file.

Although everything manages to run with a small subset of my data, when I
increase the data set I get several problems:

The first problem is with the use of read.delim(). When I try to read a
large amount of data (over 2,400,000 rows with 18 attributes per row), it
does not seem to turn the whole table into a data frame. In particular, the
data frame returned has only 1,220,987 rows.
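
One way to check whether the file itself contains the full 2,400,000
well-formed lines, or whether something (an embedded quote or a stray
delimiter, for example) cuts the read short, is to count lines and fields
directly. A minimal sketch, with "file.txt" standing in for the real file:

n.fields <- count.fields("file.txt", sep = "\t",
                         quote = "", comment.char = "")
length(n.fields)  # physical data lines actually in the file
table(n.fields)   # each line should have 18 fields; other counts flag bad rows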

Furthermore, as one of the data attributes is a DateTime, when I try to
split this column into two columns (one with the Date and one with the
Time), the returned result is quite strange, as the two new columns appear
to have more rows than the data frame:

applicLog.dat <- read.delim("file.txt")
# Process the syscreated column (Date time --> Date + Time)
copyDate <- applicLog.dat[["ï..syscreated"]]
copyDate <- as.character(copyDate)
splitDate <- strsplit(copyDate, " ")
splitDate <- unlist(splitDate)
splitDateIndex <- c(1:length(splitDate))
sysCreatedDate <- splitDate[splitDateIndex %% 2 == 1]
sysCreatedTime <- splitDate[splitDateIndex %% 2 == 0]
sysCreatedDate <- strptime(sysCreatedDate, format = "%Y-%m-%d")
op <- options(digits.secs = 3)
sysCreatedTime <- strptime(sysCreatedTime, format = "%H:%M:%OS")
applicLog.dat[["ï..syscreated"]] <- NULL
applicLog.dat <- cbind(sysCreatedDate, sysCreatedTime, applicLog.dat)

Then I get the error: Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 1221063, 1221062, 1220987
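
A minimal sketch of a row-preserving alternative (reusing the column name
from the code above; an illustration, not the original approach): extracting
the date and the time with sub() keeps both vectors exactly as long as the
data frame, even when a value is empty or has no time part.

copyDate <- as.character(applicLog.dat[["ï..syscreated"]])
# everything before the first space is the date, everything after it the time
sysCreatedDate <- strptime(sub(" .*$", "", copyDate), format = "%Y-%m-%d")
op <- options(digits.secs = 3)
sysCreatedTime <- strptime(sub("^[^ ]* ", "", copyDate), format = "%H:%M:%OS")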


Finally, another problem I have is when I perform association mining on the
data set using the package arules: I turn the data frame into a transactions
table and then run the apriori algorithm. When I set the support too low, in
order to find the rules I need, the vector of rules becomes too big and I
get memory problems such as:

Error: cannot allocate vector of size 923.1 Mb
In addition: Warning messages:
1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)
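
The help(memory.size) pointer leads to memory.limit(), which on Windows can
raise the allocation cap, although a 32-bit process stays limited to roughly
2-3 GB no matter what. A sketch only (the 2047 MB figure is an assumption
about a default 32-bit setup):

memory.limit()             # current cap, in MB
memory.limit(size = 2047)  # request more; roughly the 32-bit default ceiling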

Could you please help me with how to allocate more RAM? Or do you think
there is a way to process the data from the file on disk instead of loading
it all into RAM? Do you know how I could manage to read my whole data set?

I would really appreciate your help.

Kind regards,
Stella Pachidi

PS: Do you know any text editor that can read huge .txt files?





--
Stella Pachidi
Master in Business Informatics student
Utrecht University
