[R] Huge data sets and RAM problems

Mon Apr 19 22:07:03 CEST 2010

Dear all,

This is the first time I am sending mail to the mailing list, so I
hope I do not make a mistake...

For the last few months I have been working on my MSc thesis project,
which applies data mining techniques to the user logs of a
software-as-a-service application. The main problem I am experiencing
is how to process the huge amount of data. More specifically:

I am using R 2.10.1 on a laptop running 32-bit Windows 7, with 2 GB of
RAM and a 2 GHz Intel Core Duo CPU.

The user log data come from a Crystal report query (.rpt file), which
I convert with some Java code into a tab-separated file.

Everything runs with a small subset of my data, but when I increase
the size of the data set I run into several problems:

The first problem is with the use of read.delim(). When I try to read
a large amount of data (over 2,400,000 rows with 18 attributes per
row), it does not seem to convert the whole table into a data frame.
In particular, the data frame returned has only 1,220,987 rows.
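
A check that might narrow this down, sketched under the assumption
(not confirmed for my file) that stray quote characters in some text
fields make read.delim() merge several lines into one: count the
physical lines, count the fields per line with quoting disabled, and
re-read the file with quote = "". The file name and its contents below
are made up for illustration.

```r
# Hypothetical stand-in for the real log file: 5 data rows, one stray quote.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("syscreated\tuser\taction",
             "2010-04-01 10:15:30.123\tu1\topen",
             "2010-04-01 10:16:02.500\tu2\tsearch \"report",
             "2010-04-01 10:17:45.000\tu3\tclose",
             "2010-04-01 10:18:00.250\tu1\topen",
             "2010-04-01 10:19:10.750\tu2\tclose"),
           demo_file)

# Number of physical lines in the file (header included).
# Note: readLines() loads the whole file, which may be too much for the real data.
n_lines <- length(readLines(demo_file))

# Fields per line with quoting and comments disabled; every line should report 3.
n_fields <- count.fields(demo_file, sep = "\t", quote = "", comment.char = "")
table(n_fields)

# Re-read with quoting and comment handling switched off.
demo.dat <- read.delim(demo_file, quote = "", comment.char = "",
                       stringsAsFactors = FALSE)
nrow(demo.dat) == n_lines - 1   # one row per data line
```

If the row count matches the line count only with quote = "", quoting
would be the likely cause; if not, the field counts per line may show
where the file is malformed.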

Furthermore, one of the data attributes is a DateTime. When I try to
split this column into two columns (one with the date and one with the
time), the result is quite strange: the two new columns appear to have
more rows than the data frame:

applicLog.dat <- read.delim("file.txt")
# Process the syscreated column (date-time --> date + time)
copyDate <- as.character(applicLog.dat[["ï..syscreated"]])
# Split every value on the space and flatten all pieces into one vector
splitDate <- unlist(strsplit(copyDate, " "))
splitDateIndex <- seq_along(splitDate)
# Odd positions are taken as dates, even positions as times
sysCreatedDate <- splitDate[splitDateIndex %% 2 == 1]
sysCreatedTime <- splitDate[splitDateIndex %% 2 == 0]
sysCreatedDate <- strptime(sysCreatedDate, format = "%Y-%m-%d")
op <- options(digits.secs = 3)  # show milliseconds when times are printed
sysCreatedTime <- strptime(sysCreatedTime, format = "%H:%M:%OS")
# Replace the original column with the two new ones
applicLog.dat[["ï..syscreated"]] <- NULL
applicLog.dat <- cbind(sysCreatedDate, sysCreatedTime, applicLog.dat)

Then I get the error:
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 1221063, 1221062, 1220987
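
The counts in the error add up to 2,442,125 pieces for 1,220,987 rows,
which is 151 more than two pieces per row, so some values apparently
do not split into exactly one date and one time. A sketch of a split
that keeps one result per row whatever the value looks like; the
helper name split_datetime and the sample values are hypothetical, and
the time is kept as text here rather than parsed:

```r
# Hypothetical helper: split "date time" strings without changing the row count.
split_datetime <- function(x) {
  x <- as.character(x)
  # Text before the first space is the date part; the rest is the time part.
  date_part <- sub(" .*$", "", x)
  time_part <- sub("^[^ ]* ?", "", x)
  data.frame(
    sysCreatedDate = as.Date(date_part, format = "%Y-%m-%d"),
    sysCreatedTime = time_part,
    stringsAsFactors = FALSE
  )
}

# Made-up values, including malformed ones that break the odd/even approach.
stamps <- c("2010-04-01 10:15:30.123", "2010-04-02", "", NA,
            "2010-04-03 11:00:00.000 extra")
parts <- split_datetime(stamps)
nrow(parts) == length(stamps)   # one row per input value
```

Rows where sysCreatedDate is NA, or where sysCreatedTime contains a
space, could then be inspected to see which raw values are malformed.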

Finally, I have a problem when I perform association rule mining on
the data set with the arules package: I convert the data frame into a
transactions table and then run the apriori algorithm. When I set the
support low enough to find the rules I need, the vector of rules
becomes too big and I run into memory problems such as:
Error: cannot allocate vector of size 923.1 Mb
In addition: Warning messages:
1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)

Could you please tell me how I could allocate more RAM to R? Or do
you think there is a way to process the data by keeping them in a
file on disk instead of loading everything into RAM? And do you know
how I could manage to read my whole data set?
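
To make the second question concrete, this is roughly what I have in
mind: read the tab-separated file in blocks through a connection and
keep only a small result per block. It is only a sketch in base R; the
names process_in_chunks and chunk_size are hypothetical, the demo file
is made up, and quote = "" is an assumption about my data.

```r
# Hypothetical helper: apply fun() to successive blocks of a tab-separated file.
process_in_chunks <- function(path, chunk_size, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # The first line holds the column names.
  header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]
  results <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached
    tc <- textConnection(lines)
    chunk <- read.delim(tc, header = FALSE, col.names = header,
                        quote = "", stringsAsFactors = FALSE)
    close(tc)
    results[[length(results) + 1]] <- fun(chunk)
  }
  results
}

# Made-up file with 10 data rows, processed 4 rows at a time.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("id\taction", paste(1:10, "open", sep = "\t")), demo_file)
rows_per_chunk <- process_in_chunks(demo_file, chunk_size = 4, fun = nrow)
sum(unlist(rows_per_chunk))   # total number of rows seen across all chunks
```

Only one block is held in memory at a time, so this would suit
per-block summaries or filtering, but not steps that need the whole
table at once.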

I would really appreciate your help.

Kind regards,
Stella Pachidi

PS: Do you know of any text editor that can open huge .txt files?

--
Stella Pachidi
Master in Business Informatics student
Utrecht University