[R] Large Data Set Help

Roland Rau roland.rproject at gmail.com
Mon Aug 25 22:47:16 CEST 2008


Hi,

Jason Thibodeau wrote:
> I am attempting to perform some simple data manipulation on a large data
> set. I have a snippet of the whole data set, and my small snippet is 2GB in
> CSV.
> 
> Is there a way I can read my csv, select a few columns, and write it to an
> output file in real time? This is what I do right now to a small test file:
> 
> data <- read.csv('data.csv', header = FALSE)
> 
> data_filter <- data[c(1,3,4)]
> 
> write.table(data_filter, file = "filter_data.csv", sep = ",", row.names =
> FALSE, col.names = FALSE)

In this case, I think R is not the best tool for the job. I would rather 
suggest using an implementation of the awk language (e.g. gawk).
I just tried the following on WinXP with a zipped file (87MB zipped, 1.2GB 
unzipped) piped into gawk:
unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt
It took about 90 seconds.

Please note that you might need to specify your delimiters, i.e. the field 
separator (FS) and the output field separator (OFS). Set them in a BEGIN 
block so that they already apply to the first line =>
gawk 'BEGIN {FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv
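Before running this on the full 2GB file, it may be worth checking the column selection on a tiny file. The following is only a sketch under a few assumptions: a POSIX shell, plain `awk` (substitute `gawk` if that is the name on your system), and made-up file names (`sample.csv`, `sample_filtered.csv`) and contents. It also assumes no field contains a quoted comma, since splitting on "," does not handle those.

```shell
# Hypothetical two-line sample standing in for data.csv
printf '%s\n' 'a,b,c,d' '1,2,3,4' > sample.csv

# FS and OFS are set in BEGIN, i.e. before the first record is read,
# so line 1 is split on commas just like every later line.
awk 'BEGIN { FS = ","; OFS = "," } { print $1, $3, $4 }' sample.csv > sample_filtered.csv

# Show the result: columns 1, 3 and 4 of each input line
cat sample_filtered.csv
```

awk processes the input one line at a time, so memory use should not grow with the size of the file.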

I hope this helps (despite not encouraging the usage of R),
Roland


