[R] handling a big data set in R
shu zhang
szhang.r at gmail.com
Mon Mar 3 06:35:01 CET 2008
Hello R users,
I'm wondering whether it is possible to manage a big data set in R. I
have a data set with 3 million rows and 3 columns (X, Y, Z), where X is
the group id. For each value of X, I need to run 2 regressions on the
corresponding submatrix.
I used the function "split":
datamatrix<-read.csv("datas.csv", header=F, sep=",")
dim(datamatrix)
# [1] 2980523 3
names(datamatrix)<-c("X","Y","Z")
attach(datamatrix)
subX<-split(X, X)
subY<-split(Y,X)
subZ<-split(Z,X)
n<-length(subX)  ### number of groups
s1<-s2<-rep(NA, n)  ### vectors to store the regression slopes
for (i in 1:n){
a<-table(subY[[i]])
table.x<-as.numeric(names(a))
table.y<-as.numeric(a)
fit1<-lm(table.y~table.x)  ##### find the slope of the histogram of y
s1[i]<-fit1$coefficients[2]
fit2<-lm(subY[[i]]~subZ[[i]])  ####### regress y on z
s2[i]<-fit2$coefficients[2]
}
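
In case it helps, here is a cleaner sketch of the same computation,
splitting the data frame once instead of splitting each column
separately (just a sketch: it assumes every group has at least two
distinct Y values, and I don't know whether it scales to 3 million
rows):

groups <- split(datamatrix, datamatrix$X)   # one sub-data-frame per X

slope <- function(fit) unname(coef(fit)[2]) # second coefficient = slope

## slope of the histogram of y, per group
s1 <- vapply(groups, function(d) {
    a <- table(d$Y)
    slope(lm(as.numeric(a) ~ as.numeric(names(a))))
}, numeric(1))

## slope of y regressed on z, per group
s2 <- vapply(groups, function(d) slope(lm(Y ~ Z, data = d)),
             numeric(1))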
But my R session died before completing the loop... (I've thought about
doing it in SAS, but I don't know how to write a loop combined with
PROC REG...)
One thing that might be helpful is that my data set has already been
sorted by X. I don't know whether this is of any help for managing the
data set.
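
Since the file is already sorted by X, maybe it could be read and
processed in chunks so the whole data set never sits in memory at once?
A rough sketch (the chunk size is arbitrary, and groups that straddle a
chunk boundary would need to be carried over to the next chunk, which I
have not written out):

con <- file("datas.csv", open = "r")
chunk.size <- 100000
repeat {
    chunk <- tryCatch(
        read.csv(con, header = FALSE, nrows = chunk.size,
                 col.names = c("X", "Y", "Z")),
        error = function(e) NULL)        # NULL signals end of file
    if (is.null(chunk)) break
    ## ... fit the two regressions for each complete group in 'chunk' ...
    if (nrow(chunk) < chunk.size) break  # last (short) chunk read
}
close(con)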
Any suggestion would be appreciated!
Thanks!
-Shu