[R] correlation between rows of data.frame

Eleni Rapsomaniki e.rapsomaniki at mail.cryst.bbk.ac.uk
Fri Aug 1 20:45:09 CEST 2008


Dear R users,

I need to come up with an efficient method to compute the correlation (or at
least the Euclidean distance, if that's easier) between specific rows in a data
frame (46,232 rows, 29 columns). The pairs of rows between which I want to
find the correlation share a common value in one of the columns. So, for
example, in the following:
x=data.frame(id=rep(sample(1:100000,size=10000),2),
             a=sample(c(NA,rnorm(10,0,1)),size=10000,replace=TRUE),
             b=sample(c(NA,rnorm(10,0,1)),size=10000,replace=TRUE),
             c=sample(c(NA,rnorm(10,0,1)),size=10000,replace=TRUE))
x$id=factor(x$id)

I would want to compute the correlation between the two rows (for cols a, b, c)
that share the same id. Using a for loop and dist() works but takes a long time
(>1 hour; my RAM is 1 GB):
p=NULL
for(i in levels(x$id)){
  p[[i]]=dist(x[x$id==i, -1])
}

Is there a more efficient way? I thought about apply/sapply, etc., but I don't
think they'll work for rows, and I can't think of an intelligent way to make
them work!
The second problem is that I also need to know how many degrees of freedom
(i.e., non-missing pairs of values) were used in each correlation. Is there a
way to also do this efficiently?
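
For illustration, the computation described above could be sketched without a
loop roughly as follows. This is only a minimal sketch under stated
assumptions, not a tested solution for the full data set: it assumes every id
occurs exactly twice, it draws 2 * n values per column (unlike the example
above) so that the two rows of a pair differ, and the names draw, first,
second, n_pairs, and result are made up for the illustration.

```r
# Simulated data in the spirit of the example above; sizes are illustrative.
set.seed(1)
n <- 1000
draw <- function() sample(c(NA, rnorm(10, 0, 1)), size = 2 * n, replace = TRUE)
x <- data.frame(id = rep(sample(1:100000, size = n), 2),
                a = draw(), b = draw(), c = draw())

# Sort by id so the two rows of each pair are adjacent, then split them.
ord     <- order(x$id)
vals    <- as.matrix(x[ord, -1])
odd     <- seq(1, nrow(vals), by = 2)
first   <- vals[odd, , drop = FALSE]
second  <- vals[odd + 1, , drop = FALSE]
pair_id <- x$id[ord][odd]

# Keep only the positions where both rows of a pair have a value.
ok <- !is.na(first) & !is.na(second)
first[!ok]  <- NA
second[!ok] <- NA
n_pairs <- rowSums(ok)   # non-missing pairs of values used for each id

# Euclidean distance, scaled up for missing columns the way dist() does.
euclid <- sqrt(rowSums((first - second)^2, na.rm = TRUE) * ncol(vals) / n_pairs)

# Pearson correlation between the two rows of each pair.
cf <- first  - rowMeans(first,  na.rm = TRUE)
cs <- second - rowMeans(second, na.rm = TRUE)
r  <- rowSums(cf * cs, na.rm = TRUE) /
  sqrt(rowSums(cf^2, na.rm = TRUE) * rowSums(cs^2, na.rm = TRUE))
r[n_pairs < 2] <- NA     # correlation is undefined with fewer than 2 pairs

result <- data.frame(id = pair_id, r = r, euclid = euclid, n_pairs = n_pairs)
```

The idea is to replace the per-id subsetting with whole-matrix arithmetic
(rowSums and rowMeans), so each quantity is computed for all pairs at once.
Pairs with no variation in one of the rows give an undefined correlation
(NaN), so those entries would still need handling.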

I hope this makes sense! Thank you all very much in advance!

Eleni


