[R] correlation between rows of data.frame

jim holtman jholtman at gmail.com
Sat Aug 2 11:17:49 CEST 2008


You can convert to a numeric matrix to make the operations faster:

>  x=data.frame(id=rep(sample(1:100000,size=10000),2),a=sample(c(NA,rnorm(10,0,1)),size=20000,
+ replace=T),b=sample(c(NA,rnorm(10,0,1)),size=20000,
+ replace=T),c=sample(c(NA,rnorm(10,0,1)),size=20000, replace=T))
> x$id=factor(x$id)
> # convert the numeric data for the 'dist' to a matrix for faster processing
> x.m <- cbind(x$a, x$b, x$c)
> # create a list of indices of each 'id'
> x.i <- split(seq(nrow(x)), x$id)
> system.time(x.d <- sapply(x.i, function(z) dist(x.m[z,])))
   user  system elapsed
   1.56    0.00    1.67
>
>
> str(x.d)
 Named num [1:10000] 1.333 2.522 1.566 3.822 0.905 ...
 - attr(*, "names")= chr [1:10000] "15" "28" "37" "78" ...
>


On Fri, Aug 1, 2008 at 2:45 PM, Eleni Rapsomaniki
<e.rapsomaniki at mail.cryst.bbk.ac.uk> wrote:
> Dear R users,
>
> I need to come up with an efficient method to compute the correlation (or at
> least, the euclidean distance if that's easier) between specific rows in a data
> frame (46,232 rows,    29 columns). The pairs of rows between which I want to
> find the correlation share a common value in one of the columns. So for
> example,
> in the following
>  x=data.frame(id=rep(sample(1:100000,size=10000),2),a=sample(c(NA,rnorm(10,0,1)),size=10000,
> replace=T),b=sample(c(NA,rnorm(10,0,1)),size=10000,
> replace=T),c=sample(c(NA,rnorm(10,0,1)),size=10000, replace=T))
> x$id=factor(x$id)
>
> I would want to compute the correlation between the two rows (for cols a,b,c)
> that share the same
> id. Using a for loop and dist() works but takes a long time (>1 hour, my RAM is
> 1Gb):
> p=NULL
>  for(i in levels(x$id)){p[[i]]=dist(x[x$id==i, -1])}
>
> Is there a more efficient way? I thought about apply/sapply etc but I don't
> think they'll work for rows and can't think of an intelligent way to make them
> work!
> The second problem is that I also need to know how many degrees of freedom (ie
> non missing pairs of values) were used in each correlation. Is there a way to
> also do this efficiently?
>
> I hope this makes sense! Thank you all very much in advance!
>
> Eleni
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



More information about the R-help mailing list