[R] Pearson corelation and p-value for matrix

Tue Apr 19 02:33:07 CEST 2005

> From: John Fox 
> 
> Dear Andy,
> 
> Very nice! (My point was that if this is a one-time thing, for Dren to
> puzzle over it is probably more time-consuming than simply doing it
> inefficiently.)

There's a memory/speed tradeoff.  The loop avoids creating multiple
6000x6000 matrices that my version would require, and one of those would
take up 274MB!  I could not run my function on a 6000x6000 case on my laptop
(nor rcorr).  The example I showed was 2000x2000.  Also, as Frank pointed
out, it's not flexible.  As John said, it's good for one-time use.  For more
general consumption, one would want to write more flexible and robust code,
in which case one might very well re-invent rcorr...

Cheers,
Andy

> Regards,
>  John
> 
> --------------------------------
> John Fox
> Department of Sociology
> McMaster University
> Hamilton, Ontario
> Canada L8S 4M4
> 905-525-9140x23604
> http://socserv.mcmaster.ca/jfox 
> -------------------------------- 
> 
> > -----Original Message-----
> > From: Liaw, Andy [mailto:andy_liaw at merck.com] 
> > Sent: Monday, April 18, 2005 2:40 PM
> > To: 'John Fox'; 'Dren Scott'
> > Cc: 'R-Help'
> > Subject: RE: [R] Pearson corelation and p-value for matrix
> > 
> > I believe this will do:
> > 
> > cor.pval2 <- function(x,  alternative="two-sided") {
> >     corMat <- cor(x, use=if (any(is.na(x))) 
> > "pairwise.complete.obs" else
> > "all")
> >     df <- crossprod(!is.na(x)) - 2
> >     STATISTIC <- sqrt(df) * corMat / sqrt(1 - corMat^2)
> >     p <- pt(STATISTIC, df)
> >     p <- if (alternative == "less") {
> >         p
> >     } else if (alternative == "greater") {
> >         1 - p
> >     } else 2 * pmin(p, 1 - p)
> >     p
> > }
> > 
> > Some test:
> > 
> > > set.seed(1)
> > > x <- matrix(runif(2e3 * 1e2), 2e3)
> > > system.time(res1 <- cor.pval(t(x)), gcFirst=TRUE)
> > [1] 17.28  0.77 18.16    NA    NA
> > > system.time(res2 <- cor.pval2(t(x)), gcFirst=TRUE)
> > [1] 19.51  1.05 20.70    NA    NA
> > > max(abs(res1 - res2))
> > [1] 0
> > > x[c(1, 3, 6), c(2, 4)] <- NA
> > > x[30, 61] <- NA
> > > system.time(res2 <- cor.pval2(t(x)), gcFirst=TRUE)
> > [1] 24.48  0.71 25.28    NA    NA
> > 
> > This is a bit slower because of the extra computation for 
> > "df".  One can try to save some computation by only computing 
> > with the lower (or upper) triangular part.
> > 
> > Cheers,
> > Andy
> > 
> > > From: John Fox
> > > 
> > > Dear Dren,
> > > 
> > > Since cor(), on which Andy's solution is based, can compute 
> > > pairwise-present correlations, you could adapt his function 
> > -- you'll 
> > > have to adjust the df for each pair. Alternatively, you 
> > could probably 
> > > save some time (programming time + computer time) by just 
> using my 
> > > solution:
> > > 
> > > > R <- diag(100)
> > > > R[upper.tri(R)] <- R[lower.tri(R)] <- .5
> > > > library(mvtnorm)
> > > > X <- rmvnorm(6000, sigma=R)
> > > > system.time(for (i in 1:50) cor.pvalues(X), gc=TRUE)
> > > [1] 518.19   1.11 520.23     NA     NA
> > > 
> > > I know that time is money, but nine minutes (on my machine) 
> > probably 
> > > won't bankrupt anyone.
> > > 
> > > Regards,
> > >  John
> > > 
> > > --------------------------------
> > > John Fox
> > > Department of Sociology
> > > McMaster University
> > > Hamilton, Ontario
> > > Canada L8S 4M4
> > > 905-525-9140x23604
> > > http://socserv.mcmaster.ca/jfox
> > > --------------------------------
> > > 
> > > > -----Original Message-----
> > > > From: r-help-bounces at stat.math.ethz.ch 
> > > > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of 
> Dren Scott
> > > > Sent: Monday, April 18, 2005 12:41 PM
> > > > To: 'R-Help'
> > > > Subject: RE: [R] Pearson corelation and p-value for matrix
> > > > 
> > > > Hi all,
> > > >  
> > > > Thanks Andy, Mark and John for all the help. I really 
> > appreciate it. 
> > > > I'm new to both R and statistics, so please excuse any 
> > gaffes on my 
> > > > part.
> > > >  
> > > > Essentially what I'm trying to do, is to evaluate for 
> > each row, how 
> > > > many other rows would have a p-value < 0.05. So, after I 
> > get my N x 
> > > > N p-value matrix, I'll just filter out values that are > 0.05.
> > > >  
> > > > Each of my datasets (6000 rows x 100 columns) would 
> > consist of some 
> > > > NA's. The iterative procedure (cor.pvalues) proposed by 
> > John would 
> > > > yield the values, but it would take an inordinately 
> long time (I 
> > > > have 50 of these datasets to process). The solution proposed by 
> > > > Andy, although fast, would not be able to incorporate the NA's.
> > > >  
> > > > Is there any workaround for the NA's? Or possibly do 
> you think I 
> > > > could try something else?
> > > >  
> > > > Thanks very much. 
> > > >  
> > > > Dren
> > > > 
> > > > 
> > > > "Liaw, Andy" <andy_liaw at merck.com> wrote:
> > > > > From: John Fox
> > > > > 
> > > > > Dear Andy,
> > > > > 
> > > > > That's clearly much better -- and illustrates an
> > > effective strategy
> > > > > for vectorizing (or "matricizing") a computation. I think
> > > I'll add
> > > > > this to my list of programming examples. It might be a little 
> > > > > dangerous to pass ...
> > > > > through to cor(), since someone could specify
> > > type="spearman", for
> > > > > example.
> > > > 
> > > > Ah, yes, the "..." isn't likely to help here! Also, it 
> will only 
> > > > work correctly if there are no NA's, for example (or else 
> > the degree 
> > > > of freedom would be wrong).
> > > > 
> > > > Best,
> > > > Andy
> > > > 
> > > > > Thanks,
> > > > > John
> > > > > 
> > > > > --------------------------------
> > > > > John Fox
> > > > > Department of Sociology
> > > > > McMaster University
> > > > > Hamilton, Ontario
> > > > > Canada L8S 4M4
> > > > > 905-525-9140x23604
> > > > > http://socserv.mcmaster.ca/jfox
> > > > > --------------------------------
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Liaw, Andy [mailto:andy_liaw at merck.com]
> > > > > > Sent: Friday, April 15, 2005 9:51 PM
> > > > > > To: 'John Fox'; MSchwartz at medanalytics.com
> > > > > > Cc: 'R-Help'; 'Dren Scott'
> > > > > > Subject: RE: [R] Pearson corelation and p-value for matrix
> > > > > > 
> > > > > > We can be a bit sneaky and `borrow' code from 
> > cor.test.default:
> > > > > > 
> > > > > > cor.pval <- function(x, alternative="two-sided", ...) {
> > > corMat <-
> > > > > > cor(x, ...) n <- nrow(x) df <- n - 2 STATISTIC <-
> > > > sqrt(df) * corMat
> > > > > > / sqrt(1 - corMat^2) p <- pt(STATISTIC, df) p <- if
> > > > (alternative ==
> > > > > > "less") { p } else if (alternative == "greater") {
> > > > > > 1 - p
> > > > > > } else 2 * pmin(p, 1 - p)
> > > > > > p
> > > > > > }
> > > > > > 
> > > > > > The test:
> > > > > > 
> > > > > > > system.time(c1 <- cor.pvals(X), gcFirst=TRUE)
> > > > > > [1] 13.19 0.01 13.58 NA NA
> > > > > > > system.time(c2 <- cor.pvalues(X), gcFirst=TRUE)
> > > > > > [1] 6.22 0.00 6.42 NA NA
> > > > > > > system.time(c3 <- cor.pval(X), gcFirst=TRUE)
> > > > > > [1] 0.07 0.00 0.07 NA NA
> > > > > > 
> > > > > > Cheers,
> > > > > > Andy
> > > > > > 
> > > > > > > From: John Fox
> > > > > > > 
> > > > > > > Dear Mark,
> > > > > > > 
> > > > > > > I think that the reflex of trying to avoid loops in R
> > > is often
> > > > > > > mistaken, and so I decided to try to time the two
> > > > > approaches (on a
> > > > > > > 3GHz Windows XP system).
> > > > > > > 
> > > > > > > I discovered, first, that there is a bug in your
> > > > function -- you
> > > > > > > appear to have indexed rows instead of columns; 
> fixing that:
> > > > > > > 
> > > > > > > cor.pvals <- function(mat)
> > > > > > > {
> > > > > > > cols <- expand.grid(1:ncol(mat), 1:ncol(mat))
> > > > matrix(apply(cols,
> > > > > > > 1,
> > > > > > > function(x) cor.test(mat[, x[1]], mat[,
> > > x[2]])$p.value), ncol =
> > > > > > > ncol(mat)) }
> > > > > > > 
> > > > > > > 
> > > > > > > My function is cor.pvalues and yours cor.pvals. This is
> > > > > for a data
> > > > > > > matrix with 1000 observations on 100 variables:
> > > > > > > 
> > > > > > > > R <- diag(100)
> > > > > > > > R[upper.tri(R)] <- R[lower.tri(R)] <- .5
> > > > > > > > library(mvtnorm)
> > > > > > > > X <- rmvnorm(1000, sigma=R)
> > > > > > > > dim(X)
> > > > > > > [1] 1000 100
> > > > > > > > 
> > > > > > > > system.time(cor.pvalues(X))
> > > > > > > [1] 5.53 0.00 5.53 NA NA
> > > > > > > > 
> > > > > > > > system.time(cor.pvals(X))
> > > > > > > [1] 12.66 0.00 12.66 NA NA
> > > > > > > > 
> > > > > > > 
> > > > > > > I frankly didn't expect the advantage of my approach to be
> > > > > > this large,
> > > > > > > but there it is.
> > > > > > > 
> > > > > > > Regards,
> > > > > > > John
> > > > > > > 
> > > > > > > -------------------------------- John Fox Department of 
> > > > > > > Sociology McMaster University Hamilton, Ontario 
> > Canada L8S 4M4
> > > > > > > 905-525-9140x23604
> > > > > > > http://socserv.mcmaster.ca/jfox
> > > > > > > --------------------------------
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Marc Schwartz [mailto:MSchwartz at MedAnalytics.com]
> > > > > > > > Sent: Friday, April 15, 2005 7:08 PM
> > > > > > > > To: John Fox
> > > > > > > > Cc: 'Dren Scott'; R-Help
> > > > > > > > Subject: RE: [R] Pearson corelation and p-value 
> for matrix
> > > > > > > > 
> > > > > > > > Here is what might be a slightly more efficient way to
> > > > > > get to John's
> > > > > > > > question:
> > > > > > > > 
> > > > > > > > cor.pvals <- function(mat)
> > > > > > > > {
> > > > > > > > rows <- expand.grid(1:nrow(mat), 1:nrow(mat)) 
> > > > matrix(apply(rows, 
> > > > > > > > 1,
> > > > > > > > function(x) cor.test(mat[x[1], ], mat[x[2], 
> > > > ])$p.value), ncol = 
> > > > > > > > nrow(mat)) }
> > > > > > > > 
> > > > > > > > HTH,
> > > > > > > > 
> > > > > > > > Marc Schwartz
> > > > > > > > 
> > > > > > > > On Fri, 2005-04-15 at 18:26 -0400, John Fox wrote:
> > > > > > > > > Dear Dren,
> > > > > > > > > 
> > > > > > > > > How about the following?
> > > > > > > > > 
> > > > > > > > > cor.pvalues <- function(X){
> > > > > > > > > nc <- ncol(X)
> > > > > > > > > res <- matrix(0, nc, nc)
> > > > > > > > > for (i in 2:nc){
> > > > > > > > > for (j in 1:(i - 1)){
> > > > > > > > > res[i, j] <- res[j, i] <- cor.test(X[,i],
> > > > > > > X[,j])$p.value
> > > > > > > > > }
> > > > > > > > > }
> > > > > > > > > res
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > What one then does with all of those 
> > non-independent test
> > > > > > > > is another
> > > > > > > > > question, I guess.
> > > > > > > > > 
> > > > > > > > > I hope this helps,
> > > > > > > > > John
> > > > > > > > 
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: r-help-bounces at stat.math.ethz.ch 
> > > > > > > > > > [mailto:r-help-bounces at stat.math.ethz.ch] 
> On Behalf Of
> > > > > > > Dren Scott
> > > > > > > > > > Sent: Friday, April 15, 2005 4:33 PM
> > > > > > > > > > To: r-help at stat.math.ethz.ch
> > > > > > > > > > Subject: [R] Pearson corelation and p-value 
> for matrix
> > > > > > > > > > 
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > I was trying to evaluate the pearson correlation and
> > > > > > > the p-values
> > > > > > > > > > for an nxm matrix, where each row represents 
> > a vector. 
> > > > > > > > One way to do
> > > > > > > > > > it would be to iterate through each row, 
> and find its
> > > > > > > correlation
> > > > > > > > > > value( and the p-value) with respect to the 
> > other rows. 
> > > > > > > Is there
> > > > > > > > > > some function by which I can use the matrix 
> as input? 
> > > > > > > > Ideally, the
> > > > > > > > > > output would be an nxn matrix, containing 
> the p-values
> > > > > > > > between the
> > > > > > > > > > respective vectors.
> > > > > > > > > > 
> > > > > > > > > > I have tried cor.test for the iterations, but
> > > > > couldn't find a
> > > > > > > > > > function that would take the matrix as input.
> > > > > > > > > > 
> > > > > > > > > > Thanks for the help.
> > > > > > > > > > 
> > > > > > > > > > Dren
> > > > > > > > 
> > > > > > > >
> > > > > > > 
> > > > > > > ______________________________________________
> > > > > > > R-help at stat.math.ethz.ch mailing list 
> > > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > > PLEASE do read the posting guide! 
> > > > > > > http://www.R-project.org/posting-guide.html
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> --------------------------------------------------------------
> > > > > > ----------------
> > > > > > Notice: This e-mail message, together with any 
> > > > attachments, contains 
> > > > > > information of Merck & Co., Inc. (One Merck Drive, 
> Whitehouse 
> > > > > > Station, New Jersey, USA 08889), and/or its affiliates 
> > > > (which may be 
> > > > > > known outside the United States as Merck Frosst, Merck 
> > > > Sharp & Dohme 
> > > > > > or MSD and in Japan, as
> > > > > > Banyu) that may be confidential, proprietary 
> > copyrighted and/or 
> > > > > > legally privileged. It is intended solely for the 
> use of the 
> > > > > > individual or entity named on this message. If you 
> > are not the 
> > > > > > intended recipient, and have received this message in 
> > > > error, please 
> > > > > > notify us immediately by reply e-mail and then delete it 
> > > > from your 
> > > > > > system.
> > > > > > 
> --------------------------------------------------------------
> > > > > > ----------------
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > --------------------------------------------------------------
> > > > ----------------
> > > > 
> > > > --------------------------------------------------------------
> > > > ----------------
> > > > 
> > > > 		
> > > > ---------------------------------
> > > > 
> > > > 
> > > > 	[[alternative HTML version deleted]]
> > > > 
> > > > ______________________________________________
> > > > R-help at stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide! 
> > > > http://www.R-project.org/posting-guide.html
> > > 
> > > ______________________________________________
> > > R-help at stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide! 
> > > http://www.R-project.org/posting-guide.html
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > --------------------------------------------------------------
> > ----------------
> > Notice:  This e-mail message, together with any attachments, 
> > contains information of Merck & Co., Inc. (One Merck Drive, 
> > Whitehouse Station, New Jersey, USA 08889), and/or its 
> > affiliates (which may be known outside the United States as 
> > Merck Frosst, Merck Sharp & Dohme or MSD and in Japan, as 
> > Banyu) that may be confidential, proprietary copyrighted 
> > and/or legally privileged. It is intended solely for the use 
> > of the individual or entity named on this message.  If you 
> > are not the intended recipient, and have received this 
> > message in error, please notify us immediately by reply 
> > e-mail and then delete it from your system.
> > --------------------------------------------------------------
> > ----------------
> 
> 
> 
>