[R] Generation of correlated variables

Fri Mar 16 13:33:46 CET 2012

On Thu, Mar 15, 2012 at 10:48:48AM -0700, Filoche wrote:
> Hi everyone.
> 
> Based on a dependent variable (y), I'm trying to generate some independent
> variables with a specified correlation. For this there's no problems.
> However, I would like that have all my "regressors" to be orthogonal (i.e.
> no correlation among them.
> 
> For example, 
> 
> y = x1 + x2 + x3 where the correlation between y x1 = 0.7, x2 = 0.4 and x3 =
> 0.8.  However, x1, x2 and x3 should not be correlated to each other.

Hi.

The previous discussion shows that the required correlations
should have sum of squares 1. It is not clear, whether your
expected application allows to normalize the correlations to
satisfy this condition. If this normalization is possible, then
try the following approach. 

It creates a matrix trans, which satisfies the following.
If Z is a matrix whose columns have zero mean and the identity covariance
matrix, then Z %*% trans has the required correlation structure.
This may be checked by looking at cov2cor(t(trans) %*% trans).

Moreover, trans should have the first column c(1, 0, 0) so that
the first column of Z is not changed by the transformation.

  # normalize the correlations (see the discussion above)
  rho <- c(0.7, 0.4, 0.8)
  rho <- rho/sqrt(sum(rho^2))

  (required <- cbind(c(1, rho), rbind(c(rho), diag(3))))

  trans <- cbind(
  y=rho,
  x1=c(rho[1], 0, 0),
  x2=c(0, rho[2], 0),
  x3=c(0, 0, rho[3]))

  # check that trans generates the required correlations
  max(abs(cov2cor(t(trans) %*% trans) - required)) # [1] 1.110223e-16

  # get orthonormal columns
  gram.schmidt <- function(a)
  {
      n <- ncol(a)
      for (i in seq.int(length=n)) {
          for (j in seq.int(length=i-1)) {
              a[, i] <- a[, i] - sum(a[, j]*a[, i])*a[, j]
          }
          a[, i] <- a[, i]/sqrt(sum(a[, i]^2))
      }
      a
  }

  # multiply trans by a unitary matrix, so that the first column becomes (1, 0, 0)
  U <- gram.schmidt(trans[, 1:3])
  trans <- t(U) %*% trans

  # check that trans still generates the required correlations
  max(abs(cov2cor(t(trans) %*% trans) - required)) # [1] 1.110223e-16

  # choose an example of the real vector y and normalize
  y <- rnorm(100)
  ynorm <- y - mean(y)
  ynorm <- ynorm/sqrt(sum(ynorm^2))

  # find two more normalized vectors orthogonal to ynorm
  u <- gram.schmidt(cbind(1, ynorm, rnorm(length(y)), rnorm(length(y))))[, 3:4]
  Z <- cbind(ynorm, u) 

  # get the solution as suggested at the beginning
  S <- Z %*% trans

  # check that S is a solution for the normalized y (ynorm)
  max(abs(cor(S) - required)) # [1] 1.110223e-16
  max(abs(S[, 1] - ynorm)) # [1] 8.326673e-17
  max(abs(S[, 2] + S[, 3] + S[, 4] - ynorm)) # [1] 1.110223e-16

This solution contains "ynorm". In order to get "y" as required, a
linear transformation, which is inverse to the normalization, should
be done. Since it is linear, it does not change the correlations.

Hope this helps.

Petr Savicky.