[R] Predicting correlated responses using canonical correlations in multivariate least squares

Sat Feb 27 19:34:46 CET 2016

Hi Alex,

Thanks for the detailed explanation and the reproducible example.
But it is still not clear exactly what you wish to accomplish.

You know how to calculate the scores on the canonical variates.
These could be considered 'predicted scores', but in canonical
space.  What's wrong with that?
But it seems that what you want are predicted values
for y1, y2, ... from the canonical variates for the x variables ???

I don't think your question is sufficiently well-posed, but you
might look at the literature on *redundancy analysis*, which takes
the symmetric (Y, X) canonical correlation problem and re-formulates
it into questions of how well the Ys (or Xs) are predicted.

Is the problem just that the predictors and the responses
are both highly correlated within each set?  If so, you
should probably find that there is only one strong canonical
correlation, as in your example, and be done with it.
You might also find that an HE plot (library (heplots))
is illuminating.

The Breiman/Friedman paper you refer to can't be read on my
system, and, in any case, applying shrinkage to the problem
is something to consider once you have your ducks in order.

best,
-Michael

On 2/27/2016 4:38 AM, doorz wrote:
> I have a considerable interest in trying to improve the predictions from
> 10+ highly correlated predictors to two (2) highly correlated responses.
>
> OLS does not allow me to take into account the correlation between the
> responses. Multivariate can take into account the correlations between
> the predictors.
>
> I have tried to use canonical correlations (R function cancor).
> Canonical correlations nicely show the interrelations between the
> responses and predictors. Predicting the responses using cancor is a
> different matter. There is no predict method that supports cancor.
> Simple backsubstition leads to improved correlations but poor pairwise
> correspondence. (R code I use is below, for a 2 predictor, 2 response
> set of data, this is a book example (Frets, 1921 headsize data), my data
> is similar but can't be summarized in this post as easily)
>
> The 1997 paper by Breiman and Friedman gives a procedure using shrinkage
> on least squares estimates obtained from canonical variates and their
> origins (the responses). They call this Curds and Whey. See  this post.
>
> ftp://ftp.cis.upenn.edu/pub/datamining/public_html/ReadingGroup/papers/multiResponse.pdf
>
>
> There is a master thesis (Kidd) which uses the above and promises an R
> package to handle the curds and whey procedures. That R package does not
> seem to have been made public though. See here:
>
> http://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=3183&context=etd
>
> Since the Breimam and Friedman paper is almost 20 years old I am
> wondering about their procedure. One would think that it would have been
> adopted if it is as powerful as it is claimed to be since correlated
> dependents are quite common. Is there a flaw in it that I do not know
> of? Is there anything better available today?
>
> Any tips are most welcome.
>
> Regards,
> Alex van der Spek
>
> #! usr/env/bin R
> #CCA example try out
> #Use data on head size of sons/brothers length and breadth
> #Taken from the book by Everitt
>
> #Clean slate
> rm(list=ls())
>
> ## Panel function, put histograms on the diagonal
> panel.hist <- function(x, ...)
> {
>      usr <- par("usr"); on.exit(par(usr))
>      par(usr = c(usr[1:2], 0, 1.5) )
>      h <- hist(x, plot = FALSE, breaks=21)
>      breaks <- h$breaks; nB <- length(breaks)
>      y <- h$counts; y <- y/max(y)
>      rect(breaks[-nB], 0, breaks[-1], y, col="white", ...)
> }
>
> ## Panel function, put correlation in top
> panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
> {
>      usr <- par("usr"); on.exit(par(usr))
>      par(usr = c(0, 1, 0, 1))
>      r <- abs(cor(x, y))
>      txt <- format(c(r, 0.123456789), digits=digits)[1]
>      txt <- paste(prefix, txt, sep="")
>      if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
>      text(0.5, 0.5, txt, cex = cex.cor * r)
> }
>
> #Bring up file selection box, set working directory and get base name
> fn <- file.choose()
> fs <- basename(fn)
> fp <- dirname(fn)
>
> #Read data in by csv
> hs.r <- read.csv(fn)
>
>
> #This is the entire data set, 25 points, two predictors (x) and two
> responses (y)
>> head(hs.r, 25)
>      x1  x2  y1  y2
> 1  191 155 179 145
> 2  195 149 201 152
> 3  181 148 185 149
> 4  183 153 188 149
> 5  176 144 171 142
> 6  208 157 192 152
> 7  189 150 190 149
> 8  197 159 189 152
> 9  188 152 197 159
> 10 192 150 187 151
> 11 179 158 186 148
> 12 183 147 174 147
> 13 174 150 185 152
> 14 190 159 195 157
> 15 188 151 187 158
> 16 163 137 161 130
> 17 195 155 183 158
> 18 186 153 173 148
> 19 181 145 182 146
> 20 175 140 165 137
> 21 192 154 185 152
> 22 174 143 178 147
> 23 176 139 176 143
> 24 197 167 200 158
> 25 190 163 187 150
>
>
> #Make a scatter plot matrix
> pairs(hs.r ,lower.panel = panel.smooth, upper.panel = panel.cor,
> diag.panel = panel.hist)
>
> #Sweep out summary stats
> hs.s <- scale(hs.r, center=F, scale=T)
> hs.c <- scale(hs.r, center=T, scale=F)
> hs.a <- scale(hs.r, center=T, scale=T)
>
> #Make a scatter plot matrix of centered and scaled, standardized data
> pairs(hs.a ,lower.panel = panel.smooth, upper.panel = panel.cor,
> diag.panel = panel.hist)
> hs.1 <- hs.a[,1:2]
> hs.2 <- hs.a[,3:4]
>
> cc <- cancor(hs.1, hs.2, xcenter=F, ycenter=F)
>
> #Compute the decompositon
> #u1 <- cc$xcoef[1,1] * hs.1[,'x1'] + cc$xcoef[1,2] * hs.1[,'x2']
> #v1 <- cc$ycoef[1,1] * hs.2[,'y1'] + cc$ycoef[1,2] * hs.2[,'y2']
> #u2 <- cc$xcoef[2,1] * hs.1[,'x1'] + cc$xcoef[2,2] * hs.1[,'x2']
> #v2 <- cc$ycoef[2,1] * hs.2[,'y1'] + cc$ycoef[2,2] * hs.2[,'y2']
>
> #Quicker and better
> uu <- hs.1 %*% cc$xcoef
> vv <- hs.2 %*% cc$ycoef
>
> #Store in a dataframe
> uv <- data.frame(u1=uu[,1], v1=vv[,1], u2=uu[,2], v2=vv[,2])
>
> #Make a scatter plot matrix
> pairs(uv ,lower.panel = panel.smooth, upper.panel = panel.cor,
> diag.panel = panel.hist)
>
> #Make a linear model, the significant coefs will prove to be the
> eigenvalues
> lm.uv <- lm(cbind(u1, u2) ~ v1 + v2 - 1, data = uv)
> summary(lm.uv)
>
> #Notation  x * k * A  == y * B, solve for y
> A <- cc$xcoef
> B <- cc$ycoef
> r <- cc$cor
>
> #Compute the back substitioon
> C <-  A %*% diag(r) %*% solve(B)
>
> yy <- hs.1 %*% C
> colnames(yy) <- c('yy1', 'yy2')
>
> #Show how this relates to the original responses, not bad
> pairs(cbind(hs.1, hs.2, yy), lower.panel = panel.smooth, upper.panel =
> panel.cor, d
>
> #But pairwise the correspondence is poor.
> head(cbind(hs.2, yy)
>
>> head(cbind(hs.2, yy))
>               y1          y2         yy1          yy2
> [1,] -0.4820596 -0.63189808  0.43234091  0.434382250
> [2,]  1.7091204  0.41132988  0.30938540  0.259340111
> [3,]  0.1155349 -0.03576782 -0.36892523 -0.368659286
> [4,]  0.4143322 -0.03576782 -0.02725575 -0.005036034
> [5,] -1.2788523 -1.07899577 -0.79475501 -0.798376574
> [6,]  0.8127286  0.41132988  1.29559865  1.241261763
>