[R] Help with colinearity problem in multiple linear regression

Thu Mar 6 20:16:07 CET 2008

Hello,

  For basic linear regression lm() does the job well, for datasets that are
larger than memory biglm() seems to work.

  I'm working on a parallel implementation of multiple linear regression for
datasets that are too large for memory.

  Currently I am working over least squares:
    calculating: t(X) %*% X  and t(X) %*% y
    separately in parallel on each node

  This generates a small nxn matrix and an n vector which can then be
brought together, summed, and solved for the linear regression result:

   ( solve(t(X) %*% X) )  %*%   ( t(X) %*% y )

  Except that this can have problems with t(X) %*% X being singular.

  My understanding is that both lm() and biglm() deal with this via using QR
decomposition, but I can't figure out how this technique would be combined
with intermediate summary results which is required for the parallelism
technique to work.

  Does anyone have any pointers how this might be accomplished?

Thanks,
  Caleb

-------
# Bigger example

# Introduce artificial colinearity to test case
P = cbind(USArrests[2]*2, USArrests)
names(P) = c("Introduced", "Murder", "Assault", "UrbanPop", "Rape")

# Split the data into partitions to be calculated separately
# In the real case the full data would have been too large and
# each instance would have received just a chunk of data.
P1 = P[1:25,]
P2 = P[26:50]

# Partition 1 - calculated on host 1
P1_X = as.matrix(cbind(1, P1[,c("UrbanPop", "Assault", "Rape",
"Introduced")]))

P1_A = t(P1_X) %*% P1_X
P1_b = t(P1_X) %*% P1[,"Murder"]

# Partition 2 - calculated on host 2
P2_X = as.matrix(cbind(1, P2[,c("UrbanPop", "Assault", "Rape",
"Introduced")]))

P2_A = t(P2_X) %*% P2_X
P2_b = t(P2_X) %*% P2[,"Murder"]

# repeat for partitions 3:N
# ...

# collect data from the all the partitions, this is the key for
# the parallelism technique, I don't know how to do this step if
# I was somehow doing QR decomposition.

A = P1_A + P2_A      # ... + P3_A  ... + PN_A
b = P1_b + P2_b      # ... + P3_b  ... + PN_b

# calculate regression, this fails because of sigularity
solve(A) %*% b

# If I exclude the introduced column it works, but I'm not
# sure how this would be generalized.
solve(A[-5,-5]) %*% b[-5]

# Compare to lm()
lm(Murder~UrbanPop+Assault+Rape+Introduced, P)