[R] Help with big data and parallel computing: 500,000 x 4 linear models

Tue Aug 9 04:03:06 CEST 2016

On Mon, 8 Aug 2016, Ellis, Alicia M wrote:

> I have a large dataset with ~500,000 columns and 1264 rows.  Each column 
> represents the percent methylation at a given location in the genome. 
> I need to run 500,000 linear models for each of 4 predictors of interest 
> in the form of:

> Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9
> ...and save only the pvalue for the predictor
>
> The original methylation data file had methylation sites as row labels 
> and the individuals as columns so I read the data in chunks and 
> transposed it so I now have 5 csv files (chunks) with columns 
> representing methylation sites and rows as individuals.
>
> I was able to get results for all of the regressions by running each 
> chunk of methylation data separately on our supercomputer using the code 
> below.

This sounds like a problem for my old laptop, not a supercomputer.

You might want to review the algebra and geometry of least squares.

In particular, covariate1 ... covariate9 form the same 1264 x 9 matrix for 
every problem, IIUC. So you can compute the QR decomposition of that 
matrix (together with the column of ones for the intercept) *once* and 
reuse it in all the problems.
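
A minimal sketch of that reuse, on simulated data. The sizes and the 
names `covs', `Z', `Y', `qrZ' and `RY' are made up for illustration; the 
real data have 1264 rows and ~500,000 columns:

```r
set.seed(1)
n <- 200                             # individuals (1264 in the real data)
m <- 50                              # methylation sites (~500,000 in the real data)
covs <- matrix(rnorm(n * 9), n, 9)   # simulated stand-in for covariate1..covariate9
Z <- cbind(1, covs)                  # column of ones (intercept) plus the covariates
Y <- matrix(rnorm(n * m), n, m)      # simulated stand-in for one chunk of sites

qrZ <- qr(Z)                         # decomposition computed once
RY <- qr.resid(qrZ, Y)               # residuals of every column of Y on Z, in one call
```

Each column of `RY' should match residuals(lm(Y[, j] ~ covs)), without 
refitting anything per site.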

Using that decomposition, find the residuals of the regressands and of 
the `predictor1' (etc.) regressors. The rest is simple least squares: 
compute the correlation coefficient between the residuals of a regressand 
and those of a regressor, for each combination. Make a table of critical 
values for the p-value(s) you require - remember to get the degrees of 
freedom right (i.e. account for the covariates: with an intercept, 9 
covariates and one predictor, that is 1264 - 11 = 1253 if no rows are 
dropped). These correlations of residuals are the partial correlations 
given the covariates, and a test on one of them is algebraically 
equivalent to the test on the regression coefficient for the 
corresponding regressand and regressor in a model that also includes 
those 9 covariates.
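
The whole recipe can be sketched as follows, again on simulated data. All 
object names (`covs', `pred', `Y', ...) are hypothetical stand-ins, and 
instead of a table of critical values this sketch converts each 
correlation to a t statistic and calls pt() directly:

```r
set.seed(42)
n <- 300; m <- 20; k <- 9            # rows, sites, covariates (toy sizes)
covs <- matrix(rnorm(n * k), n, k)   # stand-in for covariate1..covariate9
pred <- rnorm(n) + covs[, 1]         # stand-in for predictor1, correlated with a covariate
Y <- matrix(rnorm(n * m), n, m)      # stand-in for the methylation columns
Y[, 1] <- Y[, 1] + 0.3 * pred        # give the first site a real effect

qrZ <- qr(cbind(1, covs))            # intercept + covariates, decomposed once
ry <- qr.resid(qrZ, Y)               # residuals of all regressands
rx <- qr.resid(qrZ, pred)            # residuals of the regressor

r <- drop(cor(rx, ry))               # partial correlations, one per site
df <- n - k - 2                      # n minus intercept, k covariates, and the predictor
tstat <- r * sqrt(df / (1 - r^2))    # t statistic equivalent to the lm() coefficient test
pval <- 2 * pt(-abs(tstat), df)      # two-sided p-values

# Sanity check against the full model for one site
p_lm <- summary(lm(Y[, 1] ~ pred + covs))$coefficients["pred", "Pr(>|t|)"]
all.equal(pval[1], p_lm)
```

For 4 predictors, `rx' would become a 4-column matrix and cor(rx, ry) a 
4 x m matrix; nothing else changes.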

See:

  ?qr
  ?lm.fit

HTH,

Chuck