[R] Help with big data and parallel computing: 500,000 x 4 linear models
Charles C. Berry
ccberry at ucsd.edu
Tue Aug 9 04:03:06 CEST 2016
On Mon, 8 Aug 2016, Ellis, Alicia M wrote:
> I have a large dataset with ~500,000 columns and 1264 rows. Each column
> represents the percent methylation at a given location in the genome.
> I need to run 500,000 linear models for each of 4 predictors of interest
> in the form of:
> Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9
> ...and save only the pvalue for the predictor
> The original methylation data file had methylation sites as row labels
> and the individuals as columns so I read the data in chunks and
> transposed it so I now have 5 csv files (chunks) with columns
> representing methylation sites and rows as individuals.
> I was able to get results for all of the regressions by running each
> chunk of methylation data separately on our supercomputer using the code
This sounds like a problem for my old laptop, not a supercomputer.
You might want to review the algebra and geometry of least squares.
In particular, covariate1 ... covariate9 are the same 1264 x 9 matrix for
every problem IIUC. So, you can compute the QR decomposition for that
matrix (plus the column of ones for the `intercept') *once* and use it in
all the regressions.
Using that decomposition, find the residuals for the regressands and for
the `predictor1' (etc) regressors. The rest is simple least squares. You
compute the correlation coefficient of the residuals of a regressand and
those of a regressor, for each combination. Make a table of critical
values for the p-value(s) you require - remember to get the degrees of
freedom right (i.e. account for the covariates). These correlations of
residuals are the partial correlations given the covariates, and a test on
one of them is algebraically equal to the test on the regression
coefficient for the corresponding regressand and regressor in a model that
also includes those 9 covariates.
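Here is a rough sketch of the steps above in R, on simulated data. The
object names (Z, X, Y and so on) and the small number of sites are made up
for illustration, and it assumes no missing values and one predictor per
model. It computes p-values directly from the t distribution and also
shows the critical value for the correlation, so treat it as one way to
do it rather than the only one.

```r
set.seed(1)
n <- 1264   # individuals (rows), as in the question
m <- 200    # methylation sites; tiny here, ~500,000 in the real data
k <- 9      # number of covariates

# Simulated stand-ins for the real data (hypothetical names)
Z <- matrix(rnorm(n * k), n, k)   # covariate1 ... covariate9
X <- matrix(rnorm(n * 4), n, 4)   # predictor1 ... predictor4
Y <- matrix(rnorm(n * m), n, m)   # one column per methylation site

# QR decomposition of intercept + covariates, computed once
qrZ <- qr(cbind(1, Z))

# Residuals of all regressands and all predictors given the covariates
rY <- qr.resid(qrZ, Y)
rX <- qr.resid(qrZ, X)

# Partial correlations: an m x 4 matrix, one per (site, predictor) pair
pr <- cor(rY, rX)

# Residual df: n minus intercept, the 9 covariates, and the one predictor
df <- n - (k + 1) - 1

# Two-sided p-values from the t statistic of each partial correlation
tstat <- pr * sqrt(df / (1 - pr^2))
pval  <- 2 * pt(-abs(tstat), df)

# Equivalent critical value on the correlation scale for a given alpha
alpha <- 0.05
tc    <- qt(1 - alpha / 2, df)
rcrit <- tc / sqrt(tc^2 + df)

# Check one combination against the full model fitted by lm()
fit  <- lm(Y[, 1] ~ X[, 1] + Z)
p_lm <- summary(fit)$coefficients[2, 4]
all.equal(pval[1, 1], p_lm)
```

For the real data you would apply the qr.resid() and cor() steps to each
chunk of sites in turn, reusing the same qrZ and rX.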