[R] Help with big data and parallel computing: 500,000 x 4 linear models

Aaron Mackey ajmackey at gmail.com
Mon Aug 8 23:33:29 CEST 2016


Don't run 500K separate models. Use the limma package to fit one model that
learns the variance parameters jointly; it will run on your laptop. And don't
use %methylation (the Beta value) as your Y variable; use logit(percent),
i.e. the M-value.
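
A minimal sketch of that route (hedged: `beta` is assumed here to be a
sites x samples matrix of methylation proportions, and the design columns
stand in for the predictor and covariates from the question below; limma
is a Bioconductor package):

```r
library(limma)  # Bioconductor: BiocManager::install("limma")

## Convert Beta values (proportions in (0,1)) to M-values via logit2
M <- log2(beta / (1 - beta))

## One design matrix shared by every site; 'pred' is the predictor of interest
design <- model.matrix(~ ., data = cbind(pred = predictors[, 1], covariates))

fit   <- eBayes(lmFit(M, design))  # lmFit: one fit per site; eBayes: moderated variances
pvals <- topTable(fit, coef = "pred", number = Inf, sort.by = "none")$P.Value
```

Note that lmFit() expects sites in rows and samples in columns, i.e. the
layout of the original file before transposing.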

-Aaron
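
P.S. If installing Bioconductor is a hurdle: even base R's lm() accepts a
matrix response, so a single call (one QR decomposition) fits every site in
a chunk at once. A minimal sketch on simulated stand-in data; all names and
sizes below are illustrative:

```r
set.seed(1)
n <- 200; m <- 50                       # samples x sites (stand-in sizes)
meth   <- matrix(rnorm(n * m), n, m)    # simulated methylation chunk
design <- data.frame(pred = rnorm(n), cov1 = rnorm(n), cov2 = rnorm(n))

fit <- lm(meth ~ ., data = design)      # fits all m columns in one call

## summary() of a multi-response fit returns one summary per column;
## row 2 of each coefficient table is 'pred', column 4 its p-value
pvals <- sapply(summary(fit), function(s) s$coefficients[2, 4])
```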

On Mon, Aug 8, 2016 at 2:49 PM, Ellis, Alicia M <Alicia.Ellis at med.uvm.edu>
wrote:

> I have a large dataset with ~500,000 columns and 1264 rows.  Each column
> represents the percent methylation at a given location in the genome.  I
> need to run 500,000 linear models for each of 4 predictors of interest, in
> the form of:
> Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9
> ...and save only the p-value for the predictor
>
> The original methylation data file had methylation sites as row labels and
> the individuals as columns so I read the data in chunks and transposed it
> so I now have 5 csv files (chunks) with columns representing methylation
> sites and rows as individuals.
>
> I was able to get results for all of the regressions by running each chunk
> of methylation data separately on our supercomputer using the code below.
> However, I'm going to have to do this again for another project and I would
> really like to accomplish two things to make the whole process more
> computationally efficient:
>
>
> 1)      Work with data.tables instead of data.frames (reading and
> manipulating will be much easier and faster)
>
> 2)      Do the work in parallel using say 12 cores at once and having the
> program divide the work up on the cores rather than me having to split the
> data and run 5 separate jobs on the supercomputer.
>
> I have some basic knowledge of the data.table package but I wasn't able to
> modify the foreach code below to get it to work and the code using
> data.frames didn't seem to be using all 12 cores that I created in the
> cluster.
>
> Can anyone suggest some modifications to the foreach code below that will
> allow me to do this in parallel with data.tables and not have to do it in
> chunks?
>
>
> ############# Set up cluster
> library(doSNOW)  # also loads snow, which provides makeCluster()
> clus = makeCluster(12, type = "SOCK")
> registerDoSNOW(clus)
> getDoParWorkers()
> getDoParName()
>
>
> ################### Following code needs to be modified to run the full
> dataset (batch1-batch5) in parallel
> ### Currently I read in the following chunks, and run each predictor
> separately for each chunk of data
>
> ############### Methylation data in batches
> batch1=read.csv(file="/home/alicia.m.ellis/batch1.csv")  ## Each batch
> has about 100,000 columns and 1264 rows; want to alter this to:
> ## batch1=fread(file= )
> batch2=read.csv(file="/home/alicia.m.ellis/batch2.csv")
> batch3=read.csv(file="/home/alicia.m.ellis/batch3.csv")
> batch4=read.csv(file="/home/alicia.m.ellis/batch4.csv")
> batch5=read.csv(file="/home/alicia.m.ellis/batch5.csv")
>
> predictors  ## this is a data.frame with 4 columns and 1264 rows
>
> covariates ## this is a data.frame with 9 columns and 1264 rows
>
> library(data.table)
> fits <- as.data.table(batch1)[, list(MyFits = lapply(1:ncol(batch1),
>     function(x) summary(lm(batch1[, x] ~ predictors[, 1] +
>                              covariates[, 1] + covariates[, 2] +
>                              covariates[, 3] + covariates[, 4] +
>                              covariates[, 5] + covariates[, 6] +
>                              covariates[, 7] + covariates[, 8] +
>                              covariates[, 9]
>     ))$coefficients[2, 4]
> ))]
>
>
> ######################################  This is what I was trying but
> wasn't having much luck
> #### I'm having trouble getting the data merged as a single data.frame and
> the code below doesn't seem to be dividing the work among the 12 cores in
> the cluster
>
> all.fits = foreach(j = 1:ncol(predictors), i = 1:ncol(meth1),
>                    combine = 'rbind', .inorder = TRUE) %dopar% {
>
>   model = lm(meth[, i] ~ predictors[, j] +
>                covariates[, 1] + covariates[, 2] + covariates[, 3] +
>                covariates[, 4] + covariates[, 5] + covariates[, 6] +
>                covariates[, 7] + covariates[, 8] + covariates[, 9])
>   summary(model)$coefficients[2, 4]
> }
>
>
> Alicia Ellis, Ph.D
> Biostatistician
> Pathology & Laboratory Medicine
> Colchester Research Facility
> 360 South Park Drive, Room 209C
> Colchester, VT  05446
> 802-656-9840
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

