[R] Help with big data and parallel computing: 500,000 x 4 linear models

Ellis, Alicia M Alicia.Ellis at med.uvm.edu
Mon Aug 8 20:49:49 CEST 2016


I have a large dataset with ~500,000 columns and 1264 rows.  Each column represents the percent methylation at a given location in the genome.  I need to run 500,000 linear models for each of 4 predictors of interest, of the form:
Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9
...and save only the p-value for the predictor.

The original methylation data file had methylation sites as row labels and individuals as columns, so I read the data in chunks and transposed it; I now have 5 CSV files (chunks) with methylation sites as columns and individuals as rows.

I was able to get results for all of the regressions by running each chunk of methylation data separately on our supercomputer using the code below.  However, I'm going to have to do this again for another project, and I would like to accomplish two things to make the whole process more computationally efficient:


1)      Work with data.tables instead of data.frames (reading and manipulating will be much easier and faster)

2)      Do the work in parallel using, say, 12 cores at once, having the program divide the work among the cores rather than my having to split the data and run 5 separate jobs on the supercomputer.

I have some basic knowledge of the data.table package, but I wasn't able to modify the foreach code below to make it work, and the data.frame version didn't seem to be using all 12 cores that I created in the cluster.

Can anyone suggest some modifications to the foreach code below that will allow me to do this in parallel with data.tables and not have to do it in chunks?


############# Set up cluster
library(doSNOW)   # attaches snow and foreach as well
clus <- makeCluster(12, type = "SOCK")
registerDoSNOW(clus)
getDoParWorkers()
getDoParName()


################### Following code needs to be modified to run the full dataset (batch1-batch5) in parallel
### Currently I read in the following chunks, and run each predictor separately for each chunk of data

############### Methylation data in batches
batch1 <- read.csv("/home/alicia.m.ellis/batch1.csv")  ## each batch has ~100,000 columns and 1264 rows
## want to alter each of these to: batch1 <- fread(file= )
batch2 <- read.csv("/home/alicia.m.ellis/batch2.csv")
batch3 <- read.csv("/home/alicia.m.ellis/batch3.csv")
batch4 <- read.csv("/home/alicia.m.ellis/batch4.csv")
batch5 <- read.csv("/home/alicia.m.ellis/batch5.csv")
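
The five read.csv() calls above could be collapsed into one pass with data.table::fread, which is much faster and returns data.tables directly.  A minimal sketch (toy temp files stand in for the real batch CSVs under /home/alicia.m.ellis/):

```r
library(data.table)

## Toy stand-ins for the five batch files: sites as columns, individuals as rows
files <- replicate(5, tempfile(fileext = ".csv"))
for (f in files)
  fwrite(as.data.table(matrix(rnorm(20), nrow = 4)), f)

## fread() each batch, then column-bind into one wide table
batches <- lapply(files, fread)
meth <- do.call(cbind, batches)
dim(meth)   # 4 rows, 25 columns
```

On the real data this replaces the manual five-job split with a single `meth` table of ~500,000 site columns.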

predictors  ## this is a data.frame with 4 columns and 1264 rows

covariates ## this is a data.frame with 9 columns and 1264 rows

fits <- as.data.table(batch1)[, list(MyFits = lapply(seq_len(ncol(batch1)), function(x)
  summary(lm(batch1[, x] ~ predictors[, 1] +
               covariates[, 1] + covariates[, 2] + covariates[, 3] +
               covariates[, 4] + covariates[, 5] + covariates[, 6] +
               covariates[, 7] + covariates[, 8] + covariates[, 9]
  ))$coefficients[2, 4]))]
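
Independent of parallelism, most of the cost above is lm() re-parsing the same formula 100,000 times.  Since the design matrix is identical for every methylation site, one option (a sketch, not the original code; toy data stand in for one batch, one predictor, and the nine covariates) is to build it once and call lm.fit(), computing the predictor's p-value by hand from the residual sum of squares:

```r
set.seed(1)
n    <- 100                                 # individuals (1264 in the real data)
meth <- matrix(rnorm(n * 50), nrow = n)     # 50 sites for illustration
pred <- rnorm(n)
covs <- matrix(rnorm(n * 9), nrow = n)

X      <- cbind(1, pred, covs)              # same design for every site
XtXinv <- chol2inv(chol(crossprod(X)))      # (X'X)^-1, computed once
df     <- nrow(X) - ncol(X)                 # residual degrees of freedom

pvals <- vapply(seq_len(ncol(meth)), function(j) {
  fit <- lm.fit(X, meth[, j])
  se  <- sqrt(XtXinv[2, 2] * sum(fit$residuals^2) / df)  # SE of the predictor
  2 * pt(abs(fit$coefficients[2] / se), df, lower.tail = FALSE)
}, numeric(1))

## agrees with lm() for the first site
all.equal(pvals[1], summary(lm(meth[, 1] ~ pred + covs))$coefficients[2, 4])
```

The same loop body then drops straight into a foreach worker, with X, XtXinv, and df shared across all sites.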


######################################  This is what I was trying but wasn't having much luck
#### I'm having trouble merging the batches into a single data.frame, and the code below doesn't seem to divide the work among the 12 cores in the cluster

## Fixes: the argument is .combine (not combine); "all. fits" had a stray space;
## and two iterators inside one foreach() advance in lockstep rather than
## crossing, so the loops must be nested with %:% to cover every (j, i) pair.
all.fits <- foreach(j = 1:ncol(predictors), .combine = 'cbind') %:%
  foreach(i = 1:ncol(meth), .combine = 'c', .inorder = TRUE) %dopar% {
    model <- lm(meth[, i] ~ predictors[, j] +
                  covariates[, 1] + covariates[, 2] + covariates[, 3] +
                  covariates[, 4] + covariates[, 5] + covariates[, 6] +
                  covariates[, 7] + covariates[, 8] + covariates[, 9])
    summary(model)$coefficients[2, 4]
  }
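
With ~500,000 columns, one foreach task per column also creates enormous scheduling overhead; a common pattern is to hand each worker a whole block of columns.  A minimal runnable sketch with toy data (using doParallel here for self-containment; registerDoSNOW on a SOCK cluster works the same way):

```r
library(doParallel)
cl <- makeCluster(2)     # 12 on the cluster
registerDoParallel(cl)

set.seed(1)
n    <- 50
meth <- matrix(rnorm(n * 40), nrow = n)   # toy batch: 40 sites
pred <- rnorm(n)
covs <- matrix(rnorm(n * 9), nrow = n)

## one block of columns per task instead of one column per task
chunks <- split(seq_len(ncol(meth)),
                cut(seq_len(ncol(meth)), 4, labels = FALSE))

pvals <- foreach(idx = chunks, .combine = c) %dopar% {
  vapply(idx, function(i)
    summary(lm(meth[, i] ~ pred + covs))$coefficients[2, 4],
    numeric(1))
}
stopCluster(cl)
length(pvals)   # 40
```

Each worker then amortizes its startup and data-transfer cost over thousands of regressions rather than one.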


Alicia Ellis, Ph.D
Biostatistician
Pathology & Laboratory Medicine
Colchester Research Facility
360 South Park Drive, Room 209C
Colchester, VT  05446
802-656-9840




