[R] making code (loop) more efficient
Ana Marija
okovicaanamarija at gmail.com
Wed Dec 16 01:11:27 CET 2020
Hello,
I wrote some terribly inefficient code which runs forever, but it does run.
library(dplyr)
library(splitstackshape)

datalist <- list()
files <- list.files("/WEIGHTS1/Retina", pattern = ".RDat", ignore.case = TRUE)

for (i in files) {
  a <- get(load(i))                    # load the .RDat file and grab the object it contains
  names <- rownames(a)
  data <- as.data.frame(cbind(names, a))
  rownames(data) <- NULL
  # split the colon-separated rownames into separate columns
  dd <- na.omit(concat.split.multiple(data = data, split.cols = "names",
                                      seps = ":"))
  dd <- select(dd, names_1, blup, names_3, names_4)
  colnames(dd) <- c("rsid", "weight", "ref_allele", "eff_allele")
  dd$WGT <- i                          # record which file each row came from
  datalist[[i]] <- dd                  # add it to the list
}

big_data <- do.call(rbind, datalist)
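To make the per-file step easier to refer to, it can also be written as a single function and applied with lapply(); this is just a sketch of exactly the same steps as above (read_wgt is a name I made up, and I understand rbindlist() from data.table is usually faster than do.call(rbind, ...) for stacking many tables):

library(dplyr)
library(splitstackshape)
library(data.table)

# one file -> one cleaned table (same steps as the loop body above)
read_wgt <- function(f) {
  a <- get(load(f))
  data <- as.data.frame(cbind(names = rownames(a), a))
  rownames(data) <- NULL
  dd <- na.omit(concat.split.multiple(data = data, split.cols = "names",
                                      seps = ":"))
  dd <- select(dd, names_1, blup, names_3, names_4)
  colnames(dd) <- c("rsid", "weight", "ref_allele", "eff_allele")
  dd$WGT <- f
  dd
}

# as in the loop above, this assumes the working directory is /WEIGHTS1/Retina
# (otherwise use full.names = TRUE and keep basename(f) for the WGT column)
files <- list.files("/WEIGHTS1/Retina", pattern = ".RDat", ignore.case = TRUE)
big_data <- rbindlist(lapply(files, read_wgt))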
There are 17,345 .RDat files this loop has to go through, and each file
has approximately 10,000 lines. All of the .RDat files can be downloaded
from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115828 (they
are compressed in the file GSE115828_retina_TWAS_wgts.tar.gz), and a
subset of 3 of those .RDat files is here:
https://github.com/montenegrina/sample
For one of those files, say i = "retina.ENSG00000135776.wgt.RDat", dd looks like this:
> head(dd)
          rsid        weight ref_allele eff_allele                             WGT
1:  rs72763981  9.376766e-09          C          G retina.ENSG00000135776.wgt.RDat
2: rs144383755 -2.093346e-09          A          G retina.ENSG00000135776.wgt.RDat
3:   rs1925717  1.511376e-08          T          C retina.ENSG00000135776.wgt.RDat
4:  rs61827307 -1.625302e-08          C          A retina.ENSG00000135776.wgt.RDat
5:  rs61827308 -1.625302e-08          G          C retina.ENSG00000135776.wgt.RDat
6: rs199623136 -9.128354e-10         GC          G retina.ENSG00000135776.wgt.RDat
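Just to illustrate what the split/rename steps produce, here is a toy one-row example. The rowname format here is made up (rsid:<something>:ref:alt); I only know that the second field is dropped, as in the loop above:

library(splitstackshape)
library(dplyr)

# toy row with a made-up second field, split the same way as above
toy <- data.frame(names = "rs72763981:12345:C:G", blup = 9.376766e-09)
tmp <- concat.split.multiple(data = toy, split.cols = "names", seps = ":")
tmp <- select(tmp, names_1, blup, names_3, names_4)
colnames(tmp) <- c("rsid", "weight", "ref_allele", "eff_allele")
tmp
##          rsid       weight ref_allele eff_allele
## 1: rs72763981 9.376766e-09          C          G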
So, in an attempt to parallelize this, I did the following:
library(parallel)
library(data.table)
library(foreach)
library(doSNOW)

# one SOCK worker per core
n  <- parallel::detectCores()
cl <- parallel::makeCluster(n, type = "SOCK")
doSNOW::registerDoSNOW(cl)

files <- list.files("/WEIGHTS1/Retina", pattern = ".RDat", ignore.case = TRUE)

lst_out <- foreach::foreach(i = seq_along(files),
                            .packages = "data.table") %dopar% {
  a <- get(load(files[i]))
  names <- rownames(a)
  data <- data.table(names, a["blup"])
  nm1 <- c("rsid", "ref_allele", "eff_allele")
  # split the colon-separated names, dropping the second field
  data[, (nm1) := tstrsplit(names, ":")[-2]]
  return(data[, .(rsid, weight = blup, ref_allele, eff_allele)][, WGT := files[i]][])
}

parallel::stopCluster(cl)
big_data <- rbindlist(lst_out)
I am getting this error:
Error in { : task 7 failed - "object 'blup' not found"
> parallel::stopCluster(cl)
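To narrow down which files trigger this, I think the same call could be re-run with foreach's .errorhandling = "pass" option (from the foreach documentation, it returns the error objects instead of aborting). Something like this sketch, untested, with the same cluster registered as above:

# same loop body as above, but collect errors instead of stopping
lst_out <- foreach::foreach(i = seq_along(files),
                            .packages = "data.table",
                            .errorhandling = "pass") %dopar% {
  a <- get(load(files[i]))
  names <- rownames(a)
  data <- data.table(names, a["blup"])
  nm1 <- c("rsid", "ref_allele", "eff_allele")
  data[, (nm1) := tstrsplit(names, ":")[-2]]
  data[, .(rsid, weight = blup, ref_allele, eff_allele)][, WGT := files[i]][]
}

failed <- vapply(lst_out, inherits, logical(1), what = "error")
files[failed]                        # which .RDat files raised an error
big_data <- rbindlist(lst_out[!failed])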
Can you please advise,
Ana