[R] Parallel Scan of Large File
ryan.steven.garner at gmail.com
Wed Dec 8 02:22:57 CET 2010
Is it possible to scan a large file into a character vector in parallel, in 1M-record
chunks, using scan() with the "doMC" package? Furthermore, can I specify the
task for each child?
I.e. I'm working on a Linux box with 8 cores and would like to scan 8M
records at a time (each of the 8 cores scanning 1M records) from a file with 40M
records:
library(foreach); library(doMC); library(iterators)
registerDoMC(8)
file <- file("data.txt", "r")
child <- foreach(i = icount(40)) %dopar%
    scan(file, what = "character", sep = "\n",
         skip = (i - 1) * 1e6, nlines = 1e6)
Thus, each child would have a different skip argument: child[[1]]: skip = 0,
child[[2]]: skip = 1e6, child[[3]]: skip = 2e6, ... , child[[40]]:
skip = 39e6. I would then end up with a list of 40 vectors, with
child[[1]] containing records 1 to 1000000, child[[2]] containing records
1000001 to 2000000, ... , child[[40]] containing records 39000001 to 40000000.
Also, would one file connection suffice, or does there need to be a file
connection that is opened and closed for each child?
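If a separate connection per child is needed, I imagine it would look something
like the following (a sketch only, not tested; assumes doMC is registered with 8
cores and data.txt holds 40M newline-delimited records):

```r
library(foreach)
library(doMC)
library(iterators)
registerDoMC(8)  # one worker per core

# Each child opens its own connection, skips past the earlier chunks,
# reads its 1e6 records, and closes the connection before returning.
child <- foreach(i = icount(40)) %dopar% {
    con <- file("data.txt", "r")
    chunk <- scan(con, what = "character", sep = "\n",
                  skip = (i - 1) * 1e6, nlines = 1e6, quiet = TRUE)
    close(con)
    chunk
}
```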
View this message in context: http://r.789695.n4.nabble.com/Parallel-Scan-of-Large-File-tp3077545p3077545.html