[R] R on Multicore for Linux
ggrothendieck at gmail.com
Fri Jul 22 08:44:32 CEST 2011
On Thu, Jul 21, 2011 at 3:20 PM, Madana_Babu <madana_babu at infosys.com> wrote:
> Hi all,
> Currently I am trying to do this in R, which is running on a multicore
> processor. I am not sure how to use the mclapply() function for this task.
> Can anyone help?
> # Setting up the directory
> # Data is available in the form of multiple structured log files (nearly 10K
> # log files)
> # I am using the following syntax to get the required fields and aggregations
> # from the logs, building a data frame called DF (with 3 columns: V2, V14 and
> # MIN(V16))
> library(sqldf)
> a <- list.files(path = ".", pattern = "2011-07-20", all.files = FALSE,
>     full.names = FALSE, recursive = FALSE, ignore.case = FALSE)
> DF <- NULL
> for (f in a) {
>     dat <- read.csv(f, header = FALSE, sep = "\t", na.strings = "",
>         dec = ".", strip.white = TRUE, fill = TRUE)
>     data_1 <- sqldf("SELECT V2, V14, MIN(V16) FROM dat WHERE V6 = 104
>         GROUP BY V2, V14")
>     DF <- rbind(DF, data_1)
> }
> # Currently this process takes almost 3 hours.
> Can anyone help me use mclapply() on this operation so it completes as
> quickly as possible?
Instead of reading each file with read.csv and then passing it to sqldf,
read each file directly with read.csv.sql:

data_1 <- read.csv.sql(f, sql = "select ...", header = FALSE,
    dbname = ":memory:", sep = "\t", ...whatever...)
That has two advantages: (a) you will be using sqlite's read routines
rather than R's, and they may be faster, and (b) instead of File --> R
--> sqlite --> R it will be just File --> sqlite --> R.
See ?read.csv.sql, since the arguments are not identical to read.table's
(they are based on sqlite's read routines, not R's). Above we have
assumed each file is individually small enough to fit into memory; if
that is not the case, omit the dbname = argument and a temporary
database file on disk will be used instead. sqlite's read routines
can't cope with everything R's can, but if your file format is
sufficiently vanilla it should work.
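Putting this together with the mclapply() the original poster asked
about, a minimal sketch of the parallel version might look like the
following. This is only a sketch under assumptions: it uses the
parallel package (R >= 2.14; on older R, the multicore package provides
mclapply), it assumes read.csv.sql names header-less columns V1..V16 as
read.csv does, and the two tiny demo files it writes are invented
stand-ins for the real 10K logs.

```r
library(sqldf)     # for read.csv.sql
library(parallel)  # for mclapply; on pre-2.14 R use library(multicore)

# Demo stand-ins for the real log files: two small tab-separated files
# with 16 columns, and V6 fixed at 104 so every row passes the filter.
make_log <- function(path) {
  d <- as.data.frame(matrix(sample(1:9, 5 * 16, replace = TRUE), ncol = 16))
  d[, 6] <- 104
  write.table(d, path, sep = "\t", row.names = FALSE, col.names = FALSE)
}
files <- file.path(tempdir(), c("log1-2011-07-20.txt", "log2-2011-07-20.txt"))
invisible(lapply(files, make_log))

# Read and aggregate one file entirely inside SQLite; only the small
# aggregated result comes back to R. In read.csv.sql the table is
# referred to as "file" in the SQL statement.
process_one <- function(f) {
  read.csv.sql(f,
    sql = "SELECT V2, V14, MIN(V16) FROM file WHERE V6 = 104 GROUP BY V2, V14",
    header = FALSE, sep = "\t", dbname = ":memory:")
}

# Fork one worker per core (mclapply forks, so this is Linux/Unix only),
# then combine the per-file aggregates into one data frame.
results <- mclapply(files, process_one, mc.cores = detectCores())
DF <- do.call(rbind, results)
```

Since each file is aggregated independently and only the small per-file
result is returned, the per-worker memory footprint stays low and the
final rbind is cheap.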
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
email: ggrothendieck at gmail.com