[R] reading in results from system(). There must be an easier way...
Henrik Bengtsson
hb at stat.berkeley.edu
Fri Sep 12 19:40:12 CEST 2008
Hi,
a few comments below.
On Fri, Sep 12, 2008 at 9:34 AM, Michael A. Gilchrist <mikeg at utk.edu> wrote:
> Hello,
>
> I am currently using R to run an external program and then read the results
> the external program sends to the stdout which are tsv data.
>
> When R reads the results in it converts it to a list of strings which I
> then have to manipulate with a whole slew of commands (which, figuring out
> how to do was a real challenge for a newbie like myself)--see below.
>
> Here's the code I'm using. COMMAND runs the external program.
>
> rawInput= system(COMMAND,intern=TRUE);##read in tsv values
For debugging purposes etc., it is good to read the data into a buffer
like this, instead of wrapping everything up in one big nested
expression. The overhead for doing this should be minimal.
> rawInput = strsplit(rawInput, split="\t");##split elements w/in the list
FYI, strsplit(x, split="\t", fixed=TRUE) is *heaps* faster (than
fixed=FALSE), e.g.
> x <- paste(1:3e4, collapse="\t")
> t <- system.time(y <- strsplit(x, split="\t"))
> t
   user  system elapsed
   2.89    0.00    2.89
> t <- system.time(y <- strsplit(x, split="\t", fixed=TRUE))
> t
   user  system elapsed
      0       0       0
> ##of character strings by "\t"
> rawInput = unlist(rawInput); ##unlist, making it one long vector
FYI, unlist(x, use.names=FALSE) is faster, especially when 'x' is long/large.
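A minimal sketch of what the argument changes, using a made-up named list (the names 'a' and 'b' are purely illustrative). Both calls return the same values; the default additionally constructs element names:

```r
# Hypothetical named list standing in for a list of split strings
x <- list(a = c("1", "2"), b = c("3", "4"))

y1 <- unlist(x)                     # also builds names "a1", "a2", "b1", "b2"
y2 <- unlist(x, use.names = FALSE)  # skips the name construction

names(y1)                  # "a1" "a2" "b1" "b2"
is.null(names(y2))         # TRUE
identical(unname(y1), y2)  # TRUE
```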
> mode(rawInput)="double"; ##convert from strings to double
> finalInput = data.frame(t(matrix(rawInput, nrow=6))); ##convert
Taking the transpose t() takes time - requires a copy in memory. Do
you really need data transposed?
Converting a matrix to a data frame takes time. Do you really need
data as a data frame?
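If you only need the values arranged as rows of six, one way to skip the explicit t() is to let matrix() fill by row. This is a sketch on made-up input (12 values, i.e. two records of six fields); whether it is actually faster on your data is worth timing:

```r
# Made-up input: two records of six fields each
rawInput <- as.double(1:12)

m1 <- t(matrix(rawInput, nrow = 6))             # original approach
m2 <- matrix(rawInput, ncol = 6, byrow = TRUE)  # fills row by row, no t()

identical(m1, m2)  # TRUE
```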
>
> Because I will be doing this 100,000s of times as part of an optimization
> problem, I am interested in learning a more efficient way of doing this
> conversion.
Do you need the data in each iteration? If not, collect the data as
strings and then do the coercing to doubles and turning it into a
matrix all together. That is likely to be faster because there is a
bit of overhead in each iteration.
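A rough sketch of that idea, which only applies if the optimizer does not need the parsed numbers inside the loop. Here runCommand() is a hypothetical stand-in for system(COMMAND, intern=TRUE), and the six-column layout is assumed from the code quoted above:

```r
# Hypothetical stand-in for system(COMMAND, intern=TRUE): returns two
# lines of six tab-separated values each.
runCommand <- function(i) {
  c(paste(i + 1:6, collapse = "\t"), paste(i + 7:12, collapse = "\t"))
}

nIter <- 3
buffer <- vector("list", nIter)   # preallocate one slot per iteration
for (i in seq_len(nIter)) {
  buffer[[i]] <- runCommand(i)    # keep the raw strings; no parsing yet
}

# Parse everything in one go after the loop
lines  <- unlist(buffer, use.names = FALSE)
fields <- strsplit(lines, split = "\t", fixed = TRUE)
values <- as.double(unlist(fields, use.names = FALSE))
result <- matrix(values, ncol = 6, byrow = TRUE)
dim(result)  # 6 6
```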
As suggested, using scan() and providing R with as many hints as
possible - explicit arguments to scan() when you know something about
the input so that R doesn't have to guess - will also speed things up.
parseA <- function(x, ...) {
  y <- strsplit(x, split="\t", fixed=FALSE);
  y <- unlist(y);
  y <- as.double(y);
}

parseB <- function(x, ...) {
  y <- strsplit(x, split="\t", fixed=TRUE);
  y <- unlist(y, use.names=FALSE);
  y <- as.double(y);
}

parseC <- function(x, ...) {
  con <- textConnection(x);
  on.exit(close(con));
  y <- scan(file=con, what=double(0), sep="\t", quiet=TRUE);
  y;
}

parseD <- function(x, ...) {
  con <- textConnection(x);
  on.exit(close(con));
  y <- scan(file=con, what=double(0), sep="\t", quote=NULL,
            na.strings=NULL, strip.white=FALSE, comment.char="",
            allowEscapes=FALSE, quiet=TRUE);
  y;
}
> x <- paste(1:3e4, collapse="\t");
> tA <- system.time(yA <- parseA(x));
> tA;
   user  system elapsed
   2.91    0.00    2.91
> tB <- system.time(yB <- parseB(x));
> tB;
   user  system elapsed
   0.03    0.00    0.04
> tC <- system.time(yC <- parseC(x));
> tC;
   user  system elapsed
   0.03    0.00    0.03
> tD <- system.time(yD <- parseD(x));
> tD;
   user  system elapsed
   0.03    0.00    0.03
> x <- paste(1:1e6, collapse="\t");
# parseA() painfully slow
> tB <- system.time(yB <- parseB(x));
> tB
   user  system elapsed
   2.30    0.00    2.31
> tC <- system.time(yC <- parseC(x));
> tC
   user  system elapsed
   1.14    0.00    1.16
> tD <- system.time(yD <- parseD(x));
> tD
   user  system elapsed
   1.16    0.01    1.17
Ok, so parseD() doesn't seem to be much faster than parseC(), but
depending on your output format it may be.
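Applied to the multi-line output in the original question, the scan() approach might look like the sketch below. It assumes an R version recent enough that scan() accepts a 'text' argument (with older versions, use textConnection() as in parseC()); the input here is made up, and the six-column layout is taken from the quoted code:

```r
# Made-up input standing in for system(COMMAND, intern=TRUE):
# two lines of six tab-separated values each
rawInput <- c("1\t2\t3\t4\t5\t6", "7\t8\t9\t10\t11\t12")

# scan() reads all lines via an internal text connection
values <- scan(text = rawInput, what = double(0), sep = "\t", quiet = TRUE)

# One row per output line, without an explicit t()
finalInput <- matrix(values, ncol = 6, byrow = TRUE)
dim(finalInput)  # 2 6
```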
Take-home message: read the help pages and try to help R as much as
possible so it does not have to guess. You can always make your code
twice as fast!
/HB
>
> Any suggestions would be appreciated.
>
>
> Thanks in advance.
>
> Mike
>
>
> -----------------------------------------------------
> Department of Ecology & Evolutionary Biology
> 569 Dabney Hall
> University of Tennessee
> Knoxville, TN 37996-1610
>
> phone:(865) 974-6453
> fax: (865) 974-6042
>
> web: http://eeb.bio.utk.edu/gilchrist.asp
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>