[R] reading in results from system(). There must be an easier way...

Fri Sep 12 19:40:12 CEST 2008

Hi,

a few comments below.

On Fri, Sep 12, 2008 at 9:34 AM, Michael A. Gilchrist <mikeg at utk.edu> wrote:
> Hello,
>
> I am currently using R to run an external program and then read the results
> the external program sends to the stdout which are tsv data.
>
> When R reads the results in, it converts them to a list of strings which I
> then have to manipulate with a whole slew of commands (which, figuring out
> how to do was a real challenge for a newbie like myself)--see below.
>
> Here's the code I'm using.  COMMAND runs the external program.
>
>    rawInput= system(COMMAND,intern=TRUE);##read in tsv values

For debugging purposes etc., it is good to read the data into a buffer
like this instead of wrapping everything up in one big nested
expression.  The overhead for doing this should be minimal.

>    rawInput = strsplit(rawInput, split="\t");##split elements w/in the list

FYI, strsplit(x, split="\t", fixed=TRUE) is *heaps* faster (than
fixed=FALSE), e.g.

> x <- paste(1:3e4, collapse="\t")
> t <- system.time(y <- strsplit(x, split="\t"))
> t
   user  system elapsed
   2.89    0.00    2.89
> t <- system.time(y <- strsplit(x, split="\t", fixed=TRUE))
> t
   user  system elapsed
      0       0       0
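
A minimal sketch of why the two calls are interchangeable here: with
fixed=TRUE the split string is matched literally instead of being
compiled as a regular expression, and for a plain tab both give the same
result.  The second part uses "." as an illustrative separator (not from
the original question) to show that the two modes differ as soon as the
separator is a regular-expression metacharacter.

```r
x <- paste(1:10, collapse="\t")

# 'split' interpreted as a regular expression (the default)
yRegex <- strsplit(x, split="\t")
# 'split' interpreted as a literal string
yFixed <- strsplit(x, split="\t", fixed=TRUE)

# For a tab separator the results are the same
sameForTab <- identical(yRegex, yFixed)

# With a metacharacter such as "." only the literal match splits on dots
dotFixed <- strsplit("a.b", split=".", fixed=TRUE)[[1]]
```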

>                                              ##of character strings by "\t"
>    rawInput = unlist(rawInput); ##unlist, making it one long vector

FYI, unlist(x, use.names=FALSE) is faster, especially when 'x' is long/large.
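
A small sketch of what use.names=FALSE skips.  The named list 'parts' is
a made-up example; the list returned by strsplit() on an unnamed
character vector has no names, so for the code above the values come out
the same either way and only the name handling is avoided.

```r
# Hypothetical named list, only to make the name handling visible
parts <- list(a=c("1", "2"), b=c("3", "4"))

withNames    <- unlist(parts)                   # builds names a1, a2, b1, b2
withoutNames <- unlist(parts, use.names=FALSE)  # no names are built

# Output of strsplit() on an unnamed vector: same values in both cases
fields <- strsplit(c("1\t2", "3\t4"), split="\t", fixed=TRUE)
sameValues <- identical(unlist(fields), unlist(fields, use.names=FALSE))
```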

>    mode(rawInput)="double"; ##convert from strings to double
>    finalInput = data.frame(t(matrix(rawInput, nrow=6))); ##convert

Taking the transpose t() takes time - requires a copy in memory.  Do
you really need data transposed?

Converting a matrix to a data frame takes time.  Do you really need
data as a data frame?
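
If the row-wise layout is needed, one way to get it without an explicit
t() call and without a data frame is to fill the matrix by row.  This is
a sketch with made-up values standing in for the parsed output; whether
it is measurably faster for your data is something to time yourself.

```r
# Stand-in for the numeric vector obtained after parsing (6 values per record)
rawInput <- as.double(1:12)

# Layout from the original code: fill by column, then transpose
viaTranspose <- t(matrix(rawInput, nrow=6))

# Same layout filled directly by row, kept as a plain matrix
direct <- matrix(rawInput, ncol=6, byrow=TRUE)

sameLayout <- identical(viaTranspose, direct)
```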

>
> Because I will be doing this 100,000s of times as part of an optimization
> problem, I am interested in learning a more efficient way of doing this
> conversion.

Do you need the data in each iteration?  If not, collect the data as
strings and then do the coercing to doubles and turning it into a
matrix all together.  That is likely to be faster because there is a
bit of overhead in each iteration.
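
A sketch of that idea, applicable only if the parsed values are not
needed to decide the next iteration.  runOnce() is a hypothetical
stand-in for system(COMMAND, intern=TRUE) and the iteration count is
made up.

```r
# Hypothetical stand-in for system(COMMAND, intern=TRUE):
# returns one tab-separated line with six values
runOnce <- function(i) paste(i + 0:5, collapse="\t")

nIter <- 3
rawResults <- vector("list", nIter)  # preallocate the buffer

# Inside the loop, only store the raw strings
for (i in seq_len(nIter)) {
  rawResults[[i]] <- runOnce(i)
}

# After the loop, split and coerce everything in one go
allStrings <- unlist(rawResults, use.names=FALSE)
fields <- strsplit(allStrings, split="\t", fixed=TRUE)
values <- as.double(unlist(fields, use.names=FALSE))
results <- matrix(values, ncol=6, byrow=TRUE)  # one row per iteration
```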

As suggested, using scan() and providing R with as many hints as
possible - explicit arguments to scan() when you know something about
the input so that R doesn't have to guess - will also speed things up.

# Baseline: regular-expression split, default unlist()
parseA <- function(x, ...) {
  y <- strsplit(x, split="\t", fixed=FALSE);
  y <- unlist(y);
  y <- as.double(y);
  y;
}

# Literal split, no names built by unlist()
parseB <- function(x, ...) {
  y <- strsplit(x, split="\t", fixed=TRUE);
  y <- unlist(y, use.names=FALSE);
  y <- as.double(y);
  y;
}

parseC <- function(x, ...) {
  con <- textConnection(x);
  on.exit(close(con));
  y <- scan(file=con, what=double(0), sep="\t", quiet=TRUE);
  y;
}

parseD <- function(x, ...) {
  con <- textConnection(x);
  on.exit(close(con));
  y <- scan(file=con, what=double(0), sep="\t", quote=NULL,
            na.strings=NULL, strip.white=FALSE, comment.char="",
            allowEscapes=FALSE, quiet=TRUE);
  y;
}

> x <- paste(1:3e4, collapse="\t");
> tA <- system.time(yA <- parseA(x));
> tA;
   user  system elapsed
   2.91    0.00    2.91
> tB <- system.time(yB <- parseB(x));
> tB;
   user  system elapsed
   0.03    0.00    0.04
> tC <- system.time(yC <- parseC(x));
> tC;
   user  system elapsed
   0.03    0.00    0.03
> tD <- system.time(yD <- parseD(x));
> tD;
   user  system elapsed
   0.03    0.00    0.03

> x <- paste(1:1e6, collapse="\t");
# parseA() painfully slow
> tB <- system.time(yB <- parseB(x));
> tB
   user  system elapsed
   2.30    0.00    2.31
> tC <- system.time(yC <- parseC(x));
> tC
   user  system elapsed
   1.14    0.00    1.16
> tD <- system.time(yD <- parseD(x));
> tD
   user  system elapsed
   1.16    0.01    1.17

Ok, so parseD() doesn't seem to be much faster than parseC(), but
depending on your output format it may be.
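
In current versions of R, scan() also accepts the input string directly
through its 'text' argument, which avoids creating and closing the text
connection by hand.  As far as I know this argument was not available
when the functions above were written, so treat parseE() below as a
hypothetical variant of parseC() and not as part of the timings shown.

```r
# Hypothetical variant of parseC() using scan(text=...) instead of
# an explicit textConnection()
parseE <- function(x, ...) {
  scan(text=x, what=double(0), sep="\t", quiet=TRUE)
}

x <- paste(1:10, collapse="\t")
yE <- parseE(x)
```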

Take home message: read the help pages and try to help R as much as
possible so it does not have to guess.  You can always make your code
twice as fast!

/HB

>
> Any suggestions would be appreciated.
>
>
> Thanks in advance.
>
> Mike
>
>
> -----------------------------------------------------
> Department of Ecology & Evolutionary Biology
> 569 Dabney Hall
> University of Tennessee
> Knoxville, TN 37996-1610
>
> phone:(865) 974-6453
> fax:  (865) 974-6042
>
> web: http://eeb.bio.utk.edu/gilchrist.asp