[R] string-to-number

Marc Schwartz MSchwartz at mn.rr.com
Sun Aug 20 16:23:26 CEST 2006


On Sat, 2006-08-19 at 10:25 -0600, Mike Nielsen wrote:
> Wow.  New respect for parse/eval.
> 
> Do you think this is a special case of a more general principle?  I
> suppose the cost is memory, but from time to time a speedup like this
> would be very beneficial.
> 
> Any hints about how R programmers could recognize such cases would, I
> am sure, be of value to the list in general.
> 
> Many thanks for your efforts, Marc!

Mike,

I think that one needs to consider where the time is being spent and
then adjust accordingly. Once you understand that, you can develop some
insight into what may be a more efficient approach. R provides good
profiling tools that facilitate this process.

In this case, almost all of the time in the first two examples using
strsplit() is spent in that function:

> repeated.measures.columns <- paste(1:100000, collapse = ",")

> library(utils)
> Rprof(tmp <- tempfile())
> res1 <- as.numeric(unlist(strsplit(repeated.measures.columns, ",")))
> Rprof()

> summaryRprof(tmp)
$by.self
                    self.time self.pct total.time total.pct
"strsplit"              23.68     99.7      23.68      99.7
"as.double.default"      0.06      0.3       0.06       0.3
"as.numeric"             0.00      0.0      23.74     100.0
"unlist"                 0.00      0.0      23.68      99.7

$by.total
                    total.time total.pct self.time self.pct
"as.numeric"             23.74     100.0      0.00      0.0
"strsplit"               23.68      99.7     23.68     99.7
"unlist"                 23.68      99.7      0.00      0.0
"as.double.default"       0.06       0.3      0.06      0.3

$sampling.time
[1] 23.74


Contrast that with Prof. Ripley's approach:

> Rprof(tmp <- tempfile())
> res3 <- eval(parse(text=paste("c(", repeated.measures.columns, ")")))
> Rprof()

> summaryRprof(tmp)
$by.self
        self.time self.pct total.time total.pct
"parse"      0.42     87.5       0.42      87.5
"eval"       0.06     12.5       0.48     100.0

$by.total
        total.time total.pct self.time self.pct
"eval"        0.48     100.0      0.06     12.5
"parse"       0.42      87.5      0.42     87.5

$sampling.time
[1] 0.48


To some extent, one could argue that my initial timing examples are
contrived, in that they specifically demonstrate a worst-case scenario
for strsplit().  Real-world examples may or may not show such gains.

For example, in Charles' initial query, the vector was rather
short:

  > repeated.measures.columns
  [1] "3,6,10"

So if this were a one-time conversion, we would not see such significant
gains.
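
For instance, on that short string both approaches return the same
result essentially instantly (a quick illustration; 'x' is just a
stand-in name, not from the original thread):

> x <- "3,6,10"
> as.numeric(unlist(strsplit(x, ",")))
[1]  3  6 10
> eval(parse(text = paste("c(", x, ")")))
[1]  3  6 10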

However, what if we had a long list of shorter entries?

> repeated.measures.columns <- paste(1:10, collapse = ",")
> repeated.measures.columns
[1] "1,2,3,4,5,6,7,8,9,10"

> big.list <- replicate(10000, list(repeated.measures.columns))

> head(big.list)
[[1]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[2]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[3]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[4]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[5]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[6]]
[1] "1,2,3,4,5,6,7,8,9,10"


> system.time(res1 <- t(sapply(big.list, function(x)
    as.numeric(unlist(strsplit(x, ","))))))
[1] 1.972 0.044 2.411 0.000 0.000

> str(res1)
 num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...

> head(res1)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]    1    2    3    4    5    6    7    8    9    10
[3,]    1    2    3    4    5    6    7    8    9    10
[4,]    1    2    3    4    5    6    7    8    9    10
[5,]    1    2    3    4    5    6    7    8    9    10
[6,]    1    2    3    4    5    6    7    8    9    10



Now use Prof. Ripley's approach:

> system.time(res3 <- t(sapply(big.list, function(x)
    eval(parse(text=paste("c(", x, ")"))))))
[1] 1.676 0.012 1.877 0.000 0.000

> str(res3)
 num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...

> head(res3)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]    1    2    3    4    5    6    7    8    9    10
[3,]    1    2    3    4    5    6    7    8    9    10
[4,]    1    2    3    4    5    6    7    8    9    10
[5,]    1    2    3    4    5    6    7    8    9    10
[6,]    1    2    3    4    5    6    7    8    9    10



> all(res1 == res3)
[1] TRUE


Compared with the single long string, we see a notable reduction in
time with strsplit() and a notable increase with eval(parse()), even
though we are converting the same total number of values (100,000).

Much of the increase with eval(parse()) is of course due to the overhead
of sapply() and navigating the list.
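
A hedged sketch of one way to sidestep that per-element overhead (the
names 'big.text' and 'res4' are mine, not from the thread): paste the
whole list into a single string, parse it once, and reshape the result:

> # one parse/eval call over all 100,000 values, reshaped to 10000 x 10
> big.text <- paste(unlist(big.list), collapse = ",")
> res4 <- matrix(eval(parse(text = paste("c(", big.text, ")"))),
                 nrow = length(big.list), byrow = TRUE)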


Let's increase the size of the list components to 1000:

> repeated.measures.columns <- paste(1:1000, collapse = ",")
> big.list <- replicate(10000, list(repeated.measures.columns))

> system.time(res1 <- t(sapply(big.list, function(x)
    as.numeric(unlist(strsplit(x, ","))))))
[1] 33.270  0.744 37.163  0.000  0.000

> system.time(res3 <- t(sapply(big.list, function(x)
    eval(parse(text=paste("c(", x, ")"))))))
[1] 15.893  0.928 18.139  0.000  0.000


So we see here that as the size of the list components increases, there
continues to be an advantage to Prof. Ripley's approach over using
strsplit().

Again, one needs to develop an understanding of where the time is spent
by profiling, and then consider how to introduce efficiencies. In some
cases, that may very well require compiled C/Fortran code if times
become too long.
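
As a starting point, the same Rprof() workflow used above can be
wrapped around either sapply() call to see how the time splits between
strsplit()/parse() and the apply machinery (a sketch, reusing the
objects defined above):

> Rprof(tmp <- tempfile())
> res1 <- t(sapply(big.list, function(x)
    as.numeric(unlist(strsplit(x, ",")))))
> Rprof()
> summaryRprof(tmp)$by.self   # where is the time actually going?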

HTH,

Marc Schwartz


