[R] saving a character vector

Philippe Grosjean phgrosjean at sciviews.org
Sun Feb 5 09:22:55 CET 2006


If I understand the question correctly, both Jim Holtman's and John 
Fox's answers are correct solutions. However, they are not optimal 
(the question was not "please optimize my code", but it is worth 
discussing a little).

- Jim proposes (I have reworked his code slightly):
generateIndex1 <- function(n.item) {
     Res <- character(0)  # initialize vector
     for (i in 1:(n.item - 1)) { # John Fox's correction introduced
         for (j in ((i+1):n.item)) {
             # concatenate the results
             Res <- c(Res, paste("i", formatC(i, digits = 2, flag = "0"),
                 ".", formatC(j, digits = 2, flag = "0"), sep = ""))
         }
     }
     Res
}

- John Fox proposes:
generateIndex2 <- function(n.item) {
     result <- rep("", n.item * (n.item - 1) / 2)
     index <- 0
     for (i in 1:(n.item - 1)) {
         for (j in ((i + 1):n.item)) {
             index <- index + 1
             result[index] <- paste("i",
                 formatC(i, digits = 2, flag = "0"), ".",
                 formatC(j, digits = 2, flag = "0"), sep = "")
         }
     }
     result
}

The difference is that Jim creates an empty character vector and 
concatenates to it (the simplest code), whereas John creates a vector of 
empty strings of the correct size [result <- rep("", n.item * (n.item - 1) 
/ 2)]. The second solution is supposed to be better, because "result" 
already has the right size, which avoids reallocating and copying the 
vector at each loop iteration. However:

 > system.time(generateIndex1(100))
[1] 4.86 0.00 4.86   NA   NA
 > system.time(generateIndex2(100))
[1] 4.68 0.00 4.68   NA   NA

There is not much difference (indeed, the loops and what is 
calculated repeatedly inside them take much more time in this case). 
However, I wonder what happens if I allocate a vector of the right size 
whose strings also have the right size:

generateIndex3 <- function(n.item) {
     result <- rep("i000.000", n.item * (n.item - 1) / 2)
     index <- 0
     for (i in 1:(n.item - 1)) {
         for (j in ((i + 1):n.item)) {
             index <- index + 1
             result[index] <- paste("i",
                 formatC(i, digits = 2, flag = "0"), ".",
                 formatC(j, digits = 2, flag = "0"), sep = "")
         }
     }
     result
}

 > system.time(generateIndex3(100))
[1] 4.63 0.02 4.66   NA   NA

... About the same. **Could someone explain this to me, please?**
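
One way to look into this is to take formatC() and paste() out of the 
picture and time only the filling of the vector. The sketch below does 
that; fillGrow() and fillPrealloc() are hypothetical helper names, and 
the test strings are arbitrary. A likely explanation for the third case 
(an assumption about R's internals, not something measured in this 
thread) is that a character vector stores one reference per element 
while the string data live elsewhere, so the length of a placeholder 
that gets overwritten anyway should not matter.

```r
# Hypothetical helpers: fill a character vector from strings that were
# built beforehand, so that formatC() and paste() are not part of the timing.
fillGrow <- function(strings) {
    res <- character(0)
    for (s in strings) res <- c(res, s)   # res is reallocated at each pass
    res
}

fillPrealloc <- function(strings, placeholder = "") {
    res <- rep(placeholder, length(strings))   # allocated once, up front
    for (k in seq_along(strings)) res[k] <- strings[k]
    res
}

# Arbitrary test strings; 4950 = 100 * 99 / 2, the length used above
strings <- sprintf("i%03d.%03d", 1:4950, 4950:1)

system.time(r1 <- fillGrow(strings))
system.time(r2 <- fillPrealloc(strings, ""))
system.time(r3 <- fillPrealloc(strings, "i000.000"))

# All three strategies must return the same vector
stopifnot(identical(r1, strings), identical(r2, strings),
          identical(r3, strings))
```

Timings depend on the machine and the R version, so none are quoted 
here; the point is only that the three strategies can be compared 
without the formatting cost hiding the difference.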

Now, where is the bottleneck?

 > Rprof()
 > res <- generateIndex3(100)
 > Rprof(NULL)
 > ?summaryRprof
 > summaryRprof()
$by.self
                    self.time self.pct total.time total.pct
formatC                 0.48     10.5       4.30      93.9
paste                   0.46     10.0       4.54      99.1
pmax                    0.44      9.6       0.66      14.4
as.integer              0.30      6.6       0.34       7.4
as.logical              0.24      5.2       0.34       7.4
names                   0.20      4.4       0.24       5.2
...

Gosh! Of course: why do I call formatC() twice on every pass through the 
loop? I can increase speed by formatting my character strings only once!

generateIndex4 <- function(n.item) {
     result <- rep("i000.000", n.item * (n.item - 1) / 2)
     index <- 0
     id <- formatC(1:n.item, digits = 2, flag = "0")
     for (i in 1:(n.item - 1)) {
         for (j in ((i + 1):n.item)) {
             index <- index + 1
             result[index] <- paste("i", id[i], ".", id[j], sep = "")
         }
     }
     result
}

 > system.time(generateIndex4(100))
[1] 0.33 0.00 0.33   NA   NA

Yes! That's much better.
Now, recalling that a vectorized algorithm is better than loops, 
could I get rid of these two ugly loops? Here is something using outer() 
and lower.tri():

generateIndex5 <- function(n.item) {
     idx <- function(x, y) paste("i", x, ".", y, sep = "")
     id <- formatC(1:n.item, digits = 2, flag = "0")
     allidx <- t(outer(id, id, idx))
     allidx[lower.tri(allidx)]
}

 > system.time(generateIndex5(100))
[1] 0.02 0.00 0.02   NA   NA

Indeed! That code is much, much faster!
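
To see why the transpose followed by lower.tri() returns the pairs in 
the same order as the nested loops, here is a small worked case with 
three items. The ids are typed in by hand, as an assumption standing in 
for the formatC() call.

```r
# Hand-typed ids, standing in for formatC(1:3, digits = 2, flag = "0")
id <- c("001", "002", "003")
idx <- function(x, y) paste("i", x, ".", y, sep = "")

m <- outer(id, id, idx)   # m[a, b] pairs id[a] with id[b]
tm <- t(m)                # tm[a, b] pairs id[b] with id[a]

# Below the diagonal of tm the row index a is larger than the column
# index b, so each element reads "i<smaller>.<larger>". Extraction goes
# column by column, i.e. all pairs starting with 001 first, then 002, ...
res <- tm[lower.tri(tm)]
res
# [1] "i001.002" "i001.003" "i002.003"
```

Without the t(), the same positions would hold "i002.001", "i003.001" 
and "i003.002", that is, the larger index first.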
Now, let's compare generateIndex1() with generateIndex5().

- generateIndex5() is optimized for speed (4.86/0.02, about 250 times 
faster!)

- generateIndex5() is more concise code: 4 lines, no loops, compared to 
8 lines with two loops.

- but... generateIndex1() is the code that comes to mind more easily 
(except, perhaps, for some R experts (?) because thinking with vectors is 
second nature to them).

- but... generateIndex1() is much easier to understand when someone 
else reads the code (for the same reason).
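
Before trading readability for speed, it seems prudent to check that 
the two versions really return the same vector. A minimal check, 
repeating both definitions from above so that it runs on its own:

```r
# Loop version (generateIndex1 from above)
generateIndex1 <- function(n.item) {
    Res <- character(0)
    for (i in 1:(n.item - 1)) {
        for (j in ((i + 1):n.item)) {
            Res <- c(Res, paste("i", formatC(i, digits = 2, flag = "0"),
                ".", formatC(j, digits = 2, flag = "0"), sep = ""))
        }
    }
    Res
}

# Vectorized version (generateIndex5 from above)
generateIndex5 <- function(n.item) {
    idx <- function(x, y) paste("i", x, ".", y, sep = "")
    id <- formatC(1:n.item, digits = 2, flag = "0")
    allidx <- t(outer(id, id, idx))
    allidx[lower.tri(allidx)]
}

# Same strings, same order, same length; stops with an error otherwise
stopifnot(identical(generateIndex1(10), generateIndex5(10)))
length(generateIndex5(10))   # 10 * 9 / 2 pairs
```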

Final conclusion:
generateIndex5() is better R code (I am sure one can do even better!), 
but it takes a little more intellectual work to arrive at this result 
(i.e., rethinking the problem using matrix calculation). However, the 
result is worth the effort.
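
As one possible direction for "even better", here is a sketch that 
builds the two index vectors directly with rep() and sequence(), so 
the full n.item x n.item matrix is never created. The name 
generateIndex6 is hypothetical, sprintf("%03d", ...) is swapped in for 
the formatC() call, and the guard for n.item < 2 is an addition (the 
loop versions above assume at least two items). No timing is claimed 
for it here.

```r
# Hypothetical variant: index vectors instead of a full matrix
generateIndex6 <- function(n.item) {
    # Added guard: 1:(n.item - 1) would count backwards for n.item < 2
    if (n.item < 2) return(character(0))
    id <- sprintf("%03d", seq_len(n.item))   # swapped in for formatC()
    runs <- (n.item - 1):1                   # number of pairs per first index
    i <- rep(seq_len(n.item - 1), times = runs)   # 1,1,1,2,2,3 for n.item = 4
    j <- i + sequence(runs)                       # 2,3,4,3,4,4 for n.item = 4
    paste("i", id[i], ".", id[j], sep = "")
}

generateIndex6(4)
# [1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004"
```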

(Note: this will be included in the future R Wiki. This is the reason 
why this email is so long: I took the opportunity to talk about code 
optimization.)

Best,

Philippe Grosjean

..............................................<°}))><........
  ) ) ) ) )
( ( ( ( (    Prof. Philippe Grosjean
  ) ) ) ) )
( ( ( ( (    Numerical Ecology of Aquatic Systems
  ) ) ) ) )   Mons-Hainaut University, Pentagone (3D08)
( ( ( ( (
..............................................................

jim holtman wrote:
> Is this what you want?  It returns a character vector with the values:
> 
> 
>>generate.index<-function(n.item){
> 
> + .return <- character()  # initialize vector
> + for (i in 1:n.item)
> +    {
> +        for (j in ((i+1):n.item))
> +            {
> + # concatenate the results
> + .return <- c(.return,
> paste("i",formatC(i,digits=2,flag="0"),".",formatC(j,digits=2,flag="0"),sep=""))
> +
> +            }
> +
> +    }
> +    .return
> +  }
> 
>>
>>generate.index(10)
> 
>  [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i001.007"
>  [7] "i001.008" "i001.009" "i001.010" "i002.003" "i002.004" "i002.005"
> [13] "i002.006" "i002.007" "i002.008" "i002.009" "i002.010" "i003.004"
> [19] "i003.005" "i003.006" "i003.007" "i003.008" "i003.009" "i003.010"
> [25] "i004.005" "i004.006" "i004.007" "i004.008" "i004.009" "i004.010"
> [31] "i005.006" "i005.007" "i005.008" "i005.009" "i005.010" "i006.007"
> [37] "i006.008" "i006.009" "i006.010" "i007.008" "i007.009" "i007.010"
> [43] "i008.009" "i008.010" "i009.010" "i010.011" "i010.010"
> 
> 
> 
> 
> On 2/4/06, Taka Matzmoto <sell_mirage_ne at hotmail.com> wrote:
> 
>>Hi R users
>>
>>I wrote a function that generates some character strings.
>>
>>generate.index<-function(n.item){
>>for (i in 1:n.item)
>>   {
>>       for (j in ((i+1):n.item))
>>           {
>>
>>
>>cat("i",formatC(i,digits=2,flag="0"),".",formatC(j,digits=2,flag="0"),"\n",sep="")
>>
>>           }
>>
>>   }
>>                               }
>>
>>I like to save what appears on the screen when I run using
>>generate.index(10) as a character vector
>>
>>I used
>>temp <- generate.index(10)
>>
>>but it didn't work.
>>
>>Could you provide some advice on this issue?
>>
>>Thanks in advance
>>
>>TM
>>
>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide!
>>http://www.R-project.org/posting-guide.html
>>
> 
> 
> 
> 
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 247 0281
> 
> What the problem you are trying to solve?



