[R] please comment on my function

jim holtman jholtman at gmail.com
Fri Sep 14 19:10:37 CEST 2012


The first thing to do is to run Rprof and see where the time is going;
here is the output from my computer:

                      self.time self.pct total.time total.pct
tolower                    4.42    39.46       4.42     39.46
sub                        3.56    31.79       3.56     31.79
nchar                      1.54    13.75       1.54     13.75
canonicalize.language      0.62     5.54      11.14     99.46
!=                         0.52     4.64       0.52      4.64
==                         0.26     2.32       0.26      2.32
&                          0.22     1.96       0.22      1.96
gc                         0.06     0.54       0.06      0.54

More than half the time (about 53%) is spent in 'tolower' and 'nchar',
so 'sub' is not the whole problem.
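
For anyone who wants to reproduce a profile like the one above, here is a
minimal sketch. The test vector, its size, and the sampling interval are
assumptions made up for illustration (the real data in the thread had about
10 million elements); the function is the original one from the question.

```r
# Hypothetical test data built from the example codes in the question.
set.seed(1)
codes <- c("aa", "bb-cc", "DD-abc", "eee", "ff_FF", "C")
x <- sample(codes, 1e6, replace = TRUE)

# Original function from the question, unchanged.
canonicalize.language <- function (s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

prof.file <- tempfile(fileext = ".out")
Rprof(prof.file, interval = 0.01)   # start the sampling profiler
res <- canonicalize.language(x)
Rprof(NULL)                         # stop profiling
prof <- summaryRprof(prof.file)
print(prof$by.self)                 # time spent in each function itself
```

The 'by.self' component is the part shown in the table above; 'by.total'
also counts time spent in the functions each one calls.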

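This version runs a little faster since it skips the 'tolower' call and
uses [[:alpha:]] to match either case. Note that the result is then no
longer lowercased, so an upper-case "C" is mapped to "unknown":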

canonicalize.language <- function (s) {
  # s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([[:alpha:]]{2})[-_][[:alpha:]]{2}$","\\1",s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}
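
To compare the version above with the original, both in output and in speed,
a small sketch like the following can be used. The names canon.orig and
canon.notolower and the size of the timing vector are hypothetical, chosen
only for this illustration; the function bodies are copied from the thread.

```r
# Original version from the question (lowercases everything first).
canon.orig <- function (s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

# Version without 'tolower' (matches either case, does not fold case).
canon.notolower <- function (s) {
  long <- nchar(s) == 5
  s[long] <- sub("^([[:alpha:]]{2})[-_][[:alpha:]]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

x <- c("aa", "bb-cc", "DD-abc", "eee", "ff_FF", "C")
a <- canon.orig(x)
b <- canon.notolower(x)
print(a)               # "aa" "bb" "unknown" "unknown" "ff" "c"
print(b)               # "aa" "bb" "unknown" "unknown" "ff" "unknown"
print(which(a != b))   # 6: only the upper-case "C" is treated differently

# Rough timing on a larger, made-up vector; numbers depend on the machine.
big <- rep(x, length.out = 1e5)
print(system.time(canon.orig(big)))
print(system.time(canon.notolower(big)))
```

If the input can contain upper-case codes, the two versions are not
interchangeable, so the speed gain has to be weighed against that.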


On Fri, Sep 14, 2012 at 12:30 PM, Sam Steingold <sds at gnu.org> wrote:
> this function is supposed to canonicalize the language:
>
> --8<---------------cut here---------------start------------->8---
> canonicalize.language <- function (s) {
>   s <- tolower(s)
>   long <- nchar(s) == 5
>   s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$","\\1",s[long])
>   s[nchar(s) != 2 & s != "c"] <- "unknown"
>   s
> }
> canonicalize.language(c("aa","bb-cc","DD-abc","eee","ff_FF","C"))
> [1] "aa"      "bb"      "unknown" "unknown" "ff"      "c"
> --8<---------------cut here---------------end--------------->8---
>
> it does what I want it to do, but it takes 4.5 seconds on a vector of
> length 10,256,341 - I wonder if I might be doing something awfully stupid.
> I thought that sub() was slow, but my second attempt:
> --8<---------------cut here---------------start------------->8---
> canonicalize.language <- function (s) {
>   s <- tolower(s)
>   good <- nchar(s) == 5 & substr(s,3,3) %in% c("_","-")
>   s[good] <- substr(s[good],1,2)
>   s[nchar(s) != 2 & s != "c"] <- "unknown"
>   s
> }
> --8<---------------cut here---------------end--------------->8---
> was even slower (6.4 sec).
>
> My two concerns are:
>
> 1. avoid allocating many small objects which are never collected
> 2. run fast
>
> Which would be the best implementation?
>
> Thanks a lot for your insight!
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.



