[R] Profiling question: string formatting extremely slow

jim holtman jholtman at gmail.com
Wed Mar 18 18:09:37 CET 2009


Try it this way: work on the whole vector at once instead of one element
at a time.  It took less than 1 second for 50,000 elements.

> system.time({
+     x <- sample(50000)  # test data
+     x[sample(50000,10000)] <- 'asdfasdf'  # character strings
+     is.num <- grepl("^ *[0-9]+ *$", x)  # TRUE where the entry is a number
+     # zero-pad numbers to 18 characters
+     x[is.num] <- sprintf("%018.0f", as.numeric(x[is.num]))
+     # uppercase the rest; a logical mask also works when nothing is numeric
+     x[!is.num] <- toupper(x[!is.num])
+ })
   user  system elapsed
   0.25    0.00    0.25
>
> head(x, 30)
 [1] "000000000000026550" "000000000000019100" "000000000000045961"
 [4] "000000000000031473" "000000000000005031" "000000000000012266"
 [7] "000000000000034418" "000000000000042279" "000000000000041193"
[10] "ASDFASDF"           "000000000000005760" "000000000000035659"
[13] "ASDFASDF"           "000000000000008420" "000000000000042220"
[16] "ASDFASDF"           "000000000000039903" "000000000000032234"
[19] "000000000000024125" "000000000000032970" "000000000000006814"
[22] "000000000000000215" "ASDFASDF"           "000000000000045239"
[25] "ASDFASDF"           "ASDFASDF"           "000000000000043065"
[28] "ASDFASDF"           "000000000000007642" "000000000000019196"
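
The same vectorized idea can be wrapped in a reusable function. This is
only a sketch and was not part of the timing above: the name format_pn and
its width argument are made up for illustration. It pads with string
operations rather than as.numeric(), on the assumption that part numbers
may have more than about 15 digits, which a double cannot hold exactly.
It needs strrep(), so R 3.3.0 or later.

```r
# Hypothetical helper; name and arguments are assumptions, not from the thread
format_pn <- function(pn, width = 18) {
  pn <- gsub("^\\s+|\\s+$", "", pn)   # trim blanks at both ends
  is_num <- grepl("^[0-9]+$", pn)     # TRUE for digits-only strings
  # zeros needed in front of each numeric string (never negative)
  n_pad <- pmax(width - nchar(pn[is_num]), 0)
  pn[is_num] <- paste0(strrep("0", n_pad), pn[is_num])
  pn[!is_num] <- toupper(pn[!is_num]) # everything else goes to uppercase
  pn
}

format_pn(c(" 215 ", "asdfasdf", "123456789012345678", "ab-12"))
```

Because the numeric strings are never converted to numbers, an 18-digit
part number comes back unchanged.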


On Wed, Mar 18, 2009 at 12:16 PM, Olivier Boudry
<olivier.boudry at gmail.com> wrote:
> Hi all,
>
> I'm using R to find duplicates in a set of 6 files containing Part Number
> information. Before applying the intersect method to identify the duplicates,
> I need to normalize the P/Ns: convert the P/N to uppercase if it is
> alphanumeric, and zero-pad it to 18 characters if it is numeric.
>
> When I apply the pn_formatting function (see code below) to the "Part Number"
> column of the data.frame (character vectors up to 18 characters long), it
> consumes a lot of memory: my computer (Windows XP SP3) starts to swap, CPU
> usage drops to zero, and the run takes hours to complete. Part Number columns
> can have from 7'000 to 80'000 records, and I have never had the patience to
> wait for a run of more than 17'000 records to finish.
>
> Is there a way to find out which of the functions used below is the
> bottleneck: as.integer, is.na, sub, paste, nchar, or toupper? Is there a
> profiler for R and, if so, where can I find documentation on how to
> use it?
>
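
On the profiler question: R ships with a sampling profiler, Rprof(), and
summaryRprof() summarises its output. Both are documented in ?Rprof and in
the "Tidying and profiling R code" chapter of "Writing R Extensions". A
minimal sketch follows. The test data and the format_one function are
stand-ins invented for illustration, and it assumes the formatting is
applied one element at a time.

```r
# Stand-in data: hypothetical part numbers, some numeric, some not
set.seed(42)
pn <- as.character(sample(100000))
pn[sample(100000, 20000)] <- " asdfasdf "

# Stand-in for an element-wise formatter (not the original pn_formatting)
format_one <- function(s) {
  s <- gsub("^\\s+|\\s+$", "", s)
  if (grepl("^[0-9]+$", s)) {
    paste0(strrep("0", max(18 - nchar(s), 0)), s)
  } else {
    toupper(s)
  }
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start sampling the call stack
res <- vapply(pn, format_one, character(1), USE.NAMES = FALSE)
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)    # time spent in each function itself
head(prof$by.total)   # time including the functions it calls
```

The by.self table is usually the place to look first, since it attributes
time to the function that was actually running when each sample was taken.
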
> The code:
>
> # String contains digits only (can be converted to an integer)
> digits_only <- function(x) { suppressWarnings(!is.na(as.integer(x))) }
>
> # Remove blanks at both ends of a string
> trim <- function (x) {
>  sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)
> }
>
> # P/N formatting
> pn_formatting <- function(pn_in) {
>
>  pn_out = trim(pn_in)
>  if (digits_only(pn_out)) {
>
>    # Zero padding
>    pn_out <- paste("000000000000000000", pn_out, sep="")
>    pn_len <- nchar(pn_out)
>    pn_out <- substr(pn_out, pn_len - 17, pn_len)
>
>  } else {
>    # Uppercase
>    pn_out <- toupper(pn_out)
>  }
>  pn_out
> }
>
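
A quick check of the two helpers quoted above, as a sketch with made-up
input values. It suggests two correctness problems besides speed:
as.integer() returns NA for anything beyond R's 32-bit integer range, so
long numeric part numbers are treated as non-numeric, and the trim pattern
can only match a string that starts with whitespace.

```r
# The two helpers exactly as posted
digits_only <- function(x) { suppressWarnings(!is.na(as.integer(x))) }
trim <- function (x) {
  sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)
}

# Illustrative inputs (assumed, not taken from the poster's data)
digits_only("12345")           # TRUE
digits_only("123456789012")    # FALSE: exceeds .Machine$integer.max
trim("abc  ")                  # returned unchanged, trailing blanks remain

# A pattern test has no integer range limit
grepl("^[0-9]+$", "123456789012")   # TRUE
```
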
> Thanks,
>
> Olivier.



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



