[R] Profiling question: string formatting extremely slow
jim holtman
jholtman at gmail.com
Wed Mar 18 18:09:37 CET 2009
Try it this way. It took less than 1 second for 50,000 values:
> system.time({
+ x <- sample(50000) # test data
+ x[sample(50000,10000)] <- 'asdfasdf' # character strings
+ which.num <- grep("^[ 0-9]+$", x) # find numbers
+ # zero-pad numbers to 18 characters
+ x[which.num] <- sprintf("%018.0f", as.numeric(x[which.num]))
+ x[-which.num] <- toupper(x[-which.num])
+ })
   user  system elapsed
   0.25    0.00    0.25
> head(x,30)
 [1] "000000000000026550" "000000000000019100" "000000000000045961"
 [4] "000000000000031473" "000000000000005031" "000000000000012266"
 [7] "000000000000034418" "000000000000042279" "000000000000041193"
[10] "ASDFASDF"           "000000000000005760" "000000000000035659"
[13] "ASDFASDF"           "000000000000008420" "000000000000042220"
[16] "ASDFASDF"           "000000000000039903" "000000000000032234"
[19] "000000000000024125" "000000000000032970" "000000000000006814"
[22] "000000000000000215" "ASDFASDF"           "000000000000045239"
[25] "ASDFASDF"           "ASDFASDF"           "000000000000043065"
[28] "ASDFASDF"           "000000000000007642" "000000000000019196"
>
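The same vectorized idea can be wrapped in a reusable function. The following is only a sketch, not code from this thread: the name normalize_pn and its width argument are hypothetical, and it assumes R >= 3.3.0 (for strrep; trimws needs R >= 3.2.0). It pads with string operations instead of sprintf("%018.0f", as.numeric(...)), because a double represents integers exactly only up to 2^53 (about 16 digits), so 17- or 18-digit part numbers could be altered on the way through as.numeric. It also uses a logical mask from grepl rather than negative indices, since x[-which.num] selects nothing when which.num is empty. Unlike the original substr approach, digit strings already at or beyond the target width are left unchanged.

```r
# Hypothetical helper (not from the original thread): vectorized P/N
# normalization using string operations only.
normalize_pn <- function(pn, width = 18L) {
  pn <- trimws(as.character(pn))        # strip blanks at both ends
  is_num <- grepl("^[0-9]+$", pn)       # TRUE where the P/N is digits only
  short <- is_num & nchar(pn) < width   # digit strings that need padding

  # Left-pad with zeros; the digits never pass through a double,
  # so long part numbers keep every digit.
  pn[short] <- paste0(strrep("0", width - nchar(pn[short])), pn[short])

  # Everything that is not purely numeric is upper-cased.
  pn[!is_num] <- toupper(pn[!is_num])
  pn
}

normalize_pn(c(" 215 ", "asdfasdf", "123456789012345678", "ab-12"))
```

Because every step operates on the whole vector at once, the function can be applied directly to a data.frame column without sapply or an explicit loop.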
On Wed, Mar 18, 2009 at 12:16 PM, Olivier Boudry
<olivier.boudry at gmail.com> wrote:
> Hi all,
>
> I'm using R to find duplicates in a set of 6 files containing Part Number
> information. Before applying the intersect method to identify the duplicates
> I need to normalize the P/Ns: converting the P/N to uppercase if it is
> alphanumeric, and zero-padding it to 18 characters if it is numeric.
>
> When I apply the pn_formatting function (see code below) to the "Part Number"
> column of the data.frame (character vectors up to 18 characters long), it
> consumes a lot of memory: my computer (Windows XP SP3) starts to swap, CPU
> usage goes to zero, and the run takes hours to complete. Part Number columns
> can have from 7'000 to 80'000 records, and I've never had enough patience to
> wait for completion of more than 17'000 records.
>
> Is there a way to find out which of the functions used below is the
> bottleneck: as.integer, is.na, sub, paste, nchar, or toupper? Is there a
> profiler for R, and if so, where could I find documentation on how to
> use it?
>
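On the profiler question: R ships a sampling profiler, Rprof(), with summaryRprof() to summarize its output (both in the utils package; see ?Rprof and the profiling chapter of the "Writing R Extensions" manual). The sketch below is illustrative only. The workload is a made-up element-by-element loop, not the poster's actual call, and the 0.01 second sampling interval is an arbitrary choice.

```r
# Profile a deliberately slow, element-by-element loop (illustrative workload).
prof_file <- tempfile(fileext = ".out")
x <- as.character(sample(50000))

Rprof(prof_file, interval = 0.01)   # start sampling the call stack
y <- sapply(x, function(s) toupper(sub("^\\s+|\\s+$", "", s)))
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)                  # functions ranked by time spent in their own code
head(prof$by.total)                 # ranked by time including the functions they call
```

If by.self is dominated by sapply, lapply or the anonymous function rather than by sub or toupper, the cost is most likely the per-element calling overhead, which points toward a vectorized rewrite.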
> The code:
>
> # String contains digits only (can be converted to an integer)
> digits_only <- function(x) { suppressWarnings(!is.na(as.integer(x))) }
>
> # Remove blanks at both ends of a string
> trim <- function (x) {
>   sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)
> }
>
> # P/N formatting
> pn_formatting <- function(pn_in) {
>
>   pn_out = trim(pn_in)
>   if (digits_only(pn_out)) {
>
>     # Zero padding
>     pn_out <- paste("000000000000000000", pn_out, sep="")
>     pn_len <- nchar(pn_out)
>     pn_out <- substr(pn_out, pn_len - 17, pn_len)
>
>   } else {
>     # Uppercase
>     pn_out <- toupper(pn_out)
>   }
>   pn_out
> }
>
> Thanks,
>
> Olivier.
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?