[R] vectorized sub, gsub, grep, etc.
Adam Erickson
adam.michael.erickson at gmail.com
Wed Jul 29 23:42:23 CEST 2015
Further refining the vectorized (within a loop) exact string match
function, I get times below 0.9 seconds while maintaining error checking.
This is accomplished by removing which() and replacing 1:length() with
seq_along().
sub2 <- function(pattern, replacement, x) {
len <- length(x)
y <- character(length=len)
patlen <- length(pattern)
replen <- length(replacement)
if(patlen != replen) stop('Error: Pattern and replacement length do not
match')
for(i in seq_along(pattern)) {
y[x==pattern[i]] <- replacement[i]
}
return(y)
}
system.time(for(i in 1:50000) sub2(patt, repl, X))
user system elapsed
0.86 0.00 0.86
Since the ordered vectors are perfectly aligned, might as well do an exact
string match. Hence, I think this is not off-topic.
Cheers,
Adam
On Wednesday, July 29, 2015 at 8:15:52 AM UTC-7, Bert Gunter wrote:
>
> There is confusion here. apply() family functions are **NOT**
> vectorization -- they ARE loops (at the interpreter level), just done
> in "functionalized" form. Please read background material (John
> Chambers's books, MASS, or numerous others) to improve your
> understanding and avoid posting erroneous comments.
>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
> -- Clifford Stoll
>
>
> On Tue, Jul 28, 2015 at 3:00 PM, John Thaden <jjth... at flash.net
> <javascript:>> wrote:
> > Adam, The method you propose gives a different result than the prior
> methods for these example vectors
> > X <- c("ab", "cd", "ef")
> > patt <- c("b", "cd", "a")
> > repl <- c("B", "CD", "A")
> >
> > Old method 1
> >
> > mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl,
> x=X)
> > gives
> > b cd a
> > "aB" "CD" "ef"
> >
> > Old method 2
> >
> > sub2 <- function(pattern, replacement, x) {
> > len <- length(x)
> > if (length(pattern) == 1)
> > pattern <- rep(pattern, len)
> > if (length(replacement) == 1)
> > replacement <- rep(replacement, len)
> > FUN <- function(i, ...) {
> > sub(pattern[i], replacement[i], x[i], fixed = TRUE)
> > }
> > idx <- 1:length(x)
> > sapply(idx, FUN)
> > }
> > sub2(patt, repl, X)
> > gives
> > [1] "aB" "CD" "ef"
> >
> > Your method (I gave it the unique name "sub3")
> > sub3 <- function(pattern, replacement, x) { len <- length(x) y
> <- character(length=len) patlen <- length(pattern) replen <-
> length(replacement) if(patlen != replen) stop('Error: Pattern and
> replacement length do not match') for(i in 1:replen) {
> y[which(x==pattern[i])] <- replacement[i] } return(y)}sub3(patt, repl,
> X)
> > gives[1] "" "CD" ""
> >
> > Granted, whatever it does, it does it faster
> > #Old method 1
> > system.time(for(i in 1:50000)
> > mapply(function(p,r,x) sub(p,r,x, fixed = TRUE),p=patt,r=repl,x=X))
> > user system elapsed
> > 2.53 0.00 2.52
> >
> > #Old method 2
> > system.time(for(i in 1:50000)sub2(patt, repl, X)) user system elapsed
> > 2.32 0.00 2.32
> >
> > #Your proposed method
> > system.time(for(i in 1:50000) sub3(patt, repl, X))
> > user system elapsed
> > 1.02 0.00 1.01
> > but would it still be faster if it actually solved the same problem?
> >
> > -John Thaden
> >
> >
> >
> >
> > On Monday, July 27, 2015 11:40 PM, Adam Erickson <
> adam.micha... at gmail.com <javascript:>> wrote:
> >
> > I know this is an old thread, but I wrote a simple FOR loop with
> vectorized pattern replacement that is much faster than either of those (it
> can also accept outputs differing in length from the patterns):
> > sub2 <- function(pattern, replacement, x) { len <- length(x)
> y <- character(length=len) patlen <- length(pattern) replen <-
> length(replacement) if(patlen != replen) stop('Error: Pattern and
> replacement length do not match') for(i in 1:replen) {
> y[which(x==pattern[i])] <- replacement[i] } return(y) }
> > system.time(test <- sub2(patt, repl, XX)) user system elapsed 0
> 0 0
> > Cheers,
> > Adam
> > On Wednesday, October 8, 2008 at 9:38:01 PM UTC-7, john wrote:
> > Hello Christos,
> > To my surprise, vectorization actually hurt processing speed!#Example
> > X <- c("ab", "cd", "ef")
> > patt <- c("b", "cd", "a")
> > repl <- c("B", "CD", "A")sub2 <- function(pattern, replacement, x) {
> > len <- length(x)
> > if (length(pattern) == 1)
> > pattern <- rep(pattern, len)
> > if (length(replacement) == 1)
> > replacement <- rep(replacement, len)
> > FUN <- function(i, ...) {
> > sub(pattern[i], replacement[i], x[i], fixed = TRUE)
> > }
> > idx <- 1:length(x)
> > sapply(idx, FUN)
> > }
> >
> > system.time( for(i in 1:10000) sub2(patt, repl, X) )
> > user system elapsed
> > 1.18 0.07 1.26 system.time( for(i in 1:10000)
> mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X)
> )
> > user system elapsed
> > 1.42 0.05 1.47
> >
> > So much for avoiding loops.
> > John Thaden======= At 2008-10-07, 14:58:10 Christos wrote: =======>John,
> >>Try the following:
> >>
> >> mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl,
> x=X)
> >> b cd a
> >>"aB" "CD" "ef"
> >>
> >>-Christos>> -----My Original Message-----
> >>> R pattern-matching and replacement functions are
> >>> vectorized: they can operate on vectors of targets.
> >>> However, they can only use one pattern and replacement.
> >>> Here is code to apply a different pattern and replacement for
> >>> every target. My question: can it be done better?
> >>>
> >>> sub2 <- function(pattern, replacement, x) {
> >>> len <- length(x)
> >>> if (length(pattern) == 1)
> >>> pattern <- rep(pattern, len)
> >>> if (length(replacement) == 1)
> >>> replacement <- rep(replacement, len)
> >>> FUN <- function(i, ...) {
> >>> sub(pattern[i], replacement[i], x[i], fixed = TRUE)
> >>> }
> >>> idx <- 1:length(x)
> >>> sapply(idx, FUN)
> >>> }
> >>>
> >>> #Example
> >>> X <- c("ab", "cd", "ef")
> >>> patt <- c("b", "cd", "a")
> >>> repl <- c("B", "CD", "A")
> >>> sub2(patt, repl, X)
> >>>
> >>> -John______________________________________________
> > R-h... at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
> >
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-h... at r-project.org <javascript:> mailing list -- To UNSUBSCRIBE and
> more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-h... at r-project.org <javascript:> mailing list -- To UNSUBSCRIBE and
> more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list