[R] Is there a way to vectorize this? [with correction]

Gabor Grothendieck ggrothendieck at gmail.com
Sat Nov 1 18:47:44 CET 2008


Here is a function whose arguments are similar to those of gsub.  The
first is the pattern, in which the portion to be replaced should be
enclosed in parentheses; the others are the replacement string and the text:

library(gsubfn)
replace.in.context <- function(pattern, replacement, x, ...) {
	gsubfn(pattern, m + b ~ sub(b, replacement, m), x, backref = 1, ...)
}

txt <- "algnmark  align=right algnmark"
new.align <- "left"
replace.in.context("algnmark  align=([a-z]+) algnmark", new.align, txt)


On Sat, Nov 1, 2008 at 12:20 PM, Duncan Temple Lang
<duncan at wald.ucdavis.edu> wrote:
>
>
> Nutter, Benjamin wrote:
>>
>> ** Sorry to repost.  I forgot to include a function necessary to make
>> the example work **
>>
>> I apologize up front for this being a little long. I hope it's
>> understandable.  Please let me know if I need to clarify anything.
>>
>> Several months ago I wrote a series of functions to help me take my R
>> analyses and build custom reports in html files.  Each function either
>> builds or modifies a string of html code that can then be written to a
>> file to produce the desired output.
>>
>> To make modifications in the html code, I've placed 'markers' around
>> certain characteristics that I might want to change.  For instance, the
>> alignment characteristics have an 'algnmark' on either side of them.
>> When I wish to change the alignment, I can find where these markers are,
>> determine their location, and replace the contents between them.
>> I've been using the functions for a few months now, and am pleased with
>> the utility.  Unfortunately, as I was writing these, I wasn't very
>> strong with my vectorization skills and relied on for loops (lots of for
>> loops) to get through the work.  So while I'm pleased with the utility,
>> I've been trying to optimize the functions by vectorizing the for loops.
>>
>> At this point, I've hit a small snag.  I have a situation where I can't
>> seem to figure out how to vectorize the loop.  Part of me wonders if it
>> is even possible.
>> The scenario is this:  I run a string of code through the loop, on each
>> pass, the section of code in need of modification is identified and the
>> changes are made.  When this is done, however, the length of the string
>> changes.  The change in length needs to be recognized in the next pass
>> through the loop.
>
> At a quick glance, it seems you are merely trying to transform each instance of
>
>  algnmark  align=left algnmark
>
> to
>
>  algnmark  align=right algnmark
>
> If so, you are going about this in an unnecessarily complicated manner.
>
> html.text = function(text, new.align)
>  gsub("algnmark  align=[a-z]+ algnmark",
>       paste("algnmark  align=", new.align, " algnmark", sep = ""),
>       text)

Here are a few alternatives.  For all of them we assume:

txt <- "algnmark  align=right algnmark"
new.align <- "left"

Their main advantage is that the context need not be
written out twice, which might help avoid errors:

1. This solution avoids repeating the context explicitly:

gsub("(algnmark  align=)[a-z]+( algnmark)",
  paste("\\1", new.align, "\\2", sep = ""), txt)
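
Since the original question was about vectorizing, here is a minimal
sketch of the same call applied to a character vector with more than one
marker per element.  The vector html and its contents are made up for
illustration and are not from the thread:

```r
# Hypothetical input: two strings, the first with two marked alignments.
html <- c(paste("<td algnmark  align=right algnmark>a</td>",
                "<td algnmark  align=center algnmark>b</td>", sep = ""),
          "<th algnmark  align=right algnmark>c</th>")
new.align <- "left"

# gsub is vectorized over its text argument and replaces every match,
# so no loop or position bookkeeping is needed even though the length
# of each string changes.
out <- gsub("(algnmark  align=)[a-z]+( algnmark)",
            paste("\\1", new.align, "\\2", sep = ""), html)
print(out)
```

Every occurrence in every element is rewritten in a single call.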

2. Zero-width Perl assertions (lookbehind and lookahead) could be used here:

gsub("(?<=algnmark  align=)[a-z]+(?= algnmark)", new.align, txt, perl = TRUE)

This has the advantage that the replacement string is just new.align but
does require marking up the regexp slightly more.
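
As a sketch only, the lookaround form could be wrapped in a small helper
so the marker is written once.  The name set.align and its marker
argument are assumptions for illustration, not something from the thread,
and the marker is assumed to contain no regexp metacharacters:

```r
# Hypothetical helper (name and arguments are assumptions).
set.align <- function(text, new.align, marker = "algnmark") {
  # The lookbehind and lookahead match the surrounding context without
  # consuming it, so only the alignment value itself is replaced.
  pattern <- paste("(?<=", marker, "  align=)[a-z]+(?= ", marker, ")",
                   sep = "")
  gsub(pattern, new.align, text, perl = TRUE)
}

txt <- "algnmark  align=right algnmark"
res <- set.align(txt, "left")
print(res)
```

The lookbehind must have a fixed length in Perl-compatible regexps, which
holds here because the marker text is literal.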

3. Another possibility is to use the gsubfn package.  gsubfn is like
gsub except that the replacement can be a function.  The portion of the
string matched by the parenthesized part of the regular expression is
known as the back reference, and the entire string matched by the regular
expression is called the match.  backref = 1 says to pass the match and
1 back reference to the function.  gsubfn accepts a formula notation for
functions (or ordinary notation), and using that we define the function
to use sub to replace the back reference with new.align in the match:

library(gsubfn)
gsubfn("algnmark  align=([a-z]+) algnmark",
  m + b ~ sub(b, new.align, m), txt, backref = 1)

This gives a regexp which is nearly as simple as Duncan's while
avoiding explicit repetition of the context in the replacement.


