[R] Selecting columns whose names contain "mutated" except when they also contain "non" or "un"

Greg Snow 538280 at gmail.com
Thu Apr 26 20:55:12 CEST 2012


Sorry I took so long getting back to this, but the paying job needs to
take priority.

The regular expression "(?<!un)(?<!non)muta" matches "muta" only where
the characters immediately before it are neither "un" nor "non".  More
specifically, the regular expression engine steps through the string
and tries the match at each position.  At a given position it first
checks whether "un" comes just before that position; if it does, this
position cannot match and the engine moves on to the next one.  If it
does not, the engine goes on to the second negative lookbehind and
checks whether "non" comes just before the position.  If neither "un"
nor "non" is just before the position, it then starts matching the
characters after the position to see whether they are "muta".
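
As a quick check of the description above, here is a small sketch that
reuses the tmp vector from Paul's message (quoted below).  The name
"hits" is only an illustrative variable name.

```r
# Test vector taken from the quoted message
tmp <- c("mutation", "nonmutated", "unmutated", "verymutated", "other")

# Negative lookbehinds need the Perl-compatible engine, hence perl = TRUE
hits <- grep("(?<!un)(?<!non)muta", tmp, perl = TRUE)

hits       # positions of the elements that match
tmp[hits]  # the matching strings themselves
```

Elements 2 and 3 are rejected because "muta" is directly preceded by
"non" and "un"; "verymutated" still matches because "very" is neither.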

The next pattern is "(?!muta)non|un".  The (?!muta) is a negative
lookahead: it starts at the current position and checks that the next
characters are not "muta" (without including them in the match).  Here
it is a no-op, because you are asking for a position where the next
characters are not "muta" but are "non", and since the next characters
cannot be both, this is the same as just matching "non".  You also
need to be aware of operator precedence: in that pattern the (?!muta)
part applies only to the "non", not to the "un".  So the pattern
matches any string that contains "non" or "un" anywhere, whether or
not "muta" follows.
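
A small sketch of that point, using a made-up data frame (the column
names and values are hypothetical, not Paul's data).  It shows that the
pattern selects the same columns as plain "non|un", including an
unrelated character column, which is one possible (unconfirmed) reason
rowSums() would complain that 'x' must be numeric.

```r
# Hypothetical data frame, for illustration only
df <- data.frame(nonmutated = c(1, 0),
                 unmutated  = c(0, 1),
                 country    = c("US", "CA"),  # name happens to contain "un"
                 mutation   = c(1, 1),
                 stringsAsFactors = FALSE)

idx_a <- grep("(?!muta)non|un", names(df), perl = TRUE)
idx_b <- grep("non|un", names(df))

identical(idx_a, idx_b)           # the lookahead changes nothing here
names(df)[idx_a]                  # columns the pattern picks up
sapply(df[idx_a], is.numeric)     # which of those are numeric
```

If a non-numeric column turns up in the selection, tightening the
pattern (or checking is.numeric first) would be the thing to try.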

To match "nonmuta" or "unmuta", a simple pattern would just be
"(non|un)muta" or "(no|u)nmuta".  You could use a positive
lookbehind (you would still need an "or"), but that would be overkill
for a grep command.  Positive lookahead/lookbehind matters more when
replacing: the lookahead/lookbehind text has to be present for the
match to happen, but it is not part of the matched text that gets
replaced.
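
To illustrate the difference, here is a sketch with the same tmp vector
from the quoted message.  The replacement text "MUTA" and the variable
names are arbitrary choices for the example.

```r
tmp <- c("mutation", "nonmutated", "unmutated", "verymutated", "other")

# For selecting, the plain alternation is enough
sel <- grep("(non|un)muta", tmp)

# For replacing, the prefix is consumed by the plain pattern ...
plain <- sub("(non|un)muta", "MUTA", tmp)

# ... but kept by the positive lookbehind, which only asserts it is there
kept <- sub("(?<=non|un)muta", "MUTA", tmp, perl = TRUE)

sel
plain
kept
```

With the plain pattern "nonmutated" becomes "MUTAted"; with the
lookbehind it becomes "nonMUTAted".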



On Tue, Apr 24, 2012 at 7:40 AM, Paul Miller <pjmiller_57 at yahoo.com> wrote:
> Hi Greg,
>
> This is quite helpful. I'm not so good yet with regular expressions in general or Perl-like regular expressions. I found the help page, though, and think I was able to determine how the code works as well as how I would select only instances where "muta" is preceded by either "non" or "un".
>
>> (tmp <- c('mutation','nonmutated','unmutated','verymutated','other'))
> [1] "mutation"    "nonmutated"  "unmutated"   "verymutated" "other"
>
>> grep("(?<!un)(?<!non)muta", tmp, perl=TRUE)
> [1] 1 4
>
>> grep("(?!muta)non|un", tmp, perl=TRUE)
> [1] 2 3
>
> Did I get the second grep right?
>
> If so, do you have any sense of why it seems to fail when I apply it to my data?
>
>> KRASyn$NonMutant_comb <- rowSums(KRASyn[grep("(?!muta)non|un", names(KRASyn), perl=TRUE)])
>
> Error in rowSums(KRASyn[grep("(?!muta)non|un", names(KRASyn), perl = TRUE)]) :
>  'x' must be numeric
>
> Thanks,
>
> Paul
>



-- 
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com


