[Rd] FR: valid_regex() to test string validity as a regular expression

Toby Hocking tdhock5 at gmail.com
Thu Oct 12 01:02:06 CEST 2023


Hi Michael, it sounds like you don't want to use a CRAN package for
this, but you could try re2; see the last example below.

> grepl("(invalid","subject",perl=TRUE)
Error in grepl("(invalid", "subject", perl = TRUE) :
  invalid regular expression '(invalid'
In addition: Warning message:
In grepl("(invalid", "subject", perl = TRUE) :
  PCRE pattern compilation error
    'missing closing parenthesis'
    at ''

> grepl("(invalid","subject",perl=FALSE)
Error in grepl("(invalid", "subject", perl = FALSE) :
  invalid regular expression '(invalid', reason 'Missing ')''
In addition: Warning message:
In grepl("(invalid", "subject", perl = FALSE) :
  TRE pattern compilation error 'Missing ')''

> re2::re2_regexp("(invalid")
Error: missing ): (invalid
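
For comparison, the base R errors above can also be captured without
re2. This is only a minimal sketch: regex_error() is a hypothetical
helper name (not part of base R or re2), and it assumes that matching
against an empty subject is enough to trigger pattern compilation.

```r
# Hypothetical helper (not part of base R): check a pattern by matching
# it against an empty subject and capturing any error that results.
regex_error <- function(pattern, perl = FALSE) {
  stopifnot(is.character(pattern), length(pattern) == 1L, !is.na(pattern))
  tryCatch(
    {
      # The compilation warning accompanies the error; muffle it so that
      # only the error reaches the handler below.
      suppressWarnings(grepl(pattern, "", perl = perl))
      NULL
    },
    error = function(e) conditionMessage(e)
  )
}

regex_error("(invalid")               # error message from the TRE engine
regex_error("(invalid", perl = TRUE)  # error message from the PCRE engine
regex_error("valid")                  # NULL
```

Returning the message rather than TRUE/FALSE lets the caller report why
the pattern was rejected, for example when a config file is loaded.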

On Tue, Oct 10, 2023 at 7:57 AM Michael Chirico via R-devel
<r-devel using r-project.org> wrote:
>
> > Grepping an empty string might work in many cases...
>
> That's precisely why a base R offering is important, as a surer way of
> validating in all cases. To be clear I am trying to directly access the
> results of tre_regcomp().
>
> > it is probably more portable to simply be prepared to propagate such
> errors from the actual use on real inputs
>
> That works best in self-contained calls -- foo(re) and we execute re inside
> foo().
>
> But the specific context where I found myself looking for a regex validator
> is more complicated (https://github.com/r-lib/lintr/pull/2225). User
> supplies a regular expression in a configuration file, only "later" is it
> actually supplied to grepl().
>
> Until now, we've followed your suggestion and just surfaced the regex error
> at run time. But our goal is to be friendlier and fail earlier, at "compile
> time" as the config is loaded, "long" before any regex is actually executed.
>
> At a bare minimum this is a good place to return a classed warning (say
> invalid_regex_warning) to allow finer control than tryCatch(condition=).
>
> On Mon, Oct 9, 2023, 11:30 PM Tomas Kalibera <tomas.kalibera using gmail.com>
> wrote:
>
> >
> > On 10/10/23 01:57, Michael Chirico via R-devel wrote:
> >
> > It will be useful to package authors trying to validate input which is
> > supposed to be a valid regular expression.
> >
> > As near as I can tell, the only way we can do so now is to run any
> > regex function and check for the warning and/or condition to bubble
> > up:
> >
> > valid_regex <- function(str) {
> >   stopifnot(is.character(str), length(str) == 1L)
> >   !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
> > }
> >
> > That's pretty hefty/inscrutable for such a simple validation. I see a
> > variety of similar approaches in CRAN packages [1], all slightly
> > different. It would be good for R to expose a "canonical" way to run
> > this validation.
> >
> > At root, the problem is that R does not expose the regex compilation
> > routines like 'tre_regcomp', so from the R side we have to resort to
> > hacky approaches.
> >
> > Hi Michael,
> >
> > I don't think you need compilation functions for that. If a regular
> > expression is found invalid by a specific third-party library R uses, the
> > library should return an error to R, R should return an error to you,
> > and you should probably propagate that to your users. Grepping an empty
> > string might work in many cases as a test, but it is probably more portable
> > to simply be prepared to propagate such errors from the actual use on real
> > inputs. In theory, an optimization for a particular case could make the
> > checking differ from the actual use, but the same would be true of a
> > separate compilation step used for checking.
> >
> > Things get slightly complicated by encoding/useBytes modes
> > (tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb,
> > tre_regncompb; all in tre.h), but all are already present in other
> > regex routines, so this is doable.
> >
> > Re encodings: R strings should simply be valid in their encoding. This is
> > true not just for regular expressions but for anything else. You shouldn't
> > assume that R can handle invalid strings in any reasonable way. You
> > definitely shouldn't add invalid strings to tests: behavior with invalid
> > strings is unspecified. To test whether a string is valid, there is
> > validEnc() (or validUTF8()). But, again, it is probably safest to propagate
> > errors from the regular expression R functions (in case the checks differ,
> > particularly for non-UTF-8). Also, duplicating the encoding checks can add
> > non-trivial overhead.
> >
> > If there were a strong need for an automated way to classify errors
> > coming specifically from the regex libraries, perhaps R could attach
> > classes to them when the library reports them.
> >
> > Tomas
> >
> > Exposing a function to compile regular expressions is common in other
> > languages, e.g. Go [2], Python [3], JavaScript [4].
> >
> > [1] https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code
> > [2] https://pkg.go.dev/regexp#Compile
> > [3] https://docs.python.org/3/library/re.html#re.compile
> > [4] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >
>


