[Rd] FR: valid_regex() to test string validity as a regular expression

Michael Chirico ch|r|com @end|ng |rom goog|e@com
Tue Oct 10 16:57:12 CEST 2023


> Grepping an empty string might work in many cases...

That's precisely why a base R offering is important: it would be a surer way
of validating in all cases. To be clear, I am trying to directly access the
results of tre_regcomp().

> it is probably more portable to simply be prepared to propagate such
> errors from the actual use on real inputs

That works best in self-contained calls, where the user calls foo(re) and re
is executed inside foo().

But the specific context where I found myself looking for a regex validator
is more complicated (https://github.com/r-lib/lintr/pull/2225). The user
supplies a regular expression in a configuration file, and only "later" is
it actually passed to grepl().

Until now, we've followed your suggestion and just surfaced the regex error
at run time. But our goal is to make it friendlier and fail earlier, at
"compile time" as the config is loaded, "long" before any regex is actually
executed.
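
A minimal sketch of that load-time check, using only what base R offers
today (the tryCatch() around grepl() approach). The function name
validate_config_regexes() and the shape of the config (a named list of
patterns) are assumptions made for illustration, not lintr's actual code:

```r
# Hypothetical helper: check every pattern in a named list as the config is loaded.
# It relies on grepl() signalling a warning and/or error for an invalid pattern.
validate_config_regexes <- function(config) {
  for (name in names(config)) {
    pattern <- config[[name]]
    ok <- tryCatch(
      {
        grepl(pattern, "")  # result is ignored; we only care whether it signals
        TRUE
      },
      condition = function(cond) FALSE
    )
    if (!ok) {
      stop(
        sprintf("Config field '%s' is not a valid regular expression: %s", name, pattern),
        call. = FALSE
      )
    }
  }
  invisible(config)
}
```

This fails as soon as the config is read, but it still depends on grepl()
compiling the pattern the same way the later "real" call will.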

At a bare minimum, this is a good place to signal a classed warning (say,
invalid_regex_warning) to allow finer control than tryCatch(condition=).
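
To illustrate what a classed condition would buy callers, here is a rough
user-level emulation. Base R does not provide the invalid_regex_warning
class; check_regex() and the pattern field on the condition are likewise
hypothetical names used only for this sketch:

```r
# Hypothetical wrapper: re-signal any regex failure as a classed warning.
# "invalid_regex_warning" is an illustrative class, not part of base R.
check_regex <- function(pattern) {
  stopifnot(is.character(pattern), length(pattern) == 1L)
  tryCatch(
    {
      grepl(pattern, "")
      TRUE
    },
    condition = function(cond) {
      warning(structure(
        class = c("invalid_regex_warning", "warning", "condition"),
        list(message = conditionMessage(cond), call = NULL, pattern = pattern)
      ))
      FALSE  # reached only if no handler exits first
    }
  )
}

# Callers can now target this one failure mode instead of every condition.
bad <- tryCatch(
  check_regex("("),
  invalid_regex_warning = function(w) w$pattern
)
```

With a class attached by R itself, the inner tryCatch(condition=) would not
be needed and unrelated conditions would pass through untouched.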

On Mon, Oct 9, 2023, 11:30 PM Tomas Kalibera <tomas.kalibera using gmail.com>
wrote:

>
> On 10/10/23 01:57, Michael Chirico via R-devel wrote:
>
>> It will be useful to package authors trying to validate input which is
>> supposed to be a valid regular expression.
>>
>> As near as I can tell, the only way we can do so now is to run any
>> regex function and check for the warning and/or condition to bubble
>> up:
>>
>> valid_regex <- function(str) {
>>   stopifnot(is.character(str), length(str) == 1L)
>>   !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
>> }
>>
>> That's pretty hefty/inscrutable for such a simple validation. I see a
>> variety of similar approaches in CRAN packages [1], all slightly
>> different. It would be good for R to expose a "canonical" way to run
>> this validation.
>>
>> At root, the problem is that R does not expose the regex compilation
>> routines like 'tre_regcomp', so from the R side we have to resort to
>> hacky approaches.
>
> Hi Michael,
>
> I don't think you need compilation functions for that. If a regular
> expression is found invalid by a specific third party library R uses, the
> library should return an error to R and R should return an error to you,
> and you should probably propagate that to your users. Grepping an empty
> string might work in many cases as a test, but it is probably more portable
> to simply be prepared to propagate such errors from the actual use on real
> inputs. In theory, there could be some optimization for a particular case,
> the checking may not be the same - but that is the same say for compilation
> and checking.
>
>> Things get slightly complicated by encoding/useBytes modes
>> (tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb,
>> tre_regncompb; all in tre.h), but all are already present in other
>> regex routines, so this is doable.
>
> Re encodings: R strings should simply be valid in their encoding. This is
> not just for regular expressions but also for anything else. You shouldn't
> assume that R can handle invalid strings in any reasonable way. Definitely
> you shouldn't try adding invalid strings in tests - behavior with invalid
> strings is unspecified. To test whether a string is valid, there is
> validEnc() (or validUTF8()). But, again, it is probably safest to propagate
> errors from the regular expression R functions (in case the checks differ,
> particularly for non-UTF-8), also, duplicating the encoding checks can be a
> non-trivial overhead.
>
> If there was a strong need to have an automated way to somehow classify
> specifically errors from the regex libraries, perhaps R could attach some
> classes to them when the library reports them.
>
> Tomas
>
>> Exposing a function to compile regular expressions is common in other
>> languages, e.g. Go [2], Python [3], JavaScript [4].
>>
>> [1] https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code
>> [2] https://pkg.go.dev/regexp#Compile
>> [3] https://docs.python.org/3/library/re.html#re.compile
>> [4] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp