[Rd] FR: valid_regex() to test string validity as a regular expression

Tomas Kalibera
Tue Oct 10 08:30:06 CEST 2023


On 10/10/23 01:57, Michael Chirico via R-devel wrote:
> It will be useful to package authors trying to validate input which is
> supposed to be a valid regular expression.
>
> As near as I can tell, the only way we can do so now is to run any
> regex function and check for the warning and/or condition to bubble
> up:
>
> valid_regex <- function(str) {
>    stopifnot(is.character(str), length(str) == 1L)
>    !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
> }
>
> That's pretty hefty/inscrutable for such a simple validation. I see a
> variety of similar approaches in CRAN packages [1], all slightly
> different. It would be good for R to expose a "canonical" way to run
> this validation.
>
> At root, the problem is that R does not expose the regex compilation
> routines like 'tre_regcomp', so from the R side we have to resort to
> hacky approaches.

Hi Michael,

I don't think you need compilation functions for that. If a regular 
expression is found invalid by a specific third party library R uses, 
the library should return an error to R and R should return an error to 
you, and you should probably propagate that to your users. Grepping an 
empty string might work in many cases as a test, but it is probably more 
portable to simply be prepared to propagate such errors from the actual 
use on real inputs. In theory, an optimization for a particular case could 
make the checking on a test input differ from the checking on real input, 
but the same caveat would apply to a separate compilation function.
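
A minimal sketch of that approach, assuming a package function that receives 
a pattern from its user. The name grep_user_pattern() is hypothetical; the 
only point is that the pattern is validated by the actual call on the real 
input, and any error is re-signaled with some context:

```r
# Hypothetical helper: no separate validation step, the real regex call
# is the check. Errors from the regex engine are propagated to the caller.
grep_user_pattern <- function(pattern, x) {
  stopifnot(is.character(pattern), length(pattern) == 1L)
  tryCatch(
    grepl(pattern, x),
    error = function(e) {
      # Re-signal with the offending pattern and the engine's own message.
      stop("could not use pattern '", pattern, "': ",
           conditionMessage(e), call. = FALSE)
    }
  )
}

grep_user_pattern("^a+", c("aa", "b"))   # TRUE FALSE

# An invalid pattern surfaces as an ordinary R error:
res <- try(grep_user_pattern("(", "aa"), silent = TRUE)
inherits(res, "try-error")               # TRUE
```

The regex engine may also emit a warning before the error; the sketch leaves 
that warning alone rather than suppressing it.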

> Things get slightly complicated by encoding/useBytes modes
> (tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb,
> tre_regncompb; all in tre.h), but all are already present in other
> regex routines, so this is doable.

Re encodings: R strings should simply be valid in their encoding. This 
is not just for regular expressions but also for anything else. You 
shouldn't assume that R can handle invalid strings in any reasonable 
way. You definitely shouldn't try adding invalid strings in tests: 
behavior with invalid strings is unspecified. To test whether a string 
is valid, there is validEnc() (or validUTF8()). But, again, it is 
probably safest to propagate errors from R's regular expression 
functions (in case the checks differ, particularly for non-UTF-8). Also, 
duplicating the encoding checks can add non-trivial overhead.
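
For illustration, this is roughly how the two validity checks behave. The 
variable names are made up for the example, and the invalid byte sequence is 
built only to show what the check reports, not as something to feed to other 
functions:

```r
# Valid strings: one UTF-8 string with a non-ASCII character, one ASCII.
x <- c("caf\u00e9", "abc")
validUTF8(x)   # TRUE TRUE
validEnc(x)    # TRUE TRUE (checked against each string's declared encoding)

# Two bytes that cannot occur in valid UTF-8, used only to show the check.
bad <- rawToChar(as.raw(c(0xff, 0xfe)))
validUTF8(bad) # FALSE
```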

If there were a strong need for an automated way to classify errors 
coming specifically from the regex libraries, perhaps R could attach 
some classes to them when the library reports them.
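
To make the idea concrete, here is a sketch of what calling code could look 
like with a classed error. R does not currently attach such a class; the 
class name "invalidRegexError" and the wrapper with_regex_class() are 
assumptions made up for this example, and the wrapper emulates the class in 
user code:

```r
# Hypothetical wrapper: re-signal any error raised by 'expr' with an extra
# class. It cannot tell regex-library errors apart from other errors,
# which is why the class would ideally be attached by R itself.
with_regex_class <- function(expr) {
  tryCatch(
    expr,
    error = function(e) {
      stop(errorCondition(conditionMessage(e),
                          class = "invalidRegexError",  # hypothetical class
                          call = conditionCall(e)))
    }
  )
}

# Callers can then handle that class specifically:
result <- tryCatch(
  with_regex_class(grepl("(", "abc")),
  invalidRegexError = function(e) "bad pattern"
)
result   # "bad pattern"
```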

Tomas

> Exposing a function to compile regular expressions is common in other
> languages, e.g. Go [2], Python [3], JavaScript [4].
>
> [1]https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code
> [2]https://pkg.go.dev/regexp#Compile
> [3]https://docs.python.org/3/library/re.html#re.compile
> [4]https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp