[R] Best way to test for numeric digits?

Thu Oct 19 02:45:09 CEST 2023

This seems unnecessarily complex.  Or rather,
it pushes the complexity into an arcane notation.
What we really want is something that says "here is a string,
here is a pattern, give me all the substrings that match."
What we're given is a function that tells us where those
substrings are.

# greg.matches(pattern, text)
# accepts a POSIX regular expression, pattern
# and a text to search in.  Both arguments must be character strings
# (length(...) = 1) not longer vectors of strings.
# It returns a character vector of all the (non-overlapping)
# substrings of text as determined by gregexpr.

greg.matches <- function (pattern, text) {
    if (length(pattern) > 1) stop("pattern has too many elements")
    if (length(text)    > 1) stop(   "text has too many elements")
    match.info <- gregexpr(pattern, text)
    starts <- match.info[[1]]
    # gregexpr signals "no match" with a start of -1; return no substrings
    if (isTRUE(starts[1] == -1L)) return(character(0))
    stops <- attr(starts, "match.length") - 1 + starts
    # substring() is vectorised over the start and stop positions
    substring(text, starts, stops)
}
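
As an aside, base R's regmatches() extracts the matched substrings
from gregexpr() output directly, so a helper like the one above can
probably be reduced to a thin wrapper.  A minimal sketch; the name
greg.matches2 is made up here for illustration:

```r
# greg.matches2 is a hypothetical name, not part of base R.
greg.matches2 <- function (pattern, text) {
    stopifnot(length(pattern) == 1, length(text) == 1)
    # regmatches() returns a list with one character vector per
    # element of text; text has one element, so take the first.
    regmatches(text, gregexpr(pattern, text))[[1]]
}

greg.matches2("[A-Z][a-z]*[0-9]*", "CCl3F")
# [1] "C"   "Cl3" "F"
```

When nothing matches, this returns character(0).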

Given greg.matches, we can do the rest with very simple
and easily comprehended regular expressions.

# parse.chemical(formula)
# takes a simple chemical formula "<element><count>..." and
# returns a list with components
# $elements -- character -- the atom symbols
# $counts   -- number    -- the counts (missing counts taken as 1).
# BEWARE.  This does not handle formulas like "CH(OH)3".

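parse.chemical <- function (formula) {
    # each part is an element symbol with an optional count, e.g. "Cl3"
    parts <- greg.matches("[A-Z][a-z]*[0-9]*", formula)
    # drop the digits to get the symbol, drop the non-digits to get the count
    elements <- gsub("[0-9]+", "", parts)
    counts <- as.numeric(gsub("[^0-9]+", "", parts))
    # as.numeric("") is NA (no warning), so a missing count becomes 1
    counts[is.na(counts)] <- 1
    list(elements=elements, counts=counts)
}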
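
Because parenthesised formulas are not handled, it may be worth
checking the input before parsing it.  The following is only a
sketch: is.simple.formula is a hypothetical helper, not something
from the code above, and it assumes that "simple" means the whole
string is a run of "<element><count>" pieces.

```r
# is.simple.formula is a hypothetical helper for illustration.
# TRUE when the entire string consists of "<element><count>" pieces,
# using the same piece pattern as parse.chemical.
is.simple.formula <- function (formula) {
    grepl("^([A-Z][a-z]*[0-9]*)+$", formula)
}

is.simple.formula(c("CCl3F", "Li4Al4H16", "CH(OH)3"))
# [1]  TRUE  TRUE FALSE
```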

> parse.chemical("CCl3F")
$elements
[1] "C"  "Cl" "F"

$counts
[1] 1 3 1

> parse.chemical("Li4Al4H16")
$elements
[1] "Li" "Al" "H"

$counts
[1]  4  4 16

> parse.chemical("CCl2CO2AlPO4SiO4Cl")
$elements
 [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

$counts
 [1] 1 2 1 2 1 1 4 1 4 1
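
On the question as literally asked, one way to test for strings of
digits without suppressWarnings is to match a pattern rather than
attempt a conversion.  A small sketch, assuming "numeric digits"
means one or more of 0-9 and nothing else (no signs, decimal points
or exponents):

```r
x <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")

# TRUE where the whole string consists only of the digits 0-9
is.digits <- grepl("^[0-9]+$", x)
is.digits
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

# keep only the non-numeric pieces
x[!is.digits]
# [1] "Li" "Na" "K"  "Rb" "Ca"
```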

On Thu, 19 Oct 2023 at 03:59, Leonard Mada via R-help <r-help using r-project.org>
wrote:

> Dear List members,
>
> What is the best way to test for numeric digits?
>
> suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
> # [1] NA NA NA  2 NA NA  3
> The above requires the use of the suppressWarnings function. Are there
> any better ways?
>
> I was working to extract chemical elements from a formula, something
> like this:
> split.symbol.character = function(x, rm.digits = TRUE) {
>      # Perl is partly broken in R 4.3, but this works:
>      regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>      # stringi::stri_split(x, regex = regex);
>      s = strsplit(x, regex, perl = TRUE);
>      if(rm.digits) {
>          s = lapply(s, function(s) {
>              isNotD = is.na(suppressWarnings(as.numeric(s)));
>              s = s[isNotD];
>          });
>      }
>      return(s);
> }
>
> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
>
>
> Sincerely,
>
>
> Leonard
>
>
> Note:
> # works:
> regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>
>
> # broken in R 4.3.1
> # only slightly "erroneous" with stringi::stri_split
> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>