[R] Best way to test for numeric digits?
Richard O'Keefe
r@oknz @end|ng |rom gm@||@com
Thu Oct 19 02:45:09 CEST 2023
This seems unnecessarily complex. Or rather,
it pushes the complexity into an arcane notation.
What we really want is something that says "here is a string,
here is a pattern, give me all the substrings that match."
What we're given is a function that tells us where those
substrings are.
# greg.matches(pattern, text)
# accepts a POSIX regular expression, pattern,
# and a text to search in. Both arguments must be single character
# strings (length(...) == 1), not longer vectors of strings.
# It returns a character vector of all the (non-overlapping)
# substrings of text that match pattern, as located by gregexpr,
# or character(0) if nothing matches.
greg.matches <- function (pattern, text) {
    if (length(pattern) != 1) stop("pattern must be a single string")
    if (length(text)    != 1) stop("text must be a single string")
    match.info <- gregexpr(pattern, text)
    starts <- match.info[[1]]
    # gregexpr signals "no match" with a single start position of -1.
    if (starts[1] == -1) return(character(0))
    stops <- attr(starts, "match.length") - 1 + starts
    # substring() is vectorised over its start and stop arguments.
    substring(text, starts, stops)
}
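For comparison, base R already has a function in this spirit:
regmatches() takes the match data returned by gregexpr() and hands
back the matched substrings. What follows is only a small sketch of
that equivalence on made-up inputs; the variable names are mine and
are not part of the function above.

```r
# Sketch: extract all matches with base R's regmatches().
# 'text' and 'pattern' are illustrative values only.
text    <- "Li4Al4H16"
pattern <- "[0-9]+"

# gregexpr() returns a list with one element per input string;
# regmatches() turns it into a list of character vectors.
parts <- regmatches(text, gregexpr(pattern, text))[[1]]
print(parts)   # "4" "4" "16"

# When nothing matches, the result is an empty character vector.
none <- regmatches("LiAlH", gregexpr(pattern, "LiAlH"))[[1]]
print(length(none))   # 0
```

Whether a thin wrapper like greg.matches or the regmatches() idiom
reads better is a matter of taste; the wrapper gives the operation a
name and checks its arguments.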
Given greg.matches, we can do the rest with very simple
and easily comprehended regular expressions.
# parse.chemical(formula)
# takes a simple chemical formula "<element><count>..." and
# returns a list with components
# $elements -- character -- the atom symbols
# $counts -- number -- the counts (missing counts taken as 1).
# BEWARE. This does not handle formulas like "CH(OH)3".
parse.chemical <- function (formula) {
    parts <- greg.matches("[A-Z][a-z]*[0-9]*", formula)
    elements <- gsub("[0-9]+", "", parts)
    counts <- as.numeric(gsub("[^0-9]+", "", parts))
    counts <- ifelse(is.na(counts), 1, counts)
    list(elements=elements, counts=counts)
}
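Because the pattern simply skips characters it does not recognise, a
formula such as "CH(OH)3" is parsed without complaint, just wrongly.
One possible guard, offered as a sketch only, is to check up front
that the whole string consists of <element><count> tokens. The helper
name is.simple.formula is hypothetical and not part of the code above.

```r
# Hypothetical helper: TRUE when a formula is made up solely of
# <element><count> tokens (capital letter, optional lower-case
# letters, optional digits), FALSE otherwise.
is.simple.formula <- function (formula) {
    grepl("^([A-Z][a-z]*[0-9]*)+$", formula)
}

ok <- is.simple.formula(c("CCl3F", "Li4Al4H16", "CH(OH)3"))
print(ok)   # TRUE TRUE FALSE
```

A caller could stop() with a clear message when the check fails
instead of returning a misleading parse.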
> parse.chemical("CCl3F")
$elements
[1] "C" "Cl" "F"
$counts
[1] 1 3 1
> parse.chemical("Li4Al4H16")
$elements
[1] "Li" "Al" "H"
$counts
[1] 4 4 16
> parse.chemical("CCl2CO2AlPO4SiO4Cl")
$elements
[1] "C" "Cl" "C" "O" "Al" "P" "O" "Si" "O" "Cl"
$counts
[1] 1 2 1 2 1 1 4 1 4 1
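As for the literal question quoted below (testing strings for numeric
digits without suppressWarnings), a regular expression can do that
directly. This is a sketch that assumes "numeric" means "one or more
ASCII digits and nothing else"; unlike as.double(), it rejects strings
such as "1e3", "-2" or "1.5". The variable names are mine.

```r
# Sketch: flag the all-digit strings with grepl(); no warnings arise
# because nothing is coerced to a number.
x <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")

is.digits <- grepl("^[0-9]+$", x)
print(is.digits)       # FALSE FALSE FALSE TRUE FALSE FALSE TRUE

# Keep only the non-numeric entries.
symbols <- x[!is.digits]
print(symbols)         # "Li" "Na" "K" "Rb" "Ca"
```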
On Thu, 19 Oct 2023 at 03:59, Leonard Mada via R-help <r-help using r-project.org>
wrote:
> Dear List members,
>
> What is the best way to test for numeric digits?
>
> suppressWarnings(as.double(c("Li", "Na", "K", "2", "Rb", "Ca", "3")))
> # [1] NA NA NA 2 NA NA 3
> The above requires the use of the suppressWarnings function. Are there
> any better ways?
>
> I was working to extract chemical elements from a formula, something
> like this:
> split.symbol.character = function(x, rm.digits = TRUE) {
>     # Perl is partly broken in R 4.3, but this works:
>     regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>     # stringi::stri_split(x, regex = regex);
>     s = strsplit(x, regex, perl = TRUE);
>     if(rm.digits) {
>         s = lapply(s, function(s) {
>             isNotD = is.na(suppressWarnings(as.numeric(s)));
>             s = s[isNotD];
>         });
>     }
>     return(s);
> }
>
> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
>
>
> Sincerely,
>
>
> Leonard
>
>
> Note:
> # works:
> regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>
>
> # broken in R 4.3.1
> # only slightly "erroneous" with stringi::stri_split
> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>