[R] Mixed sorting/ordering of strings acknowledging roman numerals?
Henrik Bengtsson
hb at biostat.ucsf.edu
Mon Sep 8 04:40:04 CEST 2014
Thank you David - it took me awhile to get back to this and dig into
it. It's clever to imitate gtools::mixedorder() as far as possible.
A few comments:
1. It took me a while to understand why you picked 3899 in your
Roman-to-integer table; it's because roman(x) is NA for x > 3899.
(BTW, in 'utils', there's utils:::.roman2numeric() which could be
utilized, but it's currently internal.)
2. I think you forgot D=500 and M=1000.
3. There was a typo in your code; I think you meant rank.roman instead
of rank.numeric in one place.
4. The idea behind nonnumeric() is to identify non-numeric substrings
by is.na(as.numeric()). Unfortunately, for romans that does not work.
Instead, we need to use is.na(numeric(x)) here, i.e.
nonnumeric <- function(x) {
suppressWarnings(ifelse(is.na(numeric(x)), toupper(x), NA))
}
Actually, gtools::mixedorder() could use the same.
5. I undid your ".numeric" to ".roman" to minimize any differences to
gtools::mixedorder().
With the above fixes, we now have:
mixedorderRoman <- function (x)
{
if (length(x) < 1)
return(NULL)
else if (length(x) == 1)
return(1)
if (is.numeric(x))
return(order(x))
delim = "\\$\\@\\$"
# NOTE: Note that as.roman(x) is NA for x > 3899
romanC <- as.character( as.roman(1:3899) )
numeric <- function(x) {
suppressWarnings(match(x, romanC))
}
nonnumeric <- function(x) {
suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
NA))
}
x <- as.character(x)
which.nas <- which(is.na(x))
which.blanks <- which(x == "")
if (length(which.blanks) > 0)
x[which.blanks] <- -Inf
if (length(which.nas) > 0)
x[which.nas] <- Inf
delimited <- gsub("([IVXCLM]+)",
paste(delim, "\\1", delim, sep = ""), x)
step1 <- strsplit(delimited, delim)
step1 <- lapply(step1, function(x) x[x > ""])
step1.numeric <- lapply(step1, numeric)
step1.character <- lapply(step1, nonnumeric)
maxelem <- max(sapply(step1, length))
step1.numeric.t <- lapply(1:maxelem, function(i) sapply(step1.numeric,
function(x) x[i]))
step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
function(x) x[i]))
rank.numeric <- sapply(step1.numeric.t, rank)
rank.character <- sapply(step1.character.t, function(x)
as.numeric(factor(x)))
rank.numeric[!is.na(rank.character)] <- 0
rank.character <- t(t(rank.character) + apply(matrix(rank.numeric),
2, max, na.rm = TRUE))
rank.overall <- ifelse(is.na(rank.character), rank.numeric,
rank.character)
order.frame <- as.data.frame(rank.overall)
if (length(which.nas) > 0)
order.frame[which.nas, ] <- Inf
retval <- do.call("order", order.frame)
return(retval)
}
The difference to gtools::mixedorder() is minimal:
< romanC <- as.character( as.roman(1:3899) )
21c11
< suppressWarnings(match(x, romanC))
---
> suppressWarnings(as.numeric(x))
24c14
< suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
---
> suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
34c24
< delimited <- gsub("([IVXCLDM]+)",
---
> delimited <- gsub("([+-]{0,1}[0-9]+\\.{0,1}[0-9]*([eE][\\+\\-]{0,1}[0-9]+\\.{0,1}[0-9]*){0,1})",
59,62d48
This difference is so small that the above could now be an option to
mixedorder() with minimal overhead added, e.g. mixedorder(y,
type=c("decimal", "roman")). One could even imagine adding support
for "binary", "octal" and "hexadecimal" (not done).
Greg (maintainer of gtools; cc:ed), is this something you would
consider adding to gtools? I've modified the gtools source code
available on CRAN (that's the only source I found), added package
tests, updated the Rd and verified it passes R CMD check. If
interested, please find the updates at:
https://github.com/HenrikBengtsson/gtools/compare/cran:master...master
Thanks
Henrik
On Tue, Aug 26, 2014 at 6:46 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Aug 26, 2014, at 5:24 PM, Henrik Bengtsson wrote:
>
>> Hi,
>>
>> does anyone know of an implementation/function that sorts strings that
>> *contain* roman numerals (I, II, III, IV, V, ...) which are treated as
>> numbers. In 'gtools' there is mixedsort() which does this for strings
>> that contains (decimal) numbers. I'm looking for a "mixedsortroman()"
>> function that does the same but with roman numbers, e.g.
>
> It's pretty easy to sort something you know to be congruent with the existing roman class:
>
> romanC <- as.character( as.roman(1:3899) )
> match(c("I", "II", "III","X","V"), romanC)
> #[1] 1 2 3 10 5
>
> But I guess you already know that, so you want a regex approach to parsing. Looking at the path taken by Warnes, it would involve doing something like his regex based insertion of a delimiter for "Roman numeral" but simpler because he needed to deal with decimal points and signs and exponent notation, none of which you appear to need. If you only need to consider character and Roman, then this hack of Warnes tools succeeds:
>
> mixedorderRoman <- function (x)
> {
> if (length(x) < 1)
> return(NULL)
> else if (length(x) == 1)
> return(1)
> if (is.numeric(x))
> return(order(x))
> delim = "\\$\\@\\$"
> roman <- function(x) {
> suppressWarnings(match(x, romanC))
> }
> nonnumeric <- function(x) {
> suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
> NA))
> }
> x <- as.character(x)
> which.nas <- which(is.na(x))
> which.blanks <- which(x == "")
> if (length(which.blanks) > 0)
> x[which.blanks] <- -Inf
> if (length(which.nas) > 0)
> x[which.nas] <- Inf
> delimited <- gsub("([IVXCL]+)",
> paste(delim, "\\1", delim, sep = ""), x)
> step1 <- strsplit(delimited, delim)
> step1 <- lapply(step1, function(x) x[x > ""])
> step1.roman <- lapply(step1, roman)
> step1.character <- lapply(step1, nonnumeric)
> maxelem <- max(sapply(step1, length))
> step1.roman.t <- lapply(1:maxelem, function(i) sapply(step1.roman,
> function(x) x[i]))
> step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
> function(x) x[i]))
> rank.roman <- sapply(step1.roman.t, rank)
> rank.character <- sapply(step1.character.t, function(x) as.numeric(factor(x)))
> rank.roman[!is.na(rank.character)] <- 0
> rank.character <- t(t(rank.character) + apply(matrix(rank.roman),
> 2, max, na.rm = TRUE))
> rank.overall <- ifelse(is.na(rank.character), rank.numeric,
> rank.character)
> order.frame <- as.data.frame(rank.overall)
> if (length(which.nas) > 0)
> order.frame[which.nas, ] <- Inf
> retval <- do.call("order", order.frame)
> return(retval)
> }
>
> y[mixedorderRoman(y)]
> [1] "chr I" "chr II" "chr III" "chr IV" "chr IX"
> [6] "chr V" "chr VI" "chr VII" "chr VIII" "chr X"
> [11] "chr XI" "chr XII"
>
>
> --
> David.
>>
>> ## DECIMAL NUMBERS
>>> x <- sprintf("chr %d", 12:1)
>>> x
>> [1] "chr 12" "chr 11" "chr 10" "chr 9" "chr 8"
>> [6] "chr 7" "chr 6" "chr 5" "chr 4" "chr 3"
>> [11] "chr 2" "chr 1"
>>
>>> sort(x)
>> [1] "chr 1" "chr 10" "chr 11" "chr 12" "chr 2"
>> [6] "chr 3" "chr 4" "chr 5" "chr 6" "chr 7"
>> [11] "chr 8" "chr 9"
>>
>>> gtools::mixedsort(x)
>> [1] "chr 1" "chr 2" "chr 3" "chr 4" "chr 5"
>> [6] "chr 6" "chr 7" "chr 8" "chr 9" "chr 10"
>> [11] "chr 11" "chr 12"
>>
>>
>> ## ROMAN NUMBERS
>>> y <- sprintf("chr %s", as.roman(12:1))
>>> y
>> [1] "chr XII" "chr XI" "chr X" "chr IX"
>> [5] "chr VIII" "chr VII" "chr VI" "chr V"
>> [9] "chr IV" "chr III" "chr II" "chr I"
>>
>>> sort(y)
>> [1] "chr I" "chr II" "chr III" "chr IV"
>> [5] "chr IX" "chr V" "chr VI" "chr VII"
>> [9] "chr VIII" "chr X" "chr XI" "chr XII"
>>
>>> mixedsortroman(y)
>> [1] "chr I" "chr II" "chr III" "chr IV"
>> [5] "chr V" "chr VI" "chr VII" "chr VIII"
>> [9] "chr IX" "chr X" "chr XI" "chr XII"
>>
>> The latter is what I'm looking for.
>>
>> Before hacking together something myself (e.g. identify roman numerals
>> substrings, translate them to decimal numbers, use gtools::mixedsort()
>> to sort them and then translate them back to roman numbers), I'd like
>> to hear if someone already has this implemented/know of a package that
>> does this.
>>
>> Thanks,
>>
>> Henrik
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
>
More information about the R-help
mailing list