[R] Mixed sorting/ordering of strings acknowledging roman numerals?

Mon Sep 8 05:46:55 CEST 2014

On Sep 7, 2014, at 7:40 PM, Henrik Bengtsson wrote:

> Thank you David - it took me awhile to get back to this and dig into
> it.  It's clever to imitate gtools::mixedorder() as far as possible.
> A few comments:
> 
> 1. It took me a while to understand why you picked 3899 in your
> Roman-to-integer table; it's because roman(x) is NA for x > 3899.
> (BTW, in 'utils', there's utils:::.roman2numeric() which could be
> utilized, but it's currently internal.)

Yes, that was the reason. I didn't think I needed a Roman-to-numeric function because I discovered the roman numbers were actually simple numeric vectors to which a class had been assigned and that it was the class-facilities that did all the work. The standard Ops functions were just acting on numeric vectors.

If one doesn't take care, their "romanity" can be lost:

> R <- as.roman(10^(0:4))
> R
[1] I    X    C    M    <NA>

> unclass(R)
[1]    1   10  100 1000   NA

> sum(R, na.rm=TRUE)
[1] 1111
> as.roman(sum(R, na.rm=TRUE))
[1] MCXI

> 
> 2. I think you forgot D=500 and M=1000.

Quite possible. I suspect Greg will have corrected the omission, but if not, this will be helpful to him.

> 
> 3. There was a typo in your code; I think you meant rank.roman instead
> of rank.numeric in one place.
> 

I understood Greg's intention to wrap this into the mixedorder and mixed sort duo.

Best;
David.

> 4. The idea behind nonnumeric() is to identify non-numeric substrings
> by is.na(as.numeric()).  Unfortunately, for romans that does not work.
> Instead, we need to use is.na(numeric(x)) here, i.e.
> 
>  nonnumeric <- function(x) {
>      suppressWarnings(ifelse(is.na(numeric(x)), toupper(x), NA))
>  }
> 
> Actually, gtools::mixedorder() could use the same.
> 
> 5. I undid your ".numeric" to ".roman" to minimize any differences to
> gtools::mixedorder().
> 
> 
> With the above fixes, we now have:
> 
> mixedorderRoman <- function (x)
> {
>    if (length(x) < 1)
>        return(NULL)
>    else if (length(x) == 1)
>        return(1)
>    if (is.numeric(x))
>        return(order(x))
>    delim = "\\$\\@\\$"
>    # NOTE: Note that as.roman(x) is NA for x > 3899
>    romanC <- as.character( as.roman(1:3899) )
>    numeric <- function(x) {
>        suppressWarnings(match(x, romanC))
>    }
>    nonnumeric <- function(x) {
>        suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
>            NA))
>    }
>    x <- as.character(x)
>    which.nas <- which(is.na(x))
>    which.blanks <- which(x == "")
>    if (length(which.blanks) > 0)
>        x[which.blanks] <- -Inf
>    if (length(which.nas) > 0)
>        x[which.nas] <- Inf
>    delimited <- gsub("([IVXCLM]+)",
>        paste(delim, "\\1", delim, sep = ""), x)
>    step1 <- strsplit(delimited, delim)
>    step1 <- lapply(step1, function(x) x[x > ""])
>    step1.numeric <- lapply(step1, numeric)
>    step1.character <- lapply(step1, nonnumeric)
>    maxelem <- max(sapply(step1, length))
>    step1.numeric.t <- lapply(1:maxelem, function(i) sapply(step1.numeric,
>        function(x) x[i]))
>    step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
>        function(x) x[i]))
>    rank.numeric <- sapply(step1.numeric.t, rank)
>    rank.character <- sapply(step1.character.t, function(x)
> as.numeric(factor(x)))
>    rank.numeric[!is.na(rank.character)] <- 0
>    rank.character <- t(t(rank.character) + apply(matrix(rank.numeric),
>        2, max, na.rm = TRUE))
>    rank.overall <- ifelse(is.na(rank.character), rank.numeric,
>        rank.character)
>    order.frame <- as.data.frame(rank.overall)
>    if (length(which.nas) > 0)
>        order.frame[which.nas, ] <- Inf
>    retval <- do.call("order", order.frame)
>    return(retval)
> }
> 
> 
> The difference to gtools::mixedorder() is minimal:
> 
> <     romanC <- as.character( as.roman(1:3899) )
> 21c11
> <         suppressWarnings(match(x, romanC))
> ---
>>        suppressWarnings(as.numeric(x))
> 24c14
> <         suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
> ---
>>        suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
> 34c24
> <     delimited <- gsub("([IVXCLDM]+)",
> ---
>>    delimited <- gsub("([+-]{0,1}[0-9]+\\.{0,1}[0-9]*([eE][\\+\\-]{0,1}[0-9]+\\.{0,1}[0-9]*){0,1})",
> 59,62d48
> 
> This difference is so small that the above could now be an option to
> mixedorder() with minimal overhead added, e.g. mixedorder(y,
> type=c("decimal", "roman")).  One could even imagine adding support
> for "binary", "octal" and "hexadecimal" (not done).
> 
> Greg (maintainer of gtools; cc:ed), is this something you would
> consider adding to gtools?  I've modified the gtools source code
> available on CRAN (that's the only source I found), added package
> tests, updated the Rd and verified it passes R CMD check.  If
> interested, please find the updates at:
> 
>  https://github.com/HenrikBengtsson/gtools/compare/cran:master...master
> 
> Thanks
> 
> Henrik
> 
> On Tue, Aug 26, 2014 at 6:46 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>> 
>> On Aug 26, 2014, at 5:24 PM, Henrik Bengtsson wrote:
>> 
>>> Hi,
>>> 
>>> does anyone know of an implementation/function that sorts strings that
>>> *contain* roman numerals (I, II, III, IV, V, ...) which are treated as
>>> numbers.  In 'gtools' there is mixedsort() which does this for strings
>>> that contains (decimal) numbers.  I'm looking for a "mixedsortroman()"
>>> function that does the same but with roman numbers, e.g.
>> 
>> It's pretty easy to sort something you know to be congruent with the existing roman class:
>> 
>> romanC <- as.character( as.roman(1:3899) )
>> match(c("I", "II", "III","X","V"), romanC)
>> #[1]  1  2  3 10  5
>> 
>> But I guess you already know that, so you want a regex approach to parsing. Looking at the path taken by Warnes, it would involve doing something like his regex based insertion of a delimiter for "Roman numeral" but simpler because he needed to deal with decimal points and signs and exponent notation, none of which you appear to need. If you only need to consider character and Roman, then this hack of Warnes tools succeeds:
>> 
>> mixedorderRoman <- function (x)
>> {
>>    if (length(x) < 1)
>>        return(NULL)
>>    else if (length(x) == 1)
>>        return(1)
>>    if (is.numeric(x))
>>        return(order(x))
>>    delim = "\\$\\@\\$"
>>    roman <- function(x) {
>>        suppressWarnings(match(x, romanC))
>>    }
>>    nonnumeric <- function(x) {
>>        suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
>>            NA))
>>    }
>>    x <- as.character(x)
>>    which.nas <- which(is.na(x))
>>    which.blanks <- which(x == "")
>>    if (length(which.blanks) > 0)
>>        x[which.blanks] <- -Inf
>>    if (length(which.nas) > 0)
>>        x[which.nas] <- Inf
>>    delimited <- gsub("([IVXCL]+)",
>>        paste(delim, "\\1", delim, sep = ""), x)
>>    step1 <- strsplit(delimited, delim)
>>    step1 <- lapply(step1, function(x) x[x > ""])
>>    step1.roman <- lapply(step1, roman)
>>    step1.character <- lapply(step1, nonnumeric)
>>    maxelem <- max(sapply(step1, length))
>>    step1.roman.t <- lapply(1:maxelem, function(i) sapply(step1.roman,
>>        function(x) x[i]))
>>    step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
>>        function(x) x[i]))
>>    rank.roman <- sapply(step1.roman.t, rank)
>>    rank.character <- sapply(step1.character.t, function(x) as.numeric(factor(x)))
>>    rank.roman[!is.na(rank.character)] <- 0
>>    rank.character <- t(t(rank.character) + apply(matrix(rank.roman),
>>        2, max, na.rm = TRUE))
>>    rank.overall <- ifelse(is.na(rank.character), rank.numeric,
>>        rank.character)
>>    order.frame <- as.data.frame(rank.overall)
>>    if (length(which.nas) > 0)
>>        order.frame[which.nas, ] <- Inf
>>    retval <- do.call("order", order.frame)
>>    return(retval)
>> }
>> 
>> y[mixedorderRoman(y)]
>> [1] "chr I"    "chr II"   "chr III"  "chr IV"   "chr IX"
>> [6] "chr V"    "chr VI"   "chr VII"  "chr VIII" "chr X"
>> [11] "chr XI"   "chr XII"
>> 
>> 
>> --
>> David.
>>> 
>>> ## DECIMAL NUMBERS
>>>> x <- sprintf("chr %d", 12:1)
>>>> x
>>> [1] "chr 12" "chr 11" "chr 10" "chr 9"  "chr 8"
>>> [6] "chr 7"  "chr 6"  "chr 5"  "chr 4"  "chr 3"
>>> [11] "chr 2"  "chr 1"
>>> 
>>>> sort(x)
>>> [1] "chr 1"  "chr 10" "chr 11" "chr 12" "chr 2"
>>> [6] "chr 3"  "chr 4"  "chr 5"  "chr 6"  "chr 7"
>>> [11] "chr 8"  "chr 9"
>>> 
>>>> gtools::mixedsort(x)
>>> [1] "chr 1"  "chr 2"  "chr 3"  "chr 4"  "chr 5"
>>> [6] "chr 6"  "chr 7"  "chr 8"  "chr 9"  "chr 10"
>>> [11] "chr 11" "chr 12"
>>> 
>>> 
>>> ## ROMAN NUMBERS
>>>> y <- sprintf("chr %s", as.roman(12:1))
>>>> y
>>> [1] "chr XII"  "chr XI"   "chr X"    "chr IX"
>>> [5] "chr VIII" "chr VII"  "chr VI"   "chr V"
>>> [9] "chr IV"   "chr III"  "chr II"   "chr I"
>>> 
>>>> sort(y)
>>> [1] "chr I"    "chr II"   "chr III"  "chr IV"
>>> [5] "chr IX"   "chr V"    "chr VI"   "chr VII"
>>> [9] "chr VIII" "chr X"    "chr XI"   "chr XII"
>>> 
>>>> mixedsortroman(y)
>>> [1] "chr I"    "chr II"   "chr III"  "chr IV"
>>> [5] "chr V"    "chr VI"   "chr VII"  "chr VIII"
>>> [9] "chr IX"   "chr X"    "chr XI"   "chr XII"
>>> 
>>> The latter is what I'm looking for.
>>> 
>>> Before hacking together something myself (e.g. identify roman numerals
>>> substrings, translate them to decimal numbers, use gtools::mixedsort()
>>> to sort them and then translate them back to roman numbers), I'd like
>>> to hear if someone already has this implemented/know of a package that
>>> does this.
>>> 
>>> Thanks,
>>> 
>>> Henrik
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> David Winsemius
>> Alameda, CA, USA
>> 

David Winsemius
Alameda, CA, USA