[R] sorting variable names containing digits

Gabor Grothendieck ggrothendieck at gmail.com
Mon Dec 22 05:28:51 CET 2008


Another possibility is to use strapply from the gsubfn package, which
gives a shorter, non-recursive solution:

library(gsubfn)

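mysort2 <- function(s) {
	# split each string into digit (x) and non-digit (y) runs;
	# digit runs are left-padded with spaces to width 9
	L <- strapply(s, "([0-9]+)|([^0-9]+)",
		~ if (nchar(x)) sprintf("%9d", as.numeric(x)) else y)
	# one row per string, NA-padded where a string has fewer components
	L2 <- t(do.call(cbind, lapply(L, ts)))
	L3 <- replace(L2, is.na(L2), "")
	# order by the first component, then the second, and so on
	ord <- do.call(order, as.data.frame(L3, stringsAsFactors = FALSE))
	s[ord]
}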


First, strapply breaks each string into a character vector of its numeric
and non-numeric components.  Each numeric component is left-padded with
spaces using sprintf so that all are 9 characters wide.  The next line turns
the resulting list into a matrix, L2, with one row per string, padded with
NA where a string has fewer components; we then replace the NAs with empty
strings, giving L3.  Finally we order the rows column by column and apply
the ordering, ord, to s to get the sorted version.
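
To see why the padding matters, here is a small base R sketch (not part of
the gsubfn solution; the variable names are made up for illustration).
Compared as plain strings, "10" sorts before "2", but once both are
right-justified to a common width the string order agrees with the numeric
order.  method = "radix" is used only so that the comparison does not depend
on the locale.

```r
nums <- c("2", "10", "1")

# Plain string comparison goes character by character, so "10" comes before "2".
plain <- sort(nums, method = "radix")

# Right-justify to width 9 as in mysort2; a space sorts before any digit,
# so shorter numbers now come first.
padded <- sort(sprintf("%9d", as.integer(nums)), method = "radix")

print(plain)           # "1"  "10" "2"
print(trimws(padded))  # "1"  "2"  "10"
```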

The gsubfn home page is at:
http://gsubfn.googlecode.com

Here is some sample output:

> mysort2(s)
 [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
 [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
[13] "y10a10"
> mysort(s)
 [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
 [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
[13] "y10a10"

> mysort2(t)
[1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
> mysort(t)
[1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
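
For readers without gsubfn, the same split, pad and order steps can be
sketched in base R.  This is only an illustration of the idea, not the code
that produced the output above: natural_sort is a hypothetical name,
regmatches/gregexpr stand in for strapply, and method = "radix" (R >= 3.3.0)
is an added assumption so that the ordering does not depend on the locale.
Like mysort2, it assumes that no run of digits is too long to fit in the
9-character field.

```r
natural_sort <- function(s) {
  # Split each string into alternating runs of digits and non-digits.
  tokens <- regmatches(s, gregexpr("[0-9]+|[^0-9]+", s))

  # Left-pad every digit run with spaces to width 9; leave other runs alone.
  padded <- lapply(tokens, function(tok) {
    is_num <- grepl("^[0-9]+$", tok)
    tok[is_num] <- sprintf("%9d", as.integer(tok[is_num]))
    tok
  })

  # Build one sort key per component position, using "" where a string
  # has fewer components than the longest one.
  n <- max(1L, lengths(padded))
  keys <- lapply(seq_len(n), function(i) {
    vapply(padded,
           function(tok) if (i <= length(tok)) tok[i] else "",
           character(1))
  })

  # Order by the first key, then the second, and so on (C-locale comparison).
  ord <- do.call(order, c(keys, list(method = "radix")))
  s[ord]
}

# The two test vectors from the original question (tt is used instead of t
# to avoid masking base::t).
s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
       "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
tt <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")

print(natural_sort(s))
print(natural_sort(tt))
```

On these two vectors the sketch should give the same ordering as mysort2
and mysort; it has not been checked against other inputs.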


On Sun, Dec 21, 2008 at 9:57 PM, John Fox <jfox at mcmaster.ca> wrote:
> Dear Gabor,
>
> Thanks for this -- I was unaware of mixedsort(). As you point out,
> however, mixedsort() doesn't cover all of the cases in which I'm
> interested and which are handled by mysort().
>
> Regards,
>  John
>
> On Sun, 21 Dec 2008 20:51:17 -0500
>  "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote:
>> mixedsort in gtools will give the same result as mysort(s) but
>> differs in the case of t.
>>
>> On Sun, Dec 21, 2008 at 8:33 PM, John Fox <jfox at mcmaster.ca> wrote:
>> > Dear r-helpers,
>> >
>> > I'm looking for a way of sorting variable names in a "natural" order, when
>> > the names are composed of digits and other characters. I know that this is a
>> > vague idea, and that sorting character strings is a complex topic, but
>> > perhaps a couple of examples will clarify what I mean:
>> >
>> >> s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
>> > +   "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
>> >
>> >> sort(s)
>> >  [1] "var10a2" "var2"    "x02"     "x02a"    "x02b"    "x1a"
>> >  [7] "x1b"     "y10"     "y10a1"   "y10a10"  "y10a2"   "y1a1"
>> > [13] "y2"
>> >
>> >> mysort(s)
>> >  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
>> >  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
>> > [13] "y10a10"
>> >
>> >> t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")
>> >
>> >> sort(t)
>> > [1] "q10.1.1"  "q10.10.2" "q10.2.1"  "q2.1.1"
>> >
>> >> mysort(t)
>> > [1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
>> >
>> > Here, sort() is the standard R function and mysort() is a replacement, which
>> > sorts the names into the order that seems natural to me, at least in the
>> > cases that I've tried:
>> >
>> > mysort <- function(x){
>> >  sort.helper <- function(x){
>> >    prefix <- strsplit(x, "[0-9]")
>> >    prefix <- sapply(prefix, "[", 1)
>> >    prefix[is.na(prefix)] <- ""
>> >    suffix <- strsplit(x, "[^0-9]")
>> >    suffix <- as.numeric(sapply(suffix, "[", 2))
>> >    suffix[is.na(suffix)] <- -Inf
>> >    remainder <- sub("[^0-9]+", "", x)
>> >    remainder <- sub("[0-9]+", "", remainder)
>> >    if (all (remainder == "")) list(prefix, suffix)
>> >    else c(list(prefix, suffix), Recall(remainder))
>> >    }
>> >  ord <- do.call("order", sort.helper(x))
>> >  x[ord]
>> >   }
>> >
>> > I have a couple of applications in mind, one of which is recognizing
>> > repeated-measures variables in "wide" longitudinal datasets, which often are
>> > named in the form x1, x2, ... , xn.
>> >
>> > mysort(), which works by recursively slicing off pairs of non-digit and
>> > digit strings, seems more complicated than it should have to be, and I
>> > wonder whether anyone has a more elegant solution. I don't think that
>> > efficiency is a serious issue for the applications I'm considering, but of
>> > course a more efficient solution would be of interest.
>> >
>> > Thanks,
>> >  John
>> >
>> > ------------------------------
>> > John Fox, Professor
>> > Department of Sociology
>> > McMaster University
>> > Hamilton, Ontario, Canada
>> > web: socserv.mcmaster.ca/jfox
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>
> --------------------------------
> John Fox, Professor
> Department of Sociology
> McMaster University
> Hamilton, Ontario, Canada
> http://socserv.mcmaster.ca/jfox/
>


