[R] sorting variable names containing digits
John Fox
jfox at mcmaster.ca
Mon Dec 22 06:32:10 CET 2008
Dear Gabor,
Thank you (again) for this second suggestion, which does exactly what I
want. At the risk of appearing ungrateful, and although the judgment is
admittedly subjective, I don't find it simpler than mysort().
For curiosity, I tried some timings of the two functions for the sample
problems that I supplied:
> system.time(for (i in 1:100) mysort(s))
   user  system elapsed
  1.498   0.006   1.503
> system.time(for (i in 1:100) mysort2(s))
   user  system elapsed
  6.026   0.028   6.059
> system.time(for (i in 1:100) mysort(t))
   user  system elapsed
  0.858   0.003   0.874
> system.time(for (i in 1:100) mysort2(t))
   user  system elapsed
  2.736   0.014   2.757
This is on a 2.4 GHz Core 2 Duo MacBook. I don't know of course
whether this generalizes to other problems. I suspect that the
recursive solution will look worse as the number of "components" of the
names increases, but of course names are unlikely to have a large
number of components.
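One way to check how the timing changes with the number of components is sketched below. It is only a rough sketch: make_names() is a hypothetical helper that is not part of the thread, the name pattern ("q" followed by dot-separated numbers) is an assumption modelled on the second example, and mysort() is repeated from the quoted message so that the code runs on its own. If gsubfn is installed, mysort2() could be timed in the same loop.

```r
# mysort() as posted in the quoted message below, repeated here
# so that the sketch is self-contained.
mysort <- function(x){
    sort.helper <- function(x){
        prefix <- strsplit(x, "[0-9]")
        prefix <- sapply(prefix, "[", 1)
        prefix[is.na(prefix)] <- ""
        suffix <- strsplit(x, "[^0-9]")
        suffix <- as.numeric(sapply(suffix, "[", 2))
        suffix[is.na(suffix)] <- -Inf
        remainder <- sub("[^0-9]+", "", x)
        remainder <- sub("[0-9]+", "", remainder)
        if (all (remainder == "")) list(prefix, suffix)
        else c(list(prefix, suffix), Recall(remainder))
    }
    ord <- do.call("order", sort.helper(x))
    x[ord]
}

# Hypothetical helper (an assumption, not from the thread): builds n names
# of the form "q<a>.<b>. ..." with k randomly chosen numeric components.
make_names <- function(n, k) {
    parts <- replicate(k, sample(1:20, n, replace = TRUE), simplify = FALSE)
    paste0("q", do.call(paste, c(parts, sep = ".")))
}

set.seed(1)
for (k in c(1, 3, 6)) {
    x <- make_names(200, k)
    # Elapsed time for 10 repeated sorts of the same 200 names.
    elapsed <- system.time(for (i in 1:10) mysort(x))[["elapsed"]]
    cat("components:", k, " elapsed:", elapsed, "\n")
}
```

The absolute numbers will depend on the machine and the R version, so only the trend across values of k is of interest.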
Best,
John
On Sun, 21 Dec 2008 23:28:51 -0500
"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote:
> Another possibility is to use strapply in gsubfn giving a solution
> that is non-recursive and shorter:
>
> library(gsubfn)
>
> mysort2 <- function(s) {
>   L <- strapply(s, "([0-9]+)|([^0-9]+)",
>     ~ if (nchar(x)) sprintf("%9d", as.numeric(x)) else y)
>   L2 <- t(do.call(cbind, lapply(L, ts)))
>   L3 <- replace(L2, is.na(L2), "")
>   ord <- do.call(order, as.data.frame(L3, stringsAsFactors = FALSE))
>   s[ord]
> }
>
>
> First strapply breaks up each string into a character vector of the
> numeric and non-numeric components. We pad each numeric component on the
> left with spaces using sprintf so they are all 9 wide. The next line
> turns that into a matrix L2 and then we replace the NAs giving L3. Finally
> we order it and apply the ordering, ord, to get the sorted version.
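For readers without gsubfn, the same split, pad and order idea can be sketched in base R. This is not the code above but a rough variant under stated assumptions: natural_sort_sketch() is a hypothetical name, digit runs are assumed to be at most 9 characters long, and the padding uses zeros rather than spaces so that the comparison does not depend on how the locale collates blanks.

```r
# Hypothetical base-R variant of the tokenise-and-pad idea described above;
# the function name and the zero padding are assumptions, not from the thread.
natural_sort_sketch <- function(s) {
    if (length(s) == 0) return(s)
    # Split each name into runs of digits and runs of non-digits.
    tokens <- regmatches(s, gregexpr("[0-9]+|[^0-9]+", s))
    # Zero-pad digit runs to width 9 so that they compare numerically
    # when treated as strings (assumes runs of at most 9 digits).
    pad <- function(tok) {
        is_num <- grepl("^[0-9]+$", tok)
        tok[is_num] <- sprintf("%09d", as.integer(tok[is_num]))
        tok
    }
    tokens <- lapply(tokens, pad)
    # One sort key per token position; names with fewer tokens get "".
    width <- max(lengths(tokens))
    keys <- lapply(seq_len(width), function(j)
        vapply(tokens, function(tok) if (j <= length(tok)) tok[j] else "",
               character(1)))
    s[do.call(order, keys)]
}

s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
       "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
natural_sort_sketch(s)
```

On the two sample vectors from the original post this should give the same ordering as mysort(); like the padded approach above, it treats "x2" and "x02" as ties.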
>
> The gsubfn home page is at:
> http://gsubfn.googlecode.com
>
> Here is some sample output:
>
> > mysort2(s)
>  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
>  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
> [13] "y10a10"
> > mysort(s)
>  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
>  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
> [13] "y10a10"
>
> > mysort2(t)
> [1] "q2.1.1" "q10.1.1" "q10.2.1" "q10.10.2"
> > mysort(t)
> [1] "q2.1.1" "q10.1.1" "q10.2.1" "q10.10.2"
>
>
> On Sun, Dec 21, 2008 at 9:57 PM, John Fox <jfox at mcmaster.ca> wrote:
> > Dear Gabor,
> >
> > Thanks for this -- I was unaware of mixedsort(). As you point out,
> > however, mixedsort() doesn't cover all of the cases in which I'm
> > interested and which are handled by mysort().
> >
> > Regards,
> > John
> >
> > On Sun, 21 Dec 2008 20:51:17 -0500
> > "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote:
> >> mixedsort in gtools will give the same result as mysort(s) but
> >> differs in the case of t.
> >>
> >> On Sun, Dec 21, 2008 at 8:33 PM, John Fox <jfox at mcmaster.ca> wrote:
> >> > Dear r-helpers,
> >> >
> >> > I'm looking for a way of sorting variable names in a "natural" order, when
> >> > the names are composed of digits and other characters. I know that this is a
> >> > vague idea, and that sorting character strings is a complex topic, but
> >> > perhaps a couple of examples will clarify what I mean:
> >> >
> >> >> s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
> >> > + "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
> >> >
> >> >> sort(s)
> >> >  [1] "var10a2" "var2"    "x02"     "x02a"    "x02b"    "x1a"
> >> >  [7] "x1b"     "y10"     "y10a1"   "y10a10"  "y10a2"   "y1a1"
> >> > [13] "y2"
> >> >
> >> >> mysort(s)
> >> >  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
> >> >  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
> >> > [13] "y10a10"
> >> >
> >> >> t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")
> >> >
> >> >> sort(t)
> >> > [1] "q10.1.1" "q10.10.2" "q10.2.1" "q2.1.1"
> >> >
> >> >> mysort(t)
> >> > [1] "q2.1.1" "q10.1.1" "q10.2.1" "q10.10.2"
> >> >
> >> > Here, sort() is the standard R function and mysort() is a replacement, which
> >> > sorts the names into the order that seems natural to me, at least in the
> >> > cases that I've tried:
> >> >
> >> > mysort <- function(x){
> >> >     sort.helper <- function(x){
> >> >         prefix <- strsplit(x, "[0-9]")
> >> >         prefix <- sapply(prefix, "[", 1)
> >> >         prefix[is.na(prefix)] <- ""
> >> >         suffix <- strsplit(x, "[^0-9]")
> >> >         suffix <- as.numeric(sapply(suffix, "[", 2))
> >> >         suffix[is.na(suffix)] <- -Inf
> >> >         remainder <- sub("[^0-9]+", "", x)
> >> >         remainder <- sub("[0-9]+", "", remainder)
> >> >         if (all (remainder == "")) list(prefix, suffix)
> >> >         else c(list(prefix, suffix), Recall(remainder))
> >> >     }
> >> >     ord <- do.call("order", sort.helper(x))
> >> >     x[ord]
> >> > }
> >> >
> >> > I have a couple of applications in mind, one of which is recognizing
> >> > repeated-measures variables in "wide" longitudinal datasets, which often are
> >> > named in the form x1, x2, ... , xn.
> >> >
> >> > mysort(), which works by recursively slicing off pairs of non-digit and
> >> > digit strings, seems more complicated than it should have to be, and I
> >> > wonder whether anyone has a more elegant solution. I don't think that
> >> > efficiency is a serious issue for the applications I'm considering, but of
> >> > course a more efficient solution would be of interest.
> >> >
> >> > Thanks,
> >> > John
> >> >
> >> > ------------------------------
> >> > John Fox, Professor
> >> > Department of Sociology
> >> > McMaster University
> >> > Hamilton, Ontario, Canada
> >> > web: socserv.mcmaster.ca/jfox
> >> >
> >
--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
http://socserv.mcmaster.ca/jfox/