[R] sorting variable names containing digits
Gabor Grothendieck
ggrothendieck at gmail.com
Mon Dec 22 13:40:17 CET 2008
Note that mysort2 is slightly more general, since it also handles
strings that begin with digits:
> u <- c("51a2", "2a4")
> mysort(u)
[1] "51a2" "2a4"
> mysort2(u)
[1] "2a4" "51a2"
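
The difference seems to come from the helper inside mysort, which takes the
second piece of strsplit(x, "[^0-9]") as the first number and so presumes a
non-digit prefix. For readers without gsubfn, the same tokenize-and-pad idea
can be sketched in base R. This is only an illustration, not code from this
thread: the name natural_sort is made up, it assumes no NA inputs and digit
runs of at most 9 digits, and it relies on order(method = "radix")
(R >= 3.3.0) so that the space padding is compared in the C locale.

```r
# Hypothetical helper (not from the thread): natural sort in base R.
natural_sort <- function(x) {
  # Split each string into alternating runs of digits and non-digits.
  tokens <- regmatches(x, gregexpr("[0-9]+|[^0-9]+", x))
  n <- max(lengths(tokens), 0L)
  if (n == 0L) return(x)  # nothing to compare (empty input or empty strings)

  # Build one sort key per token position; missing tokens become "".
  keys <- lapply(seq_len(n), function(i) {
    tok <- vapply(tokens,
                  function(tk) if (length(tk) >= i) tk[i] else "",
                  character(1))
    is_num <- grepl("^[0-9]+$", tok)
    # Left-pad digit runs to width 9 so that string order matches numeric
    # order. Assumption: at most 9 digits, so as.integer() does not overflow.
    tok[is_num] <- sprintf("%9d", as.integer(tok[is_num]))
    tok
  })

  # Radix ordering compares character keys in the C locale, where a space
  # sorts before any digit or letter.
  x[do.call(order, c(keys, list(method = "radix")))]
}

u <- c("51a2", "2a4")
natural_sort(u)   # "2a4" "51a2"
```

Because each token position becomes a separate key, no recursion is needed,
and a shorter name sorts before a longer one that extends it (for example
"y10" before "y10a1").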
On Mon, Dec 22, 2008 at 12:32 AM, John Fox <jfox at mcmaster.ca> wrote:
> Dear Gabor,
>
> Thank you (again) for this second suggestion, which does exactly what I
> want. At the risk of appearing ungrateful, and although the judgment is
> admittedly subjective, I don't find it simpler than mysort().
>
> For curiosity, I tried some timings of the two functions for the sample
> problems that I supplied:
>
>> system.time(for (i in 1:100) mysort(s))
>    user  system elapsed
>   1.498   0.006   1.503
>
>> system.time(for (i in 1:100) mysort2(s))
>    user  system elapsed
>   6.026   0.028   6.059
>
>> system.time(for (i in 1:100) mysort(t))
>    user  system elapsed
>   0.858   0.003   0.874
>
>> system.time(for (i in 1:100) mysort2(t))
>    user  system elapsed
>   2.736   0.014   2.757
>
> This is on a 2.4 GHz Core 2 Duo MacBook. I don't know of course
> whether this generalizes to other problems. I suspect that the
> recursive solution will look worse as the number of "components" of the
> names increases, but of course names are unlikely to have a large
> number of components.
>
> Best,
> John
>
> On Sun, 21 Dec 2008 23:28:51 -0500
> "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote:
>> Another possibility is to use strapply in gsubfn giving a solution
>> that is non-recursive and shorter:
>>
>> library(gsubfn)
>>
>> mysort2 <- function(s) {
>>   L <- strapply(s, "([0-9]+)|([^0-9]+)",
>>     ~ if (nchar(x)) sprintf("%9d", as.numeric(x)) else y)
>>   L2 <- t(do.call(cbind, lapply(L, ts)))
>>   L3 <- replace(L2, is.na(L2), "")
>>   ord <- do.call(order, as.data.frame(L3, stringsAsFactors = FALSE))
>>   s[ord]
>> }
>>
>>
>> First strapply breaks up each string into a character vector of the
>> numeric and non-numeric components. We pad each numeric component on
>> the left with spaces using sprintf so they are all 9 wide. The next
>> line turns that into a matrix L2 and then we replace the NAs giving
>> L3. Finally we order it and apply the ordering, ord, to get the
>> sorted version.
>>
>> The gsubfn home page is at:
>> http://gsubfn.googlecode.com
>>
>> Here is some sample output:
>>
>> > mysort2(s)
>> [1] "var2" "var10a2" "x1a" "x1b" "x02" "x02a"
>> [7] "x02b" "y1a1" "y2" "y10" "y10a1" "y10a2"
>> [13] "y10a10"
>> > mysort(s)
>> [1] "var2" "var10a2" "x1a" "x1b" "x02" "x02a"
>> [7] "x02b" "y1a1" "y2" "y10" "y10a1" "y10a2"
>> [13] "y10a10"
>>
>> > mysort2(t)
>> [1] "q2.1.1" "q10.1.1" "q10.2.1" "q10.10.2"
>> > mysort(t)
>> [1] "q2.1.1" "q10.1.1" "q10.2.1" "q10.10.2"
>>
>>
>> On Sun, Dec 21, 2008 at 9:57 PM, John Fox <jfox at mcmaster.ca> wrote:
>> > Dear Gabor,
>> >
>> > Thanks for this -- I was unaware of mixedsort(). As you point out,
>> > however, mixedsort() doesn't cover all of the cases in which I'm
>> > interested and which are handled by mysort().
>> >
>> > Regards,
>> > John
>> >
>> > On Sun, 21 Dec 2008 20:51:17 -0500
>> > "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote:
>> >> mixedsort in gtools will give the same result as mysort(s) but
>> >> differs in the case of t.
>> >>
>> >> On Sun, Dec 21, 2008 at 8:33 PM, John Fox <jfox at mcmaster.ca> wrote:
>> >> > Dear r-helpers,
>> >> >
>> >> > I'm looking for a way of sorting variable names in a "natural" order,
>> >> > when the names are composed of digits and other characters. I know
>> >> > that this is a vague idea, and that sorting character strings is a
>> >> > complex topic, but perhaps a couple of examples will clarify what I
>> >> > mean:
>> >> >
>> >> >> s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
>> >> > + "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
>> >> >
>> >> >> sort(s)
>> >> > [1] "var10a2" "var2" "x02" "x02a" "x02b" "x1a"
>> >> > [7] "x1b" "y10" "y10a1" "y10a10" "y10a2" "y1a1"
>> >> > [13] "y2"
>> >> >
>> >> >> mysort(s)
>> >> > [1] "var2" "var10a2" "x1a" "x1b" "x02" "x02a"
>> >> > [7] "x02b" "y1a1" "y2" "y10" "y10a1" "y10a2"
>> >> > [13] "y10a10"
>> >> >
>> >> >> t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")
>> >> >
>> >> >> sort(t)
>> >> > [1] "q10.1.1" "q10.10.2" "q10.2.1" "q2.1.1"
>> >> >
>> >> >> mysort(t)
>> >> > [1] "q2.1.1" "q10.1.1" "q10.2.1" "q10.10.2"
>> >> >
>> >> > Here, sort() is the standard R function and mysort() is a replacement,
>> >> > which sorts the names into the order that seems natural to me, at
>> >> > least in the cases that I've tried:
>> >> >
>> >> > mysort <- function(x){
>> >> >   sort.helper <- function(x){
>> >> >     prefix <- strsplit(x, "[0-9]")
>> >> >     prefix <- sapply(prefix, "[", 1)
>> >> >     prefix[is.na(prefix)] <- ""
>> >> >     suffix <- strsplit(x, "[^0-9]")
>> >> >     suffix <- as.numeric(sapply(suffix, "[", 2))
>> >> >     suffix[is.na(suffix)] <- -Inf
>> >> >     remainder <- sub("[^0-9]+", "", x)
>> >> >     remainder <- sub("[0-9]+", "", remainder)
>> >> >     if (all(remainder == "")) list(prefix, suffix)
>> >> >     else c(list(prefix, suffix), Recall(remainder))
>> >> >   }
>> >> >   ord <- do.call("order", sort.helper(x))
>> >> >   x[ord]
>> >> > }
>> >> >
>> >> > I have a couple of applications in mind, one of which is recognizing
>> >> > repeated-measures variables in "wide" longitudinal datasets, which
>> >> > often are named in the form x1, x2, ... , xn.
>> >> >
>> >> > mysort(), which works by recursively slicing off pairs of non-digit
>> >> > and digit strings, seems more complicated than it should have to be,
>> >> > and I wonder whether anyone has a more elegant solution. I don't think
>> >> > that efficiency is a serious issue for the applications I'm
>> >> > considering, but of course a more efficient solution would be of
>> >> > interest.
>> >> >
>> >> > Thanks,
>> >> > John
>> >> >
>> >> > ------------------------------
>> >> > John Fox, Professor
>> >> > Department of Sociology
>> >> > McMaster University
>> >> > Hamilton, Ontario, Canada
>> >> > web: socserv.mcmaster.ca/jfox