[R] unexpected (?) behavior of sort=TRUE in merge function

Tue Sep 4 14:15:56 CEST 2012

Hi,
Try this:
# Coerce each column of a data frame to the type named in 'types'
# (one entry per column: "character", "numeric" or "factor").
convert.type1 <- function(obj, types) {
    for (i in seq_along(obj)) {
        FUN <- switch(types[i],
                      character = as.character,
                      numeric   = as.numeric,
                      factor    = as.factor)
        obj[, i] <- FUN(obj[, i])
    }
    obj
}

# Work on a copy so that 'test' keeps its factor columns.
test1 <- test
test1[[1]] <- convert.type1(test1[[1]], c("character", "numeric", "numeric"))
test1[[2]] <- convert.type1(test1[[2]], c("character", "numeric", "numeric"))
lapply(test1, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=TRUE, sort=TRUE))
------
------
[[2]]
   product cong        x
1        F   -1 4.315789
2        F    0 5.705263
3        F    1       NA
4        F   11       NA
5        G   -1 3.750000
6        G    0 5.680000
7        G    1       NA
8        G   11       NA
9        K   -1 3.739130
10       K    0 4.967033
11       K    1       NA
12       K   11       NA
13       L   -1 4.500000
14       L    0 6.386364
15       L    1       NA
16       L   11       NA
17      Y1   -1 3.043478
18      Y1    0 4.887640
19      Y1    1       NA
20      Y1   11       NA
21      Y2   -1 4.181818
22      Y2    0 5.207921
23      Y2    1       NA
24      Y2   11       NA
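
For illustration, a minimal sketch of what the conversion changes about ordering in general: order() on a factor follows the level order, while order() on the character labels is alphabetical. The vector f is a toy example with the same levels as product, not part of the original data, and the sketch makes no claim about what merge does internally.

```r
# Toy factor with the same level order as 'product' in the post (illustrative only).
f <- factor(c("G", "Y2", "F", "Y1", "K", "L"),
            levels = c("Y1", "Y2", "G", "F", "L", "K"))

# Ordering the factor itself follows the level order.
by_level <- as.character(f[order(f)])

# Ordering the labels as character strings is alphabetical.
by_label <- as.character(f)[order(as.character(f))]

print(by_level)
print(by_label)
```

The second ordering matches the row order of the product column in the output above.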

A.K.

----- Original Message -----
From: "Meyners, Michael" <meyners.m at pg.com>
To: "r-help at r-project.org" <r-help at r-project.org>
Cc: 
Sent: Tuesday, September 4, 2012 7:24 AM
Subject: [R] unexpected (?) behavior of sort=TRUE in merge function

All,

I realize from the archive that the sort argument in merge has been subject to discussion before, though I couldn't find an explanation for this behavior. I have tried to reduce a real example to the (kind of) minimal code below (and I have no doubt that smart people around here could achieve the same with smarter code :-)). I'm running R 2.15.1 64-bit under MS Windows 7; full session info is below.

I have a list with two data frames:

test <- list(structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, 
6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 
4L, 5L, 6L), .Label = c("Y1", "Y2", "G", "F", "L", "K"), class = "factor"), 
    cong = c(-1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, 
    1, 1, 1, 1, 11, 11, 11, 11, 11, 11), x = c(5.85714285714286, 
    5.9, 7.3, 5.85714285714286, 7.27272727272727, 4.375, 3.875, 
    2.5, 4.8, 3.625, 6.25, 4.71428571428571, 3.53571428571429, 
    4.63888888888889, 4.42424242424242, 4.78260869565217, 4.875, 
    3.80434782608696, 5.73170731707317, 5.41935483870968, 5.78125, 
    6.30188679245283, 6.87755102040816, 5.56603773584906)), .Names = c("product", 
"cong", "x"), row.names = c(NA, -24L), class = "data.frame"), 
    structure(list(product = structure(c(1L, 2L, 3L, 4L, 5L, 
    6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("Y1", "Y2", "G", 
    "F", "L", "K"), class = "factor"), cong = c(-1, -1, -1, -1, 
    -1, -1, 0, 0, 0, 0, 0, 0), x = c(3.04347826086957, 4.18181818181818, 
    3.75, 4.31578947368421, 4.5, 3.73913043478261, 4.8876404494382, 
    5.20792079207921, 5.68, 5.70526315789474, 6.38636363636364, 
    4.96703296703297)), .Names = c("product", "cong", "x"), row.names = c(NA, 
    -12L), class = "data.frame"))
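
To see which key combinations a data frame lacks relative to the full expand.grid, something like the following sketch could be used. The object d2 is a toy stand-in for the product/cong columns of test[[2]], and key() is a hypothetical helper introduced only for this illustration.

```r
# Full grid of product/cong combinations, as used in the merge calls.
grid <- expand.grid(product = c("Y1", "Y2", "G", "F", "L", "K"),
                    cong = c(-1, 0, 1, 11))

# Toy stand-in for the keys of test[[2]]: the first half of the grid.
d2 <- grid[1:12, ]

# Hypothetical helper: one string per product/cong combination.
key <- function(d) paste(d$product, d$cong, sep = "\r")

# Grid rows whose key combination does not occur in d2.
missing_keys <- grid[!key(grid) %in% key(d2), ]
print(missing_keys)
```

These are the rows that merge with all=TRUE has to add with NA in the x column.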

The dataframes are pretty much the same but for the values in the x-column and the fact that the second one has only half as many observations, missing the second half of the expand.grid if you like. Now if I run

lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=TRUE, sort=TRUE))   # sort=TRUE is the default, so it could be omitted

it sorts the first data frame according to the labels of factor "product", while for the second one the order of the first data frame passed to merge (x) is maintained (this is the difference that I could not find documented). Now I run the same code with sort=FALSE instead:

lapply(test, function(x) merge(x, expand.grid(product=c("Y1", "Y2", "G", "F", "L", "K"), cong=c(-1,0,1,11)), all=TRUE, sort=FALSE))

The results are at least consistent and fulfill my needs (this is, btw, not unexpected from the documentation). Note that I get exactly the same behavior if I apply merge to test[[1]] and test[[2]] one after the other, so it is not an issue with lapply. (I realize that my data frames are ordered by the levels of product, but using test[[2]] <- test[[2]][sample(12),] and applying the same code as above reveals that indeed no sorting is done and the order of the first data frame is maintained.)
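
Independently of how merge arranges its output, the row order can be imposed explicitly afterwards with order(). The following is only a sketch on toy data (lev, grid and x are stand-ins made up for this illustration, not the original test list), assuming the wanted order is the level order of product and then cong.

```r
# Toy stand-ins (illustrative only, not the original 'test' list).
lev  <- c("Y1", "Y2", "G", "F", "L", "K")
grid <- expand.grid(product = lev, cong = c(-1, 0, 1, 11))

set.seed(1)                          # only to make the shuffle reproducible
x <- grid[sample(12), ]              # shuffled first half of the grid
x$x <- seq_len(nrow(x))              # dummy measurements

# Merge without relying on merge()'s own sorting ...
res <- merge(x, grid, all = TRUE, sort = FALSE)

# ... then order explicitly: level order of product first, cong second.
prod_f <- factor(as.character(res$product), levels = lev)
res <- res[order(prod_f, res$cong), ]
rownames(res) <- NULL
print(res)
```

Rebuilding the factor with explicit levels before calling order() keeps the result independent of whatever levels the merged column ends up with.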

I have a working solution for myself, so I'm not after any advice on how to achieve the sorting -- I'd just like to better understand what's going on here and/or what I might have missed in the documentation or in the list archives. 

Thanks in advance, 
Michael

Session info:
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

loaded via a namespace (and not attached):
[1] tools_2.15.1
