[R] Sort problem in merge()

Gregor Gorjanc gregor.gorjanc at bfro.uni-lj.si
Mon Mar 6 10:45:31 CET 2006


Hello!

I am merging two datasets and have run into a problem with the sort
argument of merge(). Can someone please point out my error? Here is an
example.

## I have two data frames: the first with a factor, the second with a
## factor and an integer
> tmp1 <- data.frame(col1 = factor(c("A", "A", "C", "C", "0", "0")))
> tmp2 <- data.frame(col1 = factor(c("C", "D", "E", "F")), col2 = 1:4)
> tmp1
  col1
1    A
2    A
3    C
4    C
5    0
6    0
> tmp2
  col1 col2
1    C    1
2    D    2
3    E    3
4    F    4

## Now merge them
> (tmp12 <- merge(tmp1, tmp2, by.x = "col1", by.y = "col1",
                  all.x = TRUE, sort = FALSE))
  col1 col2
1    C    1
2    C    1
3    A   NA
4    A   NA
5    0   NA
6    0   NA

## As you can see, the rows were reordered, since the row order is not the
## same as in tmp1. The help page ?merge did not reveal much about
## sorting. However, I did try to see the result of the "non-default"
## setting: the help page says that the order should be the same as in 'y'.
## So the above makes sense.

## Now merge again, but swap x and y
> (tmp21 <- merge(tmp2, tmp1, by.x = "col1", by.y = "col1",
                  all.y = TRUE, sort = FALSE))
  col1 col2
1    C    1
2    C    1
3    A   NA
4    A   NA
5    0   NA
6    0   NA

## The result is the same. I am stumped here. But looking a bit closer at
## these objects, I found something peculiar:

> str(tmp1)
`data.frame':   6 obs. of  1 variable:
 $ col1: Factor w/ 3 levels "0","A","C": 2 2 3 3 1 1
> str(tmp2)
`data.frame':   4 obs. of  2 variables:
 $ col1: Factor w/ 4 levels "C","D","E","F": 1 2 3 4
 $ col2: int  1 2 3 4
> str(tmp12)
`data.frame':   6 obs. of  2 variables:
 $ col1: Factor w/ 3 levels "0","A","C": 3 3 2 2 1 1
 $ col2: int  1 1 NA NA NA NA
> str(tmp21)
`data.frame':   6 obs. of  2 variables:
 $ col1: Factor w/ 6 levels "C","D","E","F",..: 1 1 6 6 5 5
 $ col2: int  1 1 NA NA NA NA

## Is it OK that the internal representation of the factors varies between
## the two merges? The levels are also different: in the first merge only
## the levels from the original data.frame (tmp1) are used, while in the
## second merge all levels are propagated.
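
## A minimal sketch of one way to make the integer codes comparable,
## assuming it is acceptable to change the inputs before merging: give both
## key columns one shared set of levels. The name 'lev' is only
## illustrative, and this says nothing about what merge() does internally.

```r
tmp1 <- data.frame(col1 = factor(c("A", "A", "C", "C", "0", "0")))
tmp2 <- data.frame(col1 = factor(c("C", "D", "E", "F")), col2 = 1:4)

## Union of the levels of both key columns (illustrative name)
lev <- union(levels(tmp1$col1), levels(tmp2$col1))

## Re-create both factors on the shared level set; the values themselves
## are unchanged, only the level set and the integer codes change
tmp1$col1 <- factor(tmp1$col1, levels = lev)
tmp2$col1 <- factor(tmp2$col1, levels = lev)

## Both columns now report the same levels
levels(tmp1$col1)
levels(tmp2$col1)
```

## With shared levels, a given integer code refers to the same label in
## both data frames, which makes str() output easier to compare.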

## I have tried the same with characters
> tmp1$col1 <- as.character(tmp1$col1)
> tmp2$col1 <- as.character(tmp2$col1)
> (tmp12c <- merge(tmp1, tmp2, by.x = "col1", by.y = "col1",
                  all.x = TRUE, sort = FALSE))
  col1 col2
1    C    1
2    C    1
3    A   NA
4    A   NA
5    0   NA
6    0   NA

> (tmp21c <- merge(tmp2, tmp1, by.x = "col1", by.y = "col1",
                  all.y = TRUE, sort = FALSE))
  col1 col2
1    C    1
2    C    1
3    A   NA
4    A   NA
5    0   NA
6    0   NA

## The same happens with characters. Is this a bug? It definitely does not
## agree with the help page, since the order is not the same as in 'y'. Can
## someone please check on newer versions?

## Is there any other way to get the same order as in 'y' i.e. tmp1?
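
## Two possible workarounds, as a sketch only. Both keep the row order of
## tmp1 without relying on the sort argument. The first assumes that the
## keys in tmp2 are unique; the names 'res', 'idx', 'tmp1i', 'm' and '.ord'
## are illustrative.

```r
tmp1 <- data.frame(col1 = factor(c("A", "A", "C", "C", "0", "0")))
tmp2 <- data.frame(col1 = factor(c("C", "D", "E", "F")), col2 = 1:4)

## (a) Look up col2 with match(); the rows of tmp1 are never reordered.
##     Assumes each key occurs at most once in tmp2.
res <- tmp1
idx <- match(as.character(tmp1$col1), as.character(tmp2$col1))
res$col2 <- tmp2$col2[idx]
res

## (b) Carry an explicit row index through merge() and restore the
##     original order afterwards.
tmp1i <- tmp1
tmp1i$.ord <- seq_len(nrow(tmp1i))
m <- merge(tmp1i, tmp2, by = "col1", all.x = TRUE, sort = FALSE)
m <- m[order(m$.ord), ]
m$.ord <- NULL
m
```

## Variant (b) also works when tmp2 contains duplicated keys, at the cost
## of an extra helper column.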

> R.version
         _
platform i486-pc-linux-gnu
arch     i486
os       linux-gnu
system   i486, linux-gnu
status
major    2
minor    2.0
year     2005
month    10
day      06
svn rev  35749
language R

Thank you very much!

-- 
Lep pozdrav / With regards,
    Gregor Gorjanc

----------------------------------------------------------------------
University of Ljubljana     PhD student
Biotechnical Faculty
Zootechnical Department     URI: http://www.bfro.uni-lj.si/MR/ggorjan
Groblje 3                   mail: gregor.gorjanc <at> bfro.uni-lj.si

SI-1230 Domzale             tel: +386 (0)1 72 17 861
Slovenia, Europe            fax: +386 (0)1 72 17 888

----------------------------------------------------------------------
"One must learn by doing the thing; for though you think you know it,
 you have no certainty until you try." Sophocles ~ 450 B.C.



