[R] Merging by factor variables

Erik Iverson eriki at ccbr.umn.edu
Wed Feb 2 19:01:56 CET 2011



H Roark wrote:
> I'm wondering about the behavior of the merge function when using factors as by variables. I know that when you combine two factors using c() the results can be odd, as in:
> 
> c(factor(1:5),factor(6:10))
> 
> which prints: [1] 1 2 3 4 5 1 2 3 4 5
> 
> I presume this is because factors are actually stored as integers, with 6,7,8,9,10 stored internally as 1,2,3,4,5.
> 
> This concerns me somewhat, as I often merge data frames using factors as the by variables. From what I can tell, the merge function creates matches based on factor labels (i.e. the result of as.character(factor_var)) and not the internally stored integers, but I'm wondering if there are particular lurking problems that I should be aware of?  I'm especially curious as to how R recalculates the levels of the by variables in outer joins where not every observation is matched, as in:
> 
> df1<-data.frame(a=factor(c("a","b")),b=1:2)
> df2<-data.frame(a=factor(c("b","c")),c=2:3)
> df3<-merge(df1,df2,by="a",all=T)

As far as I know, there is no reason to be concerned when using merge
as you do.

The matching is done on the factor labels, as you suspect. The
recombining of the levels that merge appears to perform is actually
done by rbind, so you should read ?rbind, particularly the section
"Data frame methods". You can also study the code of
rbind.data.frame (type base::rbind.data.frame at the prompt) to see
what it's actually doing.
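
To check this for yourself, here is a small sketch using the two data
frames from your message. It only inspects the levels of the by
variable after the outer join and compares them with what rbind gives
for the same two factor columns. The name "stacked" is made up for
this illustration.

```r
# Data frames from the original question
df1 <- data.frame(a = factor(c("a", "b")), b = 1:2)
df2 <- data.frame(a = factor(c("b", "c")), c = 2:3)

# Outer join on the factor column
df3 <- merge(df1, df2, by = "a", all = TRUE)
print(df3)

# Levels of the by variable after the merge
print(levels(df3$a))

# rbind on data frames with factor columns combines the levels
# of both inputs; "stacked" is a hypothetical name
stacked <- rbind(df1["a"], df2["a"])
print(levels(stacked$a))
```

If the two calls to levels() agree and contain every label from both
inputs, the outer join has not lost or relabelled any level.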

--Erik


