[R] Update a variable in a dataframe based on variables in another dataframe of a different size

Gabor Grothendieck ggrothendieck at gmail.com
Tue Nov 12 02:20:38 CET 2013


On Mon, Nov 11, 2013 at 8:04 PM, Lopez, Dan <lopez235 at llnl.gov> wrote:
> Below is how I am currently doing this. Is there a more efficient way to do this?
> The scenario is that I have two dataframes of different sizes. I need to update one binary factor variable in one of those dataframes by matching on two variables. If there is no match keep as is otherwise update. Also the variable being update, TT in this case should remain a binary factor variable (levels='HC','TER')
>
> HTDF2<-merge(H_DF,T_DF,by=c("FY","ID"),all.x=T)
> HTDF2$TT<-factor(ifelse(is.na(HTDF2$TT.y),HTDF2$TT.x,HTDF2$TT.y),labels=c("HC","TER"))
> HTDF2<-HTDF2[,-(3:4)]
>
>
> # REPRODUCIBLE EXAMPLE DATA FOR ABOVE..
>> dput(H_DF)
> structure(list(FY = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
> 5L), .Label = c("FY09", "FY10", "FY11", "FY12", "FY13"), class = "factor"),
>     ID = c(1, 1, 1, 1, 2, 2, 2, 2, 2), TT = structure(c(1L, 1L,
>     1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("HC", "TER"), class = "factor")), .Names = c("FY",
> "ID", "TT"), class = "data.frame", row.names = c(1L, 2L, 3L,
> 4L, 6L, 7L, 9L, 10L, 11L))
>> dput(T_DF)
> structure(list(FY = structure(c(4L, 2L, 5L), .Label = c("FY09",
> "FY10", "FY11", "FY12", "FY13"), class = "factor"), ID = c(1,
> 2, 2), TT = structure(c(2L, 2L, 2L), .Label = c("HC", "TER"), class = "factor")), .Names = c("FY",
> "ID", "TT"), row.names = c(5L, 8L, 12L), class = "data.frame")
>

Here is an sqldf solution:

> library(sqldf)
> sqldf("select FY, ID, coalesce(t.TT, h.TT) TT from H_DF h left join T_DF t using(FY, ID)")
    FY ID  TT
1 FY09  1  HC
2 FY10  1  HC
3 FY11  1  HC
4 FY12  1 TER
5 FY09  2  HC
6 FY10  2 TER
7 FY11  2  HC
8 FY12  2  HC
9 FY13  2 TER


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list