[R] Update a variable in a dataframe based on variables in another dataframe of a different size

Thu Nov 14 01:58:04 CET 2013

Dan,

Gabor's solution is of course good, but here's a solution that uses only
base R capabilities, and doesn't sort as merge() does. Essentially the
same as A.K.'s, but slightly more general.

tmp1 <- match( paste(T_DF$FY,T_DF$ID) , paste(H_DF$FY,H_DF$ID) )
H_DF$TT[tmp1] <- T_DF$TT

gg <- sqldf("select FY, ID, coalesce(t.TT, h.TT) TT from H_DF h left join
T_DF t using(FY, ID)")

> for (nm in names(H_DF)) print(all.equal(H_DF[[nm]], gg[[nm]]))
[1] TRUE
[1] TRUE
[1] TRUE

It could be made into a one-liner.It would probably break if TT doesn't
have the same factor levels in both H_DF and T_DF.

As an aside, I suspect that nowadays match() is generally
under-appreciated among R users as a whole.

-Don

-- 
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062

On 11/11/13 5:20 PM, "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote:

>On Mon, Nov 11, 2013 at 8:04 PM, Lopez, Dan <lopez235 at llnl.gov> wrote:
>> Below is how I am currently doing this. Is there a more efficient way
>>to do this?
>> The scenario is that I have two dataframes of different sizes. I need
>>to update one binary factor variable in one of those dataframes by
>>matching on two variables. If there is no match keep as is otherwise
>>update. Also the variable being update, TT in this case should remain a
>>binary factor variable (levels='HC','TER')
>>
>> HTDF2<-merge(H_DF,T_DF,by=c("FY","ID"),all.x=T)
>> 
>>HTDF2$TT<-factor(ifelse(is.na(HTDF2$TT.y),HTDF2$TT.x,HTDF2$TT.y),labels=c
>>("HC","TER"))
>> HTDF2<-HTDF2[,-(3:4)]
>>
>>
>> # REPRODUCIBLE EXAMPLE DATA FOR ABOVE..
>>> dput(H_DF)
>> structure(list(FY = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
>> 5L), .Label = c("FY09", "FY10", "FY11", "FY12", "FY13"), class =
>>"factor"),
>>     ID = c(1, 1, 1, 1, 2, 2, 2, 2, 2), TT = structure(c(1L, 1L,
>>     1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("HC", "TER"), class =
>>"factor")), .Names = c("FY",
>> "ID", "TT"), class = "data.frame", row.names = c(1L, 2L, 3L,
>> 4L, 6L, 7L, 9L, 10L, 11L))
>>> dput(T_DF)
>> structure(list(FY = structure(c(4L, 2L, 5L), .Label = c("FY09",
>> "FY10", "FY11", "FY12", "FY13"), class = "factor"), ID = c(1,
>> 2, 2), TT = structure(c(2L, 2L, 2L), .Label = c("HC", "TER"), class =
>>"factor")), .Names = c("FY",
>> "ID", "TT"), row.names = c(5L, 8L, 12L), class = "data.frame")
>>
>
>Here is an sqldf solution:
>
>> library(sqldf)
>> sqldf("select FY, ID, coalesce(t.TT, h.TT) TT from H_DF h left join
>>T_DF t using(FY, ID)")
>    FY ID  TT
>1 FY09  1  HC
>2 FY10  1  HC
>3 FY11  1  HC
>4 FY12  1 TER
>5 FY09  2  HC
>6 FY10  2 TER
>7 FY11  2  HC
>8 FY12  2  HC
>9 FY13  2 TER
>
>
>-- 
>Statistics & Software Consulting
>GKX Group, GKX Associates Inc.
>tel: 1-877-GKX-GROUP
>email: ggrothendieck at gmail.com
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.