[R] merge multiple data frames

Massimo Bressan mbressan at arpa.veneto.it
Tue Jan 31 08:44:20 CET 2012


thanks don
I have here enough to study for a while....
thank you for your help
max
----- Original Message ----- 
From: "MacQueen, Don" <macqueen1 at llnl.gov>
To: "Massimo Bressan" <mbressan at arpa.veneto.it>; <r-help at r-project.org>
Sent: Monday, January 30, 2012 4:47 PM
Subject: Re: [R] merge multiple data frames


> Does this example help? It doesn't handle the problem of common field
> names, but see below for another example.
>
> df1 <- data.frame(jn=1:4, a1=letters[1:4], a2=LETTERS[1:4])
> df2 <- data.frame(jn=2:6, b1=month.abb[2:6])
> df3 <- data.frame(jn=3:7, x=rnorm(5), y=13:17)
>
> dfn <- sqldf('select * from df1 left join df2 using (jn) left join df3
> using (jn)')
>
> In this example, you automatically get all fields from all three data
> frames, without having to name them in the SQL statement -- but you should
> not have common names.
>
>
> To deal with common names, I myself would probably rename the variables in
> the data frames before trying to merge.
>
> A general method would be something like:
>  nms1 <- names(df1)
>  nms1[nms1 != 'date'] <- paste(nms1[nms1 != 'date'],'.1',sep='')
>  names(df1) <- nms1
> Of course it has to be done for every data frame, but this can be put in a
> loop, if necessary.
>
>
> However, here is an example where I have changed df1 and df2; they both
> have a field named 'aa', in addition to the matching field.
>
> df1 <- data.frame(jn=1:4, aa=letters[1:4], a2=LETTERS[1:4])
> df2 <- data.frame(jn=2:6, aa=month.abb[2:6])
> df3 <- data.frame(jn=3:7, x=rnorm(5), y=13:17)
>
> dfn <- sqldf('select jn, df1.aa aa1, df2.aa aa2,
>              a2, x, y
>   from df1 left join df2 using (jn) left join df3 using (jn)')
>
> By the way, you can still select *, even with common names:
>
>
>  dfx <- sqldf('select *   from df1 left join df2 using (jn) left join df3
> using (jn)')but you might not like the result. Try it and see!
>
>
>
>
> It's my understanding that in the current SQL definition 'as' is no longer
> required when changing field names (though it is also still allowed in the
> databases I work with, Oracle and MySQL). Perhaps sqldf does not allow it.
> I don't know.
>
> Hope this helps.
>
> -Don
>
>
>
> -- 
> Don MacQueen
>
> Lawrence Livermore National Laboratory
> 7000 East Ave., L-627
> Livermore, CA 94550
> 925-423-1062
>
>
>
>
>
> On 1/30/12 4:40 AM, "Massimo Bressan" <mbressan at arpa.veneto.it> wrote:
>
>>hi don
>>
>>I followed your advice about using sqldf package but the problem of
>>labelling the fields persists;
>>for some reasons I can not properly handle the sql 'as' statement....
>>
>>a_b<-sqldf("select a.*, b.* from a left join b on a.date=b.date")
>>a_b_c<-sqldf("select a_b.*, c.* from a_b left join c on a_b.date=c.date")
>>
>>bye
>>
>>max
>>
>>
>>
>>
>>
>>----- Original Message -----
>>From: "MacQueen, Don" <macqueen1 at llnl.gov>
>>To: "maxbre" <mbressan at arpa.veneto.it>; <r-help at r-project.org>
>>Sent: Saturday, January 28, 2012 12:24 AM
>>Subject: Re: [R] merge multiple data frames
>>
>>
>>Not tested, but this might be a case for the sqldf package.
>>
>>-Don
>>
>>-- 
>>Don MacQueen
>>
>>Lawrence Livermore National Laboratory
>>7000 East Ave., L-627
>>Livermore, CA 94550
>>925-423-1062
>>
>>
>>
>>
>>
>>On 1/26/12 9:29 AM, "maxbre" <mbressan at arpa.veneto.it> wrote:
>>
>>>This is my reproducible example (three data frames: a, b, c)
>>>
>>>a<-structure(list(date = structure(1:6, .Label = c("2012-01-03",
>>>"2012-01-04", "2012-01-05", "2012-01-06", "2012-01-07", "2012-01-08",
>>>"2012-01-09", "2012-01-10", "2012-01-11", "2012-01-12", "2012-01-13",
>>>"2012-01-14", "2012-01-15", "2012-01-16", "2012-01-17", "2012-01-18",
>>>"2012-01-19", "2012-01-20", "2012-01-21", "2012-01-22", "2012-01-23"
>>>), class = "factor"), so2 = c(0.799401398190476, 0, 0,
>>>0.0100453950434783,
>>>0.200154920565217, 0.473866969181818), nox = c(111.716109973913,
>>>178.077239330435, 191.257829021739, 50.6799951473913, 115.284643540435,
>>>110.425185027727), no = c(48.8543691516522, 88.7197448817391,
>>>93.9931932472609, 13.9759949817391, 43.1395266865217, 41.7280296016364
>>>), no2 = c(36.8673432865217, 42.37150668, 47.53311701, 29.3026882474783,
>>>49.2986070321739, 46.5978461731818), co = c(0.618856168125,
>>>0.996593475083333,
>>>0.666987416083333, 0.383437311166667, 0.281604928875, 0.155383408913043
>>>), o3 = c(12.1393100029167, 12.3522739816522, 10.9908791203043,
>>>26.9122200013043, 13.8421695947826, 12.3788847045455), ipa =
>>>c(167.541954974667,
>>>252.7196257875, 231.802370709167, 83.4850259595833, 174.394613581667,
>>>173.868599272609), ws = c(1.47191016429167, 0.765781205208333,
>>>0.937053086791667, 1.581022406625, 0.909756802125, 0.959252831695652
>>>), wd = c(45.2650019737732, 28.2493544114369, 171.049080544214,
>>>319.753674830936, 33.8713897347193, 228.368119533759), temp =
>>>c(7.91972825883333,
>>>3.79434291520833, 2.1287644735, 6.733854600625, 3.136579722,
>>>3.09864120704348), umr = c(86.11566638875, 94.5034087491667,
>>>94.14451249375, 53.1016709004167, 65.63420423, 74.955669236087
>>>)), .Names = c("date", "so2", "nox", "no", "no2", "co", "o3",
>>>"ipa", "ws", "wd", "temp", "umr"), row.names = c(NA, 6L), class =
>>>"data.frame")
>>>
>>>
>>>b<-structure(list(date = structure(1:6, .Label = c("2012-01-03",
>>>"2012-01-04", "2012-01-05", "2012-01-06", "2012-01-07", "2012-01-08",
>>>"2012-01-09", "2012-01-10", "2012-01-11", "2012-01-12", "2012-01-13",
>>>"2012-01-14", "2012-01-15", "2012-01-16", "2012-01-17", "2012-01-18",
>>>"2012-01-19", "2012-01-20", "2012-01-21", "2012-01-22", "2012-01-23"
>>>), class = "factor"), so2 = c(0, 0, 0, 0, 0, 0), nox = c(13.74758511,
>>>105.8060582, 61.22720599, 11.45280354, 56.86804174, 39.17917222
>>>), no = c(0.882593766, 48.97037506, 9.732937217, 1.794549972,
>>>16.32300019, 8.883637786), no2 = c(11.80447753, 25.35235381,
>>>28.72990261, 8.590004034, 31.9003796, 25.50512403), co = c(0.113954917,
>>>0.305985964, 0.064001839, 0, 1.86e-05, 0), o3 = c(5.570499897,
>>>9.802379608, 5.729360104, 11.91304016, 12.13407993, 10.00961971
>>>), ipa = c(6.065110207, 116.9079971, 93.21240234, 10.5777998,
>>>66.40740204, 34.47359848), ws = c(0.122115001, 0.367668003, 0.494913995,
>>>0.627124012, 0.473895013, 0.593913019), wd = c(238.485119317031,
>>>221.645073036776, 220.372076815032, 237.868340917096, 209.532933617465,
>>>215.752030286564), temp = c(4.044159889, 1.176810026, 0.142934993,
>>>0.184606999, -0.935989976, -2.015399933), umr = c(72.29229736,
>>>88.69879913, 87.49530029, 24.00079918, 44.8852005, 49.47729874
>>>)), .Names = c("date", "so2", "nox", "no", "no2", "co", "o3",
>>>"ipa", "ws", "wd", "temp", "umr"), row.names = c(NA, 6L), class =
>>>"data.frame")
>>>
>>>
>>>c<-structure(list(date = structure(1:6, .Label = c("2012-01-03",
>>>"2012-01-04", "2012-01-05", "2012-01-06", "2012-01-07", "2012-01-08",
>>>"2012-01-09", "2012-01-10", "2012-01-11", "2012-01-12", "2012-01-13",
>>>"2012-01-14", "2012-01-15", "2012-01-16", "2012-01-17", "2012-01-18",
>>>"2012-01-19", "2012-01-20", "2012-01-21", "2012-01-22", "2012-01-23"
>>>), class = "factor"), so2 = c(2.617839247, 0, 0, 0.231044086,
>>>0.944608887, 2.12400444), nox = c(308.9046313, 275.6778849, 390.0824142,
>>>178.7429364, 238.655832, 251.892601), no = c(156.0262489, 151.4412498,
>>>221.0725021, 65.96049786, 106.541748, 119.3471241), no2 = c(74.80145447,
>>>59.29991481, 66.5897975, 77.84267978, 75.68422569, 85.43044816
>>>), co = c(1.628431197, 1.716231492, 1.264678366, 1.693460745,
>>>0.780637084, 0.892724398), o3 = c(26.1473999, 15.91584015, 22.46199989,
>>>37.39400101, 15.63426018, 17.51494026), ipa = c(538.414978, 406.4620056,
>>>432.6459961, 275.2820129, 435.7909851, 436.8039856), ws = c(4.995530128,
>>>1.355309963, 1.708899975, 3.131690025, 1.546270013, 1.571320057
>>>), wd = c(58.15639877, 64.5657153143848, 39.9754269501381,
>>>24.0739884380921,
>>>55.9453098437477, 56.7648829092446), temp = c(10.24740028, 7.052690029,
>>>4.33258009, 13.91609955, 8.762220383, 11.04300022), umr = c(97.60900116,
>>>96.91899872, 96.20649719, 94.74620056, 82.04550171, 89.41320038
>>>)), .Names = c("date", "so2", "nox", "no", "no2", "co", "o3",
>>>"ipa", "ws", "wd", "temp", "umr"), row.names = c(NA, 6L), class =
>>>"data.frame")
>>>
>>>
>>>Given the three data frames ³a², ³b² and ³c², I need to merge them all by
>>>the common field ³date².
>>>The following attempt is working fine butŠ
>>>
>>># code start
>>>all<-merge(a,b,by="date",suffixes=c(".a",".b"),all=T)
>>>all<-merge(all,c,by="date",all=T)
>>># code end
>>>
>>>ŠI would like to possibly do it in just ³one shot² (a generalisation of
>>>the
>>>code for a more complex case of handling many data frames) also by
>>>assigning
>>>proper suffixes to each variable (not possible with the previous code
>>>snippet)
>>>
>>>Then I also try a different approach with the use of the library reshape
>>>and
>>>the function ³merge_all² butŠ
>>>
>>># code start
>>>library("reshape")
>>>all.new<-merge_all(a, b, c, by="date", all=T, suffixes=c(".a",".b",".c"))
>>># code end
>>>
>>>ŠI got the following error message:
>>>error in merge in merge.data.frame(as.data.frame(x), as.data.frame(y),
>>>...)
>>>:
>>>  formal argument "all" associated to different argument passed
>>>(a free translation from italian)
>>>
>>>My question is:
>>>how to accomplish the merging of multiple data frames with all the same
>>>variable names and by a single id field with the need of ³keeping track²
>>>of
>>>the original data frame by means of the use of suffixes appended to new
>>>variables?
>>>
>>>Any help much appreciated
>>>
>>>Thank you
>>>
>>>
>>>
>>>
>>>--
>>>View this message in context:
>>>http://r.789695.n4.nabble.com/merge-multiple-data-frames-tp4331089p433108
>>>9
>>>.html
>>>Sent from the R help mailing list archive at Nabble.com.
>>>
>>>______________________________________________
>>>R-help at r-project.org mailing list
>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>PLEASE do read the posting guide
>>>http://www.R-project.org/posting-guide.html
>>>and provide commented, minimal, self-contained, reproducible code.
>>
>
>



More information about the R-help mailing list