[R] Data Table Merge Help

Graeve, Nick Graeve.Nick at principal.com
Thu Feb 1 18:45:53 CET 2018


I'm not sure if this is an appropriate use of this mailing list or not, please let me know if it isn't.  I'm struggling to figure out how to merge two data tables based on max effective date logic compared to when a payment occurred.  My dtDistributions DT is a transactional dataset while dtDepartments is a domain data set containing all department names and the effective date of when department name changes have occurred.  For the Bob example below, there was a payment on 2016-01-01 which occurred in H229000.  In 2012, this department was named "Modified Name", in 2019 the department will be named "Final Name".  When I merge these two tables, I'd like it to pull the transactional data and match it up to department name "Modified Name" since that was the active department name at the time of that transaction.  I've read documentation on foverlaps, but I'm not sure if this problem is considered a range of dates or not.  At the bottom of this post is a temporarily solution that is working but it runs for a long time due to the amount of data in my actual source.

Here is some sample data to get started:
dtDistributions <- data.table(PayeeName = c("Bob", "Tracy", "Tom"),
                          Department = factor(c("H229000", "H135000", "H047800")),
                          Amount = c(5, 34, 87),
                          PaymentDT = as.Date(c("2016-01-01", "2015-01-01", "2015-01-01")))

dtDepartments <- data.table(Department = factor(c("H229000", "H229000", "H229000", "H135000", "H047800")),
                        EffDT = as.Date(c("2019-01-01", "2012-01-01", "1901-01-01", "1901-01-01", "1901-01-01")),
                        Descr = c("Final Name","Modified Name","Original Name","Payables","Postal"))

Here is the output I would like to see:
PayeeName  Department     PaymentDT   Amount
Bob        Modified Name  2016-01-01  5
Tracy      Payables       2015-01-01  34
Tom        Postal         2015-01-01  87

I was able to get this working by using the sqldf library, but it runs for a very long time in my actual dataset and I'd like to use data.table if at all possible.
joinString <- "SELECT A.PayeeName, B.Descr, A.PaymentDT, A.Amount
            FROM dtDistributions A, dtDepartments B
            WHERE A.DEPARTMENT = B.Department
            AND B.EffDT = (SELECT MAX(ED.EffDT)
                            FROM dtDepartments ED
                            WHERE B.Department = ED.Department
                            AND ED.EffDT <= A.PaymentDT)"

finalDT <- data.table(sqldf(joinString))

-----Message Disclaimer-----

This e-mail message is intended only for the use of the individual or entity to which it is addressed, and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If you are not the intended recipient, any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by reply email to Connect at principal.com and delete or destroy all copies of the original message and attachments thereto. Email sent to or from the Principal Financial Group or any of its member companies may be retained as required by law or regulation.

Nothing in this message is intended to constitute an Electronic signature for purposes of the Uniform Electronic Transactions Act (UETA) or the Electronic Signatures in Global and National Commerce Act ("E-Sign") unless a specific statement to the contrary is included in this message.

If you no longer wish to receive any further solicitation from the Principal Financial Group you may unsubscribe at https://www.principal.com/do-not-contact-form any time.

If you are a Canadian resident and no longer wish to receive commercial electronic messages you may unsubscribe at https://www.principal.com/do-not-email-request-canadian-residents any time.

This message was secured by Zix(R).

More information about the R-help mailing list