[R] merging/intersecting 2 data frames

jim holtman jholtman at gmail.com
Tue Jun 29 21:31:25 CEST 2010


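Use 'merge'. By default it keeps only the rows whose key appears in both
data frames, which is the behaviour you describe. (The b.df below has some
IDs changed from your sample so that there are matches to show.)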

> a.df
        DATE GENDER PATIENT_ID AGE             SYNDROME
1  4/16/2009      F      23686  45         RASH ON BODY
2  4/16/2009      F      13840  35         CANT URINATE
3  4/16/2009      M      12895  30       BLURRED VISION
4  4/16/2009      M      18375  33       UNABLE TO VOID
5  4/16/2009      M       2237  44         SOB WEAKNESS
6  4/16/2009      F      21484  41 TOOTH PAINTOOTH PAIN
7  4/16/2009      M      10783  37          RT ARM PAIN
8  4/16/2009      M      12610  65        L FOOT INJURY
9  4/16/2009      F       3495  29 URINARY DIFFICULTIES
10 4/16/2009      F        351  36           PT STS MVA
> b.df
   DATE_OF_DEATH    ID
1      4/19/2009 23686
2      4/19/2009 13840
3      4/19/2009 12895
4      4/19/2009 18375
5      4/19/2009   351
6      4/20/2009  3495
7      4/20/2009  4084
8      4/20/2009 19616
9      4/20/2009 17965
10     4/20/2009 11863
> merge(a.df, b.df, by.x="PATIENT_ID", by.y="ID")
  PATIENT_ID      DATE GENDER AGE             SYNDROME DATE_OF_DEATH
1        351 4/16/2009      F  36           PT STS MVA     4/19/2009
2       3495 4/16/2009      F  29 URINARY DIFFICULTIES     4/20/2009
3      12895 4/16/2009      M  30       BLURRED VISION     4/19/2009
4      13840 4/16/2009      F  35         CANT URINATE     4/19/2009
5      18375 4/16/2009      M  33       UNABLE TO VOID     4/19/2009
6      23686 4/16/2009      F  45         RASH ON BODY     4/19/2009
>
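
For anyone who wants to paste and run this, here is a minimal self-contained sketch of the same call. The two small data frames are cut-down, illustrative versions of the ones printed above (only a few rows and columns, typed in by hand), so treat them as assumptions rather than the real data. The second call, with all.x = TRUE, is an optional variant that was not part of the original request; it is included only to show what the default drops.

```r
# Illustrative subsets of the data frames shown above (hand-typed, not the full data)
a.df <- data.frame(
  PATIENT_ID = c(23686, 13840, 2237, 351),
  AGE        = c(45, 35, 44, 36),
  stringsAsFactors = FALSE
)
b.df <- data.frame(
  DATE_OF_DEATH = c("4/19/2009", "4/19/2009", "4/19/2009", "4/20/2009"),
  ID            = c(23686, 13840, 351, 4084),
  stringsAsFactors = FALSE
)

# Default merge (all = FALSE): keep only rows whose key is present in both
# data frames. by.x/by.y name the key column in each frame.
matched <- merge(a.df, b.df, by.x = "PATIENT_ID", by.y = "ID")
print(matched)

# Optional variant: keep every row of a.df; unmatched rows get NA
# in DATE_OF_DEATH.
kept <- merge(a.df, b.df, by.x = "PATIENT_ID", by.y = "ID", all.x = TRUE)
print(kept)
```

In this sketch patient 2237 has no entry in b.df, so it is absent from 'matched' and appears with NA in 'kept'. The result is sorted on the key column, which is why the row order differs from a.df.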


On Tue, Jun 29, 2010 at 3:21 PM, Erin Hodgess <erinm.hodgess at gmail.com> wrote:
> Dear R People:
>
> I have two data frames, a.df and b.df as seen here:
>
>> a.df[1:10,]
>        DATE GENDER PATIENT_ID AGE             SYNDROME
> 1  4/16/2009      F      23686  45         RASH ON BODY
> 2  4/16/2009      F      13840  35         CANT URINATE
> 3  4/16/2009      M      12895  30       BLURRED VISION
> 4  4/16/2009      M      18375  33       UNABLE TO VOID
> 5  4/16/2009      M       2237  44         SOB WEAKNESS
> 6  4/16/2009      F      21484  41 TOOTH PAINTOOTH PAIN
> 7  4/16/2009      M      10783  37          RT ARM PAIN
> 8  4/16/2009      M      12610  65        L FOOT INJURY
> 9  4/16/2009      F       3495  29 URINARY DIFFICULTIES
> 10 4/16/2009      F        351  36           PT STS MVA
>> b.df[1:10,]
>   DATE_OF_DEATH    ID
> 1      4/19/2009 21676
> 2      4/19/2009 13717
> 3      4/19/2009 20498
> 4      4/19/2009 14281
> 5      4/19/2009 38848
> 6      4/20/2009   331
> 7      4/20/2009  4084
> 8      4/20/2009 19616
> 9      4/20/2009 17965
> 10     4/20/2009 11863
>>
>
> a.df will always be larger than b.df.
>
> I want to create a third data frame that is matched on PATIENT_ID from
> a.df and ID from b.df.
>
> If there is no match from a.df$PATIENT_ID to b.df$ID, then we omit the
> row from the new data.frame.
>
> If there is a match, we include the DATE_OF_DEATH column from b.df.
>
> I've tried all kinds of tricks, but nothing works exactly as I wish.
>
> Thanks in advance,
> Sincerely,
> Erin
>
>
> --
> Erin Hodgess
> Associate Professor
> Department of Computer and Mathematical Sciences
> University of Houston - Downtown
> mailto: erinm.hodgess at gmail.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


