[R] Need to compare two columns in two data.frames and return all rows from df where rows values are missing

Jake Elmstedt j@ke@e|m@tedt @end|ng |rom gm@||@com
Tue Jun 15 21:07:29 CEST 2021


Bert,

There are significant disadvantages to your approach. This,
emphatically, is NOT the proper use case of the merge() function.

    set.seed(123)
    df1 <- data.frame(name = sample(letters[1:4], 12, TRUE), score = runif(12))
    df2 <- data.frame(name = sample(letters[3:6], 12, TRUE), score = runif(12))
    out <- merge(df1, df2, by = "name", all = TRUE)

    (out <- merge(df1, df2, by = "name", all = TRUE))
    #>    name    score.x   score.y
    #> 1     a 0.69280341        NA
    #> 2     b 0.04205953        NA
    #> 3     b 0.32792072        NA
    #> 4     b 0.89982497        NA
    #> 5     b 0.95450365        NA
    #> 6     b 0.99426978        NA
    #> 7     c 0.67757064 0.4659625
    #> 8     c 0.67757064 0.4137243
    #> 9     c 0.67757064 0.7584595
    #> 10    c 0.24608773 0.4659625
    #> 11    c 0.24608773 0.4137243
    #> 12    c 0.24608773 0.7584595
    #> 13    c 0.10292468 0.4659625
    #> 14    c 0.10292468 0.4137243
    #> 15    c 0.10292468 0.7584595
    #> 16    c 0.88953932 0.4659625
    #> 17    c 0.88953932 0.4137243
    #> 18    c 0.88953932 0.7584595
    #> 19    c 0.57263340 0.4659625
    #> 20    c 0.57263340 0.4137243
    #> 21    c 0.57263340 0.7584595
    #> 22    d 0.64050681 0.2316258
    #> 23    d 0.64050681 0.4145463
    #> 24    d 0.64050681 0.2330341
    #> 25    d 0.64050681 0.3688455
    #> 26    e         NA 0.1428000
    #> 27    e         NA 0.2164079
    #> 28    e         NA 0.1524447
    #> 29    f         NA 0.3181810
    #> 30    f         NA 0.1388061

After doing this merge (assuming there are no edge cases which I'll
discuss at the end) you'd still need to,

    Drop the incomplete cases, make a new column of the non-NA values,
and drop the two merged columns,

    out2 <- out[!complete.cases(out), ]
    (out2[["score"]] <- with(out2, ifelse(is.na(score.x), score.y, score.x)))
    #>  [1] 0.69280341 0.04205953 0.32792072 0.89982497 0.95450365 0.99426978
    #>  [7] 0.14280002 0.21640794 0.15244475 0.31818101 0.13880606
    (result <- out2[, c(1L, 4L)])
    #>    name      score
    #> 1     a 0.69280341
    #> 2     b 0.04205953
    #> 3     b 0.32792072
    #> 4     b 0.89982497
    #> 5     b 0.95450365
    #> 6     b 0.99426978
    #> 26    e 0.14280002
    #> 27    e 0.21640794
    #> 28    e 0.15244475
    #> 29    f 0.31818101
    #> 30    f 0.13880606

Now, let's talk about one possible edge case. If either original
data.frame object contains an NA value, that row could be kept
inappropriately, as you're assuming anything with an NA should be
kept.

In short, merge() is inappropriate to use here. You can force it to
(somewhat) work, assuming no complications, but it really is a poor
choice for this task.

Created on 2021-06-15 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.0.9000)

On Mon, Jun 14, 2021 at 7:50 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
>
> merge(..., all = TRUE) essentially does this for you. rows with NA's are the non-matches:
>
> merge(df1,df2,by = "name",all = TRUE)
>    name    score.x   score.y
> 1     a 0.69280341        NA
> 2     b 0.04205953        NA
> 3     b 0.32792072        NA
> 4     b 0.89982497        NA
> 5     b 0.95450365        NA
> 6     b 0.99426978        NA
> 7     c 0.67757064 0.4659625
> 8     c 0.67757064 0.4137243
> 9     c 0.67757064 0.7584595
> 10    c 0.24608773 0.4659625
> 11    c 0.24608773 0.4137243
> 12    c 0.24608773 0.7584595
> 13    c 0.10292468 0.4659625
> 14    c 0.10292468 0.4137243
> 15    c 0.10292468 0.7584595
> 16    c 0.88953932 0.4659625
> 17    c 0.88953932 0.4137243
> 18    c 0.88953932 0.7584595
> 19    c 0.57263340 0.4659625
> 20    c 0.57263340 0.4137243
> 21    c 0.57263340 0.7584595
> 22    d 0.64050681 0.2316258
> 23    d 0.64050681 0.4145463
> 24    d 0.64050681 0.2330341
> 25    d 0.64050681 0.3688455
> 26    e         NA 0.1428000
> 27    e         NA 0.2164079
> 28    e         NA 0.1524447
> 29    f         NA 0.3181810
> 30    f         NA 0.1388061
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Jun 14, 2021 at 7:09 PM Jake Elmstedt <jake.elmstedt using gmail.com> wrote:
>>
>> set.seed(123)
>>
>> df1 <- data.frame(name = sample(letters[1:4], 12, TRUE), score = runif(12))
>> head(df1)
>> #>   name      score
>> #> 1    c 0.67757064
>> #> 2    c 0.57263340
>> #> 3    c 0.10292468
>> #> 4    b 0.89982497
>> #> 5    c 0.24608773
>> #> 6    b 0.04205953
>> table(df1[["name"]])
>> #>
>> #> a b c d
>> #> 1 5 5 1
>>
>> df2 <- data.frame(name = sample(letters[3:6], 12, TRUE), score = runif(12))
>> head(df2)
>> #>   name     score
>> #> 1    c 0.7584595
>> #> 2    e 0.2164079
>> #> 3    f 0.3181810
>> #> 4    d 0.2316258
>> #> 5    e 0.1428000
>> #> 6    d 0.4145463
>> table(df2[["name"]])
>> #>
>> #> c d e f
>> #> 3 4 3 2
>>
>> df3 <- rbind(df1[!df1[["name"]] %in% df2[["name"]], ],
>>              df2[!df2[["name"]] %in% df1[["name"]], ])
>> head(df3)
>> #>    name      score
>> #> 4     b 0.89982497
>> #> 6     b 0.04205953
>> #> 7     b 0.32792072
>> #> 8     b 0.95450365
>> #> 10    a 0.69280341
>> #> 12    b 0.99426978
>> table(df3[["name"]])
>> #>
>> #> a b e f
>> #> 1 5 3 2
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list