[R] duplicated() on zero-column data frames returns empty

Mon Apr 8 19:03:00 CEST 2024

I appreciate the compliment from Ivan and still share the puzzlement at the empty return.

What is the policy for changing something that is wrong? There is a trade-off between breaking old code that worked around a problem and breaking new code written by people who make reasonable assumptions. Mathematically, it seems obvious to me that duplicated.matrix(A) should do something like this:

nr <- nrow(A)
v <- matrix(FALSE, nrow = nr, ncol = 1L) # or an ordinary vector?
if (nr > 1L) { # guard: 2:0 and 2:1 count downwards, which is not what we want
  for (i in 2:nr) {
    for (j in 1:(i - 1)) {
      # or something more complicated to handle incomparables
      if (identical(A[i, ], A[j, ])) {
        v[i] <- TRUE
        break
      }
    }
  }
}
v

Of course my code is horribly inefficient, but a real implementation should differ only in computing the same result faster. An empty vector of some type is identical to an empty vector of the same type, so for a 5-row, 0-column matrix such as matrix(0, 5, 0) this computes

      [,1]
[1,] FALSE
[2,]  TRUE
[3,]  TRUE
[4,]  TRUE
[5,]  TRUE

and I argue that that is correct.
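
As a rough illustration of "the same result, faster", here is a minimal sketch that puts each row into a list element and lets duplicated() compare the list elements. The helper name dup_rows_sketch is made up for this example, and it ignores incomparables and the MARGIN argument; it is not how duplicated.matrix is actually written.

```r
# Hypothetical helper: one list element per row, then duplicated() on the list.
dup_rows_sketch <- function(A) {
  rows <- lapply(seq_len(nrow(A)), function(i) A[i, ])
  duplicated(rows)
}

# Zero-column input: every row is the same empty vector.
A <- matrix(0, 5, 0)
dup_rows_sketch(A)
# [1] FALSE  TRUE  TRUE  TRUE  TRUE

# Ordinary input: the third row repeats the first.
B <- rbind(c(1, 2), c(3, 4), c(1, 2))
dup_rows_sketch(B)
# [1] FALSE FALSE  TRUE
```

Because seq_len() is used instead of 2:nr, the zero-row and one-row cases need no separate guard.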

A gap in documentation makes a change to the correct behaviour easier. (If the current behaviour were documented then the first step in changing the behaviour would be to issue a warning that the change is coming in a future version.) The protection for old code could be just a warning that can be turned off with a call to options. The new documentation should be more explicit.
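
To make the options idea concrete, here is a small sketch of a wrapper that warns on zero-column input unless an option is switched off. Both the function name dup_zero_col_sketch and the option name sketch.warn.zero.col.duplicated are invented for illustration; R defines no such option, and a real change would live inside the duplicated methods themselves.

```r
# Hypothetical wrapper and option name; neither exists in R itself.
dup_zero_col_sketch <- function(A) {
  if (ncol(A) > 0L) return(duplicated(A))  # unchanged behaviour otherwise
  if (isTRUE(getOption("sketch.warn.zero.col.duplicated", TRUE))) {
    warning("duplicated() on a 0-column input now returns one value per row")
  }
  seq_len(nrow(A)) > 1L  # first row is new, every later row repeats it
}

# Old code that has been checked can silence the warning:
options(sketch.warn.zero.col.duplicated = FALSE)
dup_zero_col_sketch(matrix(0, 5, 0))
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
```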

Regards,
Jorgen.

From: Mark Webster <markwebster204 using yahoo.co.uk>
To: Jorgen Harmse <jharmse using roku.com>, Ivan Krylov
        <ikrylov using disroot.org>
Cc: "r-help using r-project.org" <r-help using r-project.org>
Subject: Re: [R] duplicated() on zero-column data frames returns empty
Message-ID: <603481690.9150754.1712522666289 using mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

duplicated.matrix is an interesting one. I think a similar change would make sense, because it would have the dimensions that people would expect when using the default MARGIN = 1. However, it could be argued that it's not a needed change, because the Value section of its documentation only guarantees the dimensions of the output when using MARGIN = 0. In that case, duplicated.matrix does indeed return the expected 5x0 matrix for your example:

str(duplicated(matrix(0, 5, 0), MARGIN = 0))
# logi[1:5, 0 ]

Best Regards,
Mark Webster

From: Mark Webster <markwebster204 using yahoo.co.uk>
To: Ivan Krylov <ikrylov using disroot.org>,
        "r-help using r-project.org" <r-help using r-project.org>
Subject: Re: [R] duplicated() on zero-column data frames returns empty vector
Message-ID: <1379736116.7985600.1712306452176 using mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Do you mean that, because of the row names, all the rows should be counted as non-duplicates? Yes, I can see the argument for that, thanks. I must say I'm still puzzled as to what interpretation would motivate the current behaviour of returning a logical(0), however.

Date: Sun, 7 Apr 2024 11:00:51 +0300
From: Ivan Krylov <ikrylov using disroot.org>
To: Jorgen Harmse <JHarmse using roku.com>
Cc: "r-help using r-project.org" <r-help using r-project.org>,
        "markwebster204 using yahoo.co.uk" <markwebster204 using yahoo.co.uk>
Subject: Re: [R] duplicated() on zero-column data frames returns empty
Message-ID: <20240407110051.7924c03c using Tarkus>
Content-Type: text/plain; charset="utf-8"

On Fri, 5 Apr 2024 16:08:13 +0000,
Jorgen Harmse <JHarmse using roku.com> wrote:

> if duplicated really treated a row name as part of the row then
> any(duplicated(data.frame(...))) would always be FALSE. My expectation
> is that if key1 is a subset of key2 then all(duplicated(df[key1]) >=
> duplicated(df[key2])) should always be TRUE.

That's a good argument, thank you!
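
For what it's worth, the expectation is easy to check on a small example. The data frame and the key names below are made up for illustration; the last line only prints a length, since the result for the zero-column key is exactly what this thread is about and may depend on the R version.

```r
# Made-up data: key1 = "a" is a subset of key2 = c("a", "b").
df <- data.frame(a = c(1, 1, 2, 2), b = c("x", "x", "y", "z"))
key1 <- "a"
key2 <- c("a", "b")

d1 <- duplicated(df[key1])  # FALSE  TRUE FALSE  TRUE
d2 <- duplicated(df[key2])  # FALSE  TRUE FALSE FALSE

# A row that is a duplicate on the larger key is also one on the smaller key.
all(d1 >= d2)
# [1] TRUE

# The empty key is a subset of every key; under the same rule its result
# would need one value per row.
length(duplicated(df[character(0)]))
```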

Would you suggest similar changes to duplicated.matrix too? Currently
it too returns 0-length output for 0-column inputs:

# 0-column matrix for 0-column input
str(duplicated(matrix(0, 5, 0)))
# logi[1:5, 0 ]

# 1-column matrix for 1-column input
str(duplicated(matrix(0, 5, 1)))
# logi [1:5, 1] FALSE TRUE TRUE TRUE TRUE

# a dim-1 array for >1-column input
str(duplicated(matrix(0, 5, 10)))
# logi [1:5(1d)] FALSE TRUE TRUE TRUE TRUE

--
Best regards,
Ivan
