[Rd] infelicity in `na.print = ""` for numeric columns of data frames/formatting numeric values
Ben Bolker
bbo|ker @end|ng |rom gm@||@com
Mon Jun 5 19:27:19 CEST 2023
On 2023-06-05 9:27 a.m., Martin Maechler wrote:
>>>>>> Ben Bolker
>>>>>> on Sat, 3 Jun 2023 13:06:41 -0400 writes:
>
> > format(c(1:2, NA)) gives the last value as "NA" rather than
> > preserving it as NA, even if na.encode = FALSE (which does the
> > 'expected' thing for character vectors, but not numeric vectors).
>
> > This was already brought up in 2008 in
> > https://bugs.r-project.org/show_bug.cgi?id=12318 where Gregor Gorjanc
> > pointed out the issue. Documentation was added and the bug closed as
> > invalid. GG ended with:
>
> >> IMHO it would be better that na.encode argument would also have an
> >> effect for numeric like vectors. Nearly any function in R returns NA
> >> values and I expected the same for format, at least when na.encode=FALSE.
>
> > I agree!
>
> I do too, at least "in principle", keeping in mind that
> backward compatibility is also an important principle ...
>
> Not sure if the 'na.encode' argument should matter or possibly a
> new optional argument, but "in principle" I think that
>
> format(c(1:2, NA, 4))
>
> should preserve is.na(.) even by default.
I would say it should preserve `is.na` *only* if na.encode = FALSE -
that seems like the minimal appropriate change away from the current
behaviour.
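
To make that concrete, here is a rough sketch of the behaviour I have in mind, written as a wrapper rather than as a patch to format.default(). The name format_keep_na is hypothetical (nothing like it exists in base R), and the sketch leaves the padding alone, so the non-missing elements are still padded as if "NA" were present:

```r
# Hypothetical wrapper, not part of base R: format a vector as usual, then
# restore NA for the missing elements, but only when na.encode = FALSE.
format_keep_na <- function(x, ..., na.encode = TRUE) {
  out <- format(x, ..., na.encode = na.encode)
  if (!na.encode) out[is.na(x)] <- NA_character_
  out
}

x <- c(1:2, NA, 4)
kept <- format_keep_na(x, na.encode = FALSE)
is.na(kept)                                 # missingness survives formatting
print(kept, quote = FALSE, na.print = "")   # so na.print can take effect
```

With the default na.encode = TRUE the wrapper returns exactly what format() returns today, which is the backward-compatible part.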
>
> > I encountered this in the context of printing a data frame with
> > na.print = "", which works as expected when printing the individual
> > columns but not when printing the whole data frame (because
> > print.data.frame calls format.data.frame, which calls format.default
> > ...). Example below.
>
> > It's also different from what you would get if you converted to
> > character before formatting and printing:
>
> > print(format(as.character(c(1:2, NA)), na.encode=FALSE), na.print ="")
>
> > Everything about this is documented (if you look carefully enough),
> > but IMO it violates the principle of least surprise
> > https://en.wikipedia.org/wiki/Principle_of_least_astonishment , so I
> > would call it at least an 'infelicity' (sensu Bill Venables)
>
> > Is there any chance that this design decision could be revisited?
>
> We'd have to hear other opinions / gut feelings.
>
> Also, someone (not me) would ideally volunteer to run
> 'R CMD check <pkg>' for a few 1000 (not necessarily all) CRAN &
> BioC packages with an accordingly patched version of R-devel
> (I might volunteer to create such a branch, e.g., a bit before the R
> Sprint 2023 end of August).
I might be willing to do that, although it would be nice if there
were a pre-existing framework (analogous to r-lib/revdepcheck) for
automating it and collecting the results ...
>
>
> > cheers
> > Ben Bolker
>
>
> > ---
>
> The following issue you are raising
> may really be a *different* one, as it involves format() and
> print() methods for "data.frame", i.e.,
>
> format.data.frame() vs
> print.data.frame()
>
> which is quite a bit related, of course, to how 'numeric'
> columns are formatted -- as you note yourself below;
> I vaguely recall that the data.frame method could be an even
> "harder problem" .. but I don't remember the details.
>
> It may also be that there are no changes necessary to the
> *.data.frame() methods, and only the documentation (you mention)
> should be updated ...
I *think* that if format.default() were changed so that
na.encode=FALSE also applied to numeric types, then data frame printing
would naturally work 'right' (since print.data.frame calls
format.data.frame which calls format() for the individual columns
specifying na.encode=FALSE ...)
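
As a sanity check on that reasoning, the following sketch imitates what I understand print.data.frame() to do (format the data frame with na.encode = FALSE, convert to a character matrix, print it) and then restores the NAs by hand, standing in for what a changed format.default() would do itself. The manual restoring step is my assumption about the effect of such a change, not existing behaviour:

```r
dd <- data.frame(f = factor(1:2), c = as.character(1:2),
                 n = as.numeric(1:2), i = 1:2)
dd[3, ] <- rep(NA, 4)

# Format the columns the way print.data.frame() does ...
m <- as.matrix(format(dd, na.encode = FALSE))
# ... then put back the NAs that formatting turned into "NA" strings.
m[is.na(dd)] <- NA_character_

# na.print now applies to every column, numeric ones included.
print(m, quote = FALSE, right = TRUE, na.print = "")
```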
>
> Martin
>
> > Consider
>
> > dd <- data.frame(f = factor(1:2), c = as.character(1:2),
> >                  n = as.numeric(1:2), i = 1:2)
> > dd[3,] <- rep(NA, 4)
> > print(dd, na.print = "")
>
>
> > print(dd, na.print = "")
> >   f c  n  i
> > 1 1 1  1  1
> > 2 2 2  2  2
> > 3     NA NA
>
> > This is in fact as documented (see below), but seems suboptimal given
> > that printing the columns separately with na.print = "" would
> > successfully print the NA entries as blank even in the numeric columns:
>
> > invisible(lapply(dd, print, na.print = ""))
> > [1] 1 2
> > Levels: 1 2
> > [1] "1" "2"
> > [1] 1 2
> > [1] 1 2
>
> > * ?print.data.frame documents that it calls format() for each column
> > before printing
> > * the code of print.data.frame() shows that it calls format.data.frame()
> > with na.encode = FALSE
> > * ?format.data.frame specifically notes that na.encode "only applies to
> > elements of character vectors, not to numerical, complex nor logical
> > ‘NA’s, which are always encoded as ‘"NA"’."
>
> > So the NA values in the numeric columns become "NA" rather than
> > remaining as NA values, and are thus printed rather than being affected
> > by the na.print argument.
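
A quick way to see the documented asymmetry across types is to format one short vector of each kind with na.encode = FALSE and check whether any NA survives. The list and its names are just for illustration:

```r
# One small vector per atomic type, each containing a missing value.
inputs <- list(integer   = c(1:2, NA),
               double    = c(1.5, NA),
               logical   = c(TRUE, NA),
               complex   = c(1i, NA),
               character = c("a", NA))

# Does an NA remain after format(..., na.encode = FALSE)?
survives <- sapply(inputs, function(v) anyNA(format(v, na.encode = FALSE)))
survives
```

Only the character entry should come back TRUE; for the other types the NA has already become the string "NA" by the time na.print is consulted.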
>