[Rd] infelicity in `na.print = ""` for numeric columns of data frames/formatting numeric values

Ben Bolker bbo|ker @end|ng |rom gm@||@com
Mon Jun 5 19:27:19 CEST 2023



On 2023-06-05 9:27 a.m., Martin Maechler wrote:
>>>>>> Ben Bolker
>>>>>>      on Sat, 3 Jun 2023 13:06:41 -0400 writes:
> 
>      > format(c(1:2, NA)) gives the last value as "NA" rather than
>      > preserving it as NA, even if na.encode = FALSE (which does the
>      > 'expected' thing for character vectors, but not numeric vectors).
> 
>      > This was already brought up in 2008 in
>      > https://bugs.r-project.org/show_bug.cgi?id=12318 where Gregor Gorjanc
>      > pointed out the issue. Documentation was added and the bug closed as
>      > invalid. GG ended with:
> 
>      >> IMHO it would be better that na.encode argument would also have an
>      >> effect for numeric like vectors. Nearly any function in R returns NA
>      >> values and I expected the same for format, at least when na.encode=FALSE.
> 
>      > I agree!
> 
> I do too, at least "in principle", keeping in mind that
> backward compatibility is also an important principle ...
> 
> Not sure if the 'na.encode' argument should matter or possibly a
> new optional argument, but "in principle" I think that
> 
>    format(c(1:2, NA, 4))
> 
> should preserve is.na(.) even by default.

    I would say it should preserve `is.na` *only* if na.encode = FALSE - 
that seems like the minimal appropriate change away from the current 
behaviour.
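
To make the distinction concrete, here is a minimal sketch of the current
behaviour next to the behaviour I have in mind. format_keep_na() is a
hypothetical helper written only for illustration; it is not part of base R
and not a proposed patch, it just restores the NA entries after formatting.

```r
x <- c(1:2, NA)

# Current behaviour: a numeric NA is encoded as the string "NA",
# whether or not na.encode = FALSE is given.
fx <- format(x, na.encode = FALSE)
is.na(fx)   # FALSE FALSE FALSE

# Character input: na.encode = FALSE leaves the NA as a real NA.
fc <- format(as.character(x), na.encode = FALSE)
is.na(fc)   # FALSE FALSE TRUE

# Hypothetical helper (an assumption for illustration, not base R):
# format as usual, then put NA back wherever the input was NA.
format_keep_na <- function(x, ...) {
  out <- format(x, ...)
  out[is.na(x)] <- NA_character_
  out
}

fk <- format_keep_na(x)
is.na(fk)   # FALSE FALSE TRUE

# With real NAs preserved, na.print can take effect when printing.
print(fk, quote = FALSE, na.print = "")
```

The point is only that is.na() survives formatting in the last two cases
but not in the first.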

> 
>      > I encountered this in the context of printing a data frame with
>      > na.print = "", which works as expected when printing the individual
>      > columns but not when printing the whole data frame (because
>      > print.data.frame calls format.data.frame, which calls format.default
>      > ...).  Example below.
> 
>      > It's also different from what you would get if you converted to
>      > character before formatting and printing:
> 
>      > print(format(as.character(c(1:2, NA)), na.encode=FALSE), na.print ="")
> 
>      > Everything about this is documented (if you look carefully enough),
>      > but IMO it violates the principle of least surprise
>      > https://en.wikipedia.org/wiki/Principle_of_least_astonishment , so I
>      > would call it at least an 'infelicity' (sensu Bill Venables)
> 
>      > Is there any chance that this design decision could be revisited?
> 
> We'd have to hear other opinions / gut feelings.
> 
> Also, someone (not me) would ideally volunteer to run
> 'R CMD check <pkg>' for a few 1000 (not necessarily all) CRAN &
> BioC packages with an accordingly patched version of R-devel
> (I might volunteer to create such a branch, e.g., a bit before the R
>   Sprint 2023 end of August).

   I might be willing to do that, although it would be nice if there 
were a pre-existing framework (analogous to r-lib/revdepcheck) for 
automating it and collecting the results ...


> 
> 
>      > cheers
>      > Ben Bolker
> 
> 
>      > ---
> 
> The following issue you are raising
> may really be a *different* one, as it involves format() and
> print() methods for "data.frame", i.e.,
> 
>     format.data.frame() vs
>      print.data.frame()
> 
> which is quite a bit related, of course, to how 'numeric'
> columns are formatted -- as you note yourself below;
> I vaguely recall that the data.frame method could be an even
> "harder problem" .. but I don't remember the details.
> 
> It may also be that there are no changes necessary to the
> *.data.frame() methods, and only the documentation (you mention)
> should be updated ...


   I *think* that if format.default() were changed so that 
na.encode=FALSE also applied to numeric types, then data frame printing 
would naturally work 'right' (since print.data.frame calls 
format.data.frame, which calls format() for the individual columns 
specifying na.encode=FALSE ...)
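
As a rough check of that reasoning, the sketch below imitates the
format-then-print-as-matrix path by hand. It is an approximation of what
print.data.frame does, not its actual code, and format_keep_na() is again a
hypothetical stand-in for a format() that honours na.encode=FALSE for
numeric columns.

```r
dd <- data.frame(f = factor(1:2), c = as.character(1:2),
                 n = as.numeric(1:2), i = 1:2)
dd[3, ] <- rep(NA, 4)

# Hypothetical helper (not base R): format, then restore the NAs.
format_keep_na <- function(x, ...) {
  out <- format(x, ...)
  out[is.na(x)] <- NA_character_
  out
}

# Format every column to character, giving a character matrix
# with one column per data frame column.
m <- vapply(dd, format_keep_na, character(nrow(dd)))
rownames(m) <- row.names(dd)

# Printing the matrix with na.print = "" should now blank out the
# NA cells in the numeric columns as well as the others.
print(m, quote = FALSE, right = TRUE, na.print = "")
```

If the NA entries reach the matrix print step as real NAs, na.print gets a
chance to act on them, which is all the change would need to achieve.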
> 
> Martin
> 
>      > Consider
> 
>      > dd <- data.frame(f = factor(1:2), c = as.character(1:2), n =
>      > as.numeric(1:2), i = 1:2)
>      > dd[3,] <- rep(NA, 4)
>      > print(dd, na.print = "")
> 
> 
>      > print(dd, na.print = "")
>      >   f c  n  i
>      > 1 1 1  1  1
>      > 2 2 2  2  2
>      > 3     NA NA
> 
>      > This is in fact as documented (see below), but seems suboptimal given
>      > that printing the columns separately with na.print = "" would
>      > successfully print the NA entries as blank even in the numeric columns:
> 
>      > invisible(lapply(dd, print, na.print = ""))
>      > [1] 1 2
>      > Levels: 1 2
>      > [1] "1" "2"
>      > [1] 1 2
>      > [1] 1 2
> 
>      > * ?print.data.frame documents that it calls format() for each column
>      > before printing
>      > * the code of print.data.frame() shows that it calls format.data.frame()
>      > with na.encode = FALSE
>      > * ?format.data.frame specifically notes that na.encode "only applies to
>      > elements of character vectors, not to numerical, complex nor logical
>      > ‘NA’s, which are always encoded as ‘"NA"’."
> 
>      > So the NA values in the numeric columns become "NA" rather than
>      > remaining as NA values, and are thus printed rather than being affected
>      > by the na.print argument.
> 


