[Rd] Undesirable behaviour of base::factor

peter dalgaard pdalgd using gmail.com
Fri May 24 10:34:05 CEST 2024


I think this is a "Doctor, it hurts when I do this" issue. 

The root of it is that as.character() behaves differently on integer and floating-point values.

> factor(100000)
[1] 1e+05
Levels: 1e+05

> factor(100000,levels=100000)
[1] 1e+05
Levels: 1e+05

> factor(100000,levels=100000:100000)
[1] <NA>
Levels: 100000

> factor(as.integer(100000),levels=100000:100000)
[1] 100000
Levels: 100000

Or, more directly: It is the difference between these

> as.character(seq(99999L,100001L,1L))
[1] "99999"  "100000" "100001"
> as.character(seq(99999L,100001L,1))
[1] "99999"  "1e+05"  "100001"

in which, for the double values, the formatting code has detected that "1e+05" is shorter than "100000"; integers, however, are never converted to scientific notation.
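
To make the mechanism concrete, here is a minimal sketch of the comparison that produces the NA. It only approximates what factor() does internally (convert x to character, then match() it against the levels), and the variable names are purely illustrative.

```r
x  <- 100000            # a double
lv <- 100000:100000     # an integer

x_chr  <- as.character(x)    # "1e+05"
lv_chr <- as.character(lv)   # "100000"

# The two strings differ, so the lookup fails and the value maps to NA
idx <- match(x_chr, lv_chr)

# Storing x as an integer makes both sides format identically
idx_int <- match(as.character(as.integer(x)), lv_chr)

print(c(idx, idx_int))
```

The comparison happens on strings, so two numerically equal values can still fail to match.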

You can play whack-a-mole with this sort of issue: Fix a perceived problem in one place only to find a new problem popping up elsewhere. It is probably better just to never trust character conversion of numbers beyond 99999.
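
If you do need such factors, one way to sidestep the conversion is to control it yourself. The following is a sketch of two user-side workarounds, not a proposed change to factor(). It assumes the values are whole numbers, and lab() is a hypothetical helper name.

```r
x <- c(99999, 100000, 100001)   # doubles that happen to be whole numbers

# Option 1: store the values as integers, which never print in scientific notation
f1 <- factor(as.integer(x), levels = 99999:100001)

# Option 2: build the labels explicitly, using the same formatter on both sides.
# trim = TRUE suppresses the padding that format() adds to reach a common width.
lab <- function(v) format(v, scientific = FALSE, trim = TRUE)
f2 <- factor(lab(x), levels = lab(99999:100001))

print(f1)
print(f2)
```

Note that format() chooses one common format for the whole vector, so with non-integer values the two sides could still disagree; check the labels before relying on this.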

- pd



> On 23 May 2024, at 18:33 , Andrew Gustar <andrew_gustar using msn.com> wrote:
> 
> This thread on stackoverflow illustrates the problem... https://stackoverflow.com/questions/78523612/r-factor-from-numeric-vector-drops-every-100-000th-element-from-its-levels
> 
> The issue is that factor(), applied to numeric values, uses as.character(), which converts numbers to character strings according to the value of scipen. The stackoverflow thread illustrates a case where this causes some factor levels to become NA. There is also an inconsistency between the treatment of numeric and integer values.
> 
> On the face of it, using format(..., scientific = FALSE) instead of as.character() would solve the problem, but this probably needs careful thinking through in case of other side effects!

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com


