[Rd] Undesirable behaviour of base::factor
peter dalgaard
pdalgd using gmail.com
Fri May 24 10:34:05 CEST 2024
I think this is a "Doctor, it hurts when I do this" issue.
The root of it is that as.character() behaves differently on integers and floating-point values.
> factor(100000)
[1] 1e+05
Levels: 1e+05
> factor(100000,levels=100000)
[1] 1e+05
Levels: 1e+05
> factor(100000,levels=100000:100000)
[1] <NA>
Levels: 100000
> factor(as.integer(100000),levels=100000:100000)
[1] 100000
Levels: 100000
Or, more directly, it is the difference between these:
> as.character(seq(99999L,100001L,1L))
[1] "99999" "100000" "100001"
> as.character(seq(99999L,100001L,1))
[1] "99999" "1e+05" "100001"
In the second call, by = 1 is a double, so the whole sequence is double, and the formatting code has detected that "1e+05" is shorter than "100000". Integers, by contrast, are never converted to scientific notation.
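To see how that difference leads to the <NA> in the factor() call, the mismatch can be traced by hand. This is only a sketch: the names x and lev are made up for illustration, and the match() line mimics what factor() effectively does with the character forms rather than reproducing its actual source.

```r
x   <- 100000          # a double
lev <- 100000:100000   # an integer, because ":" returns integers here

typeof(x)     # "double"
typeof(lev)   # "integer"

# The two types are converted to different strings
chr_x   <- as.character(x)     # "1e+05"
chr_lev <- as.character(lev)   # "100000"

# Values are matched to levels via these strings, so nothing matches
idx <- match(chr_x, chr_lev)
idx   # NA
```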
You can play whack-a-mole with this sort of issue: Fix a perceived problem in one place only to find a new problem popping up elsewhere. It is probably better just to never trust character conversion of numbers beyond 99999.
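As a rough illustration of avoiding the default conversion, here are two possible workarounds. Both are sketches under assumptions: the values are whole numbers within integer range, and the names x, lab, f_int and f_chr are invented for the example. The second option follows the format(..., scientific = FALSE) idea from the quoted message below and may have side effects of its own.

```r
x <- c(99999, 100000, 100001)   # doubles

# Default conversion: the middle level is stored as "1e+05"
default_levels <- levels(factor(x))

# Option 1: convert to integer first
# (only sensible for whole numbers that fit in an integer)
f_int <- factor(as.integer(x), levels = 99999:100001)

# Option 2: build the labels explicitly in fixed notation;
# trim = TRUE stops format() from padding to a common width
lab   <- format(x, scientific = FALSE, trim = TRUE)
f_chr <- factor(lab, levels = lab)

anyNA(f_int)    # FALSE
levels(f_chr)   # "99999" "100000" "100001"
```

Either way, the point is to decide the character form yourself instead of leaving it to as.character().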
- pd
> On 23 May 2024, at 18:33 , Andrew Gustar <andrew_gustar using msn.com> wrote:
>
> This thread on stackoverflow illustrates the problem... https://stackoverflow.com/questions/78523612/r-factor-from-numeric-vector-drops-every-100-000th-element-from-its-levels
>
> The issue is that factor(), applied to numeric values, uses as.character(), which converts numbers to character strings according to the value of scipen. The stackoverflow thread illustrates a case where this causes some factor levels to become NA. There is also an inconsistency between the treatment of numeric and integer values.
>
> On the face of it, using format(..., scientific = FALSE) instead of as.character() would solve the problem, but this probably needs careful thinking through in case of other side effects!
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk Priv: PDalgd using gmail.com