[Rd] type.convert and doubles

Duncan Murdoch murdoch.duncan at gmail.com
Sun Apr 27 18:26:31 CEST 2014


On 27/04/2014, 10:16 AM, Hadley Wickham wrote:
> Is there a reason it's a factor and not a string? A string would seem to be
> more appropriate to me (given that we know it's a number that can't be
> represented exactly by R)

The user asked that anything which can't be converted to a number should 
be converted to a factor.

Yes, that's a bad default, but some people rely on it.
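
For illustration only, here is a minimal sketch of the switch involved. It
assumes nothing beyond the documented as.is argument of type.convert() and
uses plain non-numeric strings as hypothetical input, since those take the
"can't be converted to a number" branch in every R version. As far as I can
tell, in R 3.1.0 a numeric-looking string that cannot be represented exactly
takes the same branch.

```r
# Sketch: as.is decides what type.convert() returns for input
# that it cannot convert to a number.
x <- c("apple", "banana")            # hypothetical non-numeric input

f <- type.convert(x, as.is = FALSE)  # converted to a factor
s <- type.convert(x, as.is = TRUE)   # left as a character vector

class(f)
class(s)
```

Passing as.is = TRUE is how a caller asks for strings instead of factors.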

Duncan Murdoch

>
> Hadley
>
> On Saturday, April 26, 2014, Martin Maechler <maechler at stat.math.ethz.ch>
> wrote:
>
>>>>>>> Simon Urbanek <simon.urbanek at r-project.org>
>>>>>>>      on Sat, 19 Apr 2014 13:06:15 -0400 writes:
>>
>>      > On Apr 19, 2014, at 9:00 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>      >>>>>>> McGehee, Robert <Robert.McGehee at geodecapital.com>
>>      >>>>>>> on Thu, 17 Apr 2014 19:15:47 -0400 writes:
>>      >>
>>      >>>> This is all application specific and
>>      >>>> sort of beyond the scope of type.convert(), which now behaves as it
>>      >>>> has been documented to behave.
>>      >>
>>      >>> That's only a true statement because the documentation was changed
>>      >>> to reflect the new behavior! The new feature in type.convert certainly does
>>      >>> not behave according to the documentation as of R 3.0.3. Here's a snippet:
>>      >>
>>      >>> The first type that can accept all the
>>      >>> non-missing values is chosen (numeric and complex return values
>>      >>> will represented approximately, of course).
>>      >>
>>      >>> The key phrase is in parentheses, which reminds the user to expect
>>      >>> a possible loss of precision. That important parenthetical was removed from
>>      >>> the documentation in R 3.1.0 (among other changes).
>>      >>
>>      >>> Putting aside the fact that this introduces a large amount of
>>      >>> unnecessary work rewriting SQL / data import code and SQL packages, my biggest
>>      >>> conceptual problem is that I can no longer rely on a particular function
>>      >>> call returning a particular class. In my example querying stock prices,
>>      >>> about 5% of prices came back as factors and the remaining 95% as numeric,
>>      >>> so we had random errors popping up throughout the morning.
>>      >>
>>      >>> Here's a short example showing how the new behavior can be
>>      >>> unreliable. I pass a character representation of a uniformly distributed
>>      >>> random variable to type.convert. 90% of the time it is converted to
>>      >>> "numeric" and 10% of the time to a "factor" (in R 3.1.0). In the 10% of cases in
>>      >>> which type.convert converts to a factor, the leading non-zero digit is
>>      >>> always a 9. So if you were expecting a numeric value, then 1 in 10 times
>>      >>> you may have a bug in your code that didn't exist before.
>>      >>
>>      >>>> options(digits=16)
>>      >>>> cl <- NULL; for (i in 1:10000) cl[i] <- class(type.convert(format(runif(1))))
>>      >>>> table(cl)
>>      >>> cl
>>      >>>  factor numeric
>>      >>>     990    9010
>>      >>
>>      >> Yes.
>>      >>
>>      >> Murray's point is valid, too.
>>      >>
>>      >> But in my view, with the reasoning we have seen here,
>>      >> *and* with the well known software design principle of
>>      >> "least surprise" in mind,
>>      >> I also do think that the default for type.convert() should be what
>>      >> it has been for > 10 years now.
>>      >>
>>
>>      > I think there should be two separate discussions:
>>
>>      > a) have an option (argument to type.convert and possibly read.table)
>>      > to enable/disable this behavior. I'm strongly in favor of this.
>>
>> In my (not committed) version of R-devel, I now have
>>
>>   > str(type.convert(format(1/3, digits=17), exact=TRUE))
>>    Factor w/ 1 level "0.33333333333333331": 1
>>   > str(type.convert(format(1/3, digits=17), exact=FALSE))
>>    num 0.333
>>
>> where the 'exact' argument name has been ``imported'' from the
>> underlying C code.
>>
>> [ As we CRAN package writers know by now, arguments nowadays can
>>    hardly be abbreviated anymore, so I am not open to longer
>>    alternative argument names. As someone who likes touch typing, I'm
>>    not fond of camel case or other keyboard gymnastics (;-) but if someone
>>    has a great idea for a better argument name.... ]
>>
>> Instead of only  TRUE/FALSE, we could consider NA with
>> semantics "FALSE + warning" or also "TRUE + warning".
>>
>>
>>      > b) decide what the default for a) will be. I have no strong opinion,
>>      > I can see arguments in both directions
>>
>> I think many have seen the good arguments in both directions.
>> I'm still strongly advocating that we value long term stability
>> higher here, and revert to more compatibility with the many
>> years of previous versions.
>>
>> If we used a default of 'exact=NA', I'd like it to mean
>> FALSE + warning, but would not much oppose TRUE + warning.
>>
>> I agree that for the TRUE case, it may make more sense to return
>> a string-like object of a new (simple) class such as "bignum"
>> that was mentioned in this thread.
>>
>> OTOH, this functionality should make it into an R 3.1.1 in the
>> not so distant future, and thinking through consequences and
>> implementing the new class approach may just take a tad too much
>> time...
>>
>> Martin
>>
>>      > But most importantly I think a) is better than the status quo - even
>>      > if the discussion about b) drags out.
>>
>>      > Cheers,
>>      > Simon
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>


