[R] strangely long floating point with write.table()

Duncan Murdoch murdoch.duncan at gmail.com
Sun Mar 16 13:00:34 CET 2014


On 14-03-16 2:13 AM, Mike Miller wrote:
> On Sat, 15 Mar 2014, peter dalgaard wrote:
>
>> On 15 Mar 2014, at 20:54 , Mike Miller <mbmiller+l at gmail.com> wrote:
>>
>>> $ cat data1.txt
>>> 0.005
>>> 0.00499999999999989
>>>
>>> I don't know why it shows 17 digits and doesn't round to 15, but it is showing that the numbers are different, for some reason.
>>>
>>
>> Aiding my weakening eyesight a little:
>>
>> 0.004 999 999 999 999 89
>>
>> Notice that that makes 15 _significant_ digits.
>
> OK, now I feel really stupid.  Of course it's 15 mantissa digits, not 15
> %f digits, or whatever that should be called.  Sorry about that.
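
That 15-digit rule is also what write.table() applies: numeric columns
are converted with the internal equivalent of digits = 15 (see
?write.table, which points to format() for finer control).  You can see
it with as.character(); the exact trailing digits below assume ordinary
IEEE-754 doubles:

print(2 - 1.995)         # default 7 significant digits: 0.005
as.character(2 - 1.995)  # all 15: "0.00499999999999989"
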
>
>
>>> Do you understand why there is a difference between 1-0.995 and 2-1.995
>>> in their internal representations?
>>
>> Let's see,  that'll be like
>>
>> 1 - 2/3 vs. 10 - 29/3
>>
>> on a decimal computer if someone is perverse enough to give input in
>> base 3 (i.e., 1.0 - 0.2 ternary vs. 101.0 - 100.2 ternary). Assuming
>> the computer uses floating point with 3 significant digits (and taking
>> some liberties compared to what real computers actually do), we have
>>
>>         1 = 1.000 * 10^0
>>        10 = 1.000 * 10^1
>>       2/3 = 0.667 * 10^0
>>      29/3 = 0.967 * 10^1
>>
>>   1 - 2/3 = 0.333 * 10^0
>> 10 - 29/3 = 0.033 * 10^1 = 0.330 * 10^0
>>
>> So, yes, I think I do understand how these things can happen.
>
> Yes, and that's a nice explanation, but you had me at "_significant_".  I
> don't know why I didn't get that in the first place.  So the difference in
> my example is that in 0.995 (i.e., 9.950e-1) the 5 is the third
> significant digit, while in 1.995 it is the fourth, so 1-0.995 provides a
> more precise representation of 0.005 than 2-1.995 does.
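
To see the same cancellation in binary, print both results to full
precision; the exact trailing digits below assume IEEE-754 doubles,
which R uses on all common platforms:

sprintf("%.20f", 1 - 0.995)   # "0.00500000000000000444"
sprintf("%.20f", 2 - 1.995)   # "0.00499999999999989342"
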
>
> I always knew there was some numerical reason why I was getting very long
> stretches of 9s or 0s in the write.table() output, but my concern is
> really with how to prevent that from happening.  So the question still is,
> how do I avoid getting 0.00499999999999989 in my output file when I want
> 0.005?  I'm sure I'm not alone in this.  It looks like the standard answer
> is to use format().  For example, I could do this:
>
>> write.table(format(data, digits=13, trim=TRUE), file="data.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)

You could also round the numbers to 13 digits before printing, e.g.

write.table(signif(data, digits=13), ...)

(or use round() if you want to specify decimal places instead of 
significant digits).
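
For instance, with the two values from your earlier example (a small
sketch; the exact digits are what an IEEE-754 platform gives):

data <- data.frame(x = c(1 - 0.995, 2 - 1.995))

write.table(data, row.names=FALSE, col.names=FALSE)
## 0.005
## 0.00499999999999989

write.table(signif(data, digits=13), row.names=FALSE, col.names=FALSE)
## 0.005
## 0.005

Unlike format(), signif() leaves the columns numeric, so no
intermediate character data frame is built.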

Duncan Murdoch

>
> That does fix the long numbers -- all of them are reduced to three digits.
> The one thing that concerns me is that when format() is called, isn't it
> making an object that could take up a lot of memory if the data frame is
> large?  The data frame created by format() might use a lot more memory
> than the original data frame if it is converting a lot of doubles (8
> bytes) to a lot of possibly 16-byte strings.  For example, -10/81 takes up
> 8 bytes as a double, but converted by format() with digits=13 it becomes a
> 16-character string: the sign, the leading zero, the decimal point and 13
> digits (plus a delimiter when there are many values per line of output):
>
>> write.table(format(-10/81, digits=13), row.names=FALSE, col.names=FALSE, quote=FALSE)
> -0.1234567901235
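
A quick check of that width, for what it's worth:

nchar(format(-10/81, digits=13))   # 16
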
>
> I'm assuming that write.table() is streaming the data into a file (or
> stdout) and not building a complete representation of the output in memory
> first.  It looks like format() creates a data frame in which every
> variable is converted to character type.  So a digits=N option in
> write.table() would be more than a mere convenience: with large data
> frames it would make it possible to write out things that are too large to
> push through format().  When files are already super-large, we really want
> to avoid expanding the number of digits per value in the output.
>
> Mike
>
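
On the memory question: yes, format() on a data frame returns a data
frame of character columns, and character data in R cost substantially
more than doubles (every string is a separate object with its own
header).  A rough way to see the scale, with a made-up million-row
data frame (sizes are approximate and platform-dependent):

x <- data.frame(v = rnorm(1e6))
object.size(x)                     # on the order of 8 MB (doubles)
object.size(format(x, digits=13))  # several times larger (character)

Rounding with signif() before calling write.table() avoids that
intermediate character copy entirely, since the result is still
numeric.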