[R] strangely long floating point with write.table()
Mike Miller
mbmiller+l at gmail.com
Mon Mar 17 21:03:37 CET 2014
On Sun, 16 Mar 2014, Duncan Murdoch wrote:
> On 14-03-16 2:13 AM, Mike Miller wrote:
>
>> I always knew there was some numerical reason why I was getting very
>> long stretches of 9s or 0s in the write.table() output, but my concern
>> is really with how to prevent that from happening. So the question
>> still is, how do I avoid getting 0.00499999999999989 in my output file
>> when I want 0.005? I'm sure I'm not alone in this. It looks like the
>> standard answer is to use format(). For example, I could do this:
>>
>> write.table(format(data, digits=13, trim=T), file="data.txt", row.names=F, col.names=F, quote=F)
>
> You could also round the numbers to 13 digits before printing, e.g.
>
> write.table(signif(data, digits=13), ...)
>
> (or use round() if you want to specify decimal places instead of
> significant digits).

I like that idea. It can be used in exactly that way only if all of the
variables in the data frame are numeric. I can use signif() on the
numeric variables before using write.table():

data[,c(5:9,11,13,17:21)] <- signif(data[,c(5:9,11,13,17:21)], digits=5)

Then write.table(data) does what I'd want. It works better than format().

Example:

> data2 <- data
> data2[,c(5:9,11,13,17:21)] <- signif(data2[,c(5:9,11,13,17:21)], digits=5)
>
> write.table(format(data[1:10,], digits=5, trim=T), row.names=F, col.names=F, quote=F)
3100674 303164 6 1 -0.11869237 0.0073947 0.0084493 0.00012708 -0.1320 1 0 TT 1 GA 0 0 2 0.000 0 0.000 0.00000
3100765 303321 6 1 0.01434426 -0.0136545 -0.0017613 0.08502718 1.0365 1 1 CT 1 GA 1 0 1 0.000 0 1.000 1.00000
3101201 304352 6 1 -0.01710451 -0.0169568 0.0320392 0.00884896 0.4279 1 1 CT 2 GG 1 0 1 0.000 0 1.000 1.00000
3101862 305250 6 1 -0.01328316 0.0108479 -0.0170081 -0.03692398 -0.4470 1 0 TT 1 GA 0 0 2 0.000 1 1.000 1.00000
3103579 305847 6 1 0.01593935 0.0096043 -0.0437904 -0.02224669 -0.3365 1 0 TT 1 GA 0 0 2 0.000 0 0.000 0.00000
3103645 305961 6 1 0.20441289 -0.1090142 0.2727132 -0.29890268 1.5818 1 2 CC 0 AA 2 0 0 0.000 0 2.000 4.00000
3104098 308536 6 1 0.02842117 0.0562814 -0.0715448 -0.11510562 0.9974 1 0 TT 0 AA 0 1 1 0.944 0 0.944 0.89114
3104361 306928 6 1 -0.04840401 0.0266719 -0.0548747 -0.03640484 0.4499 1 0 TT 0 AA 0 0 2 0.000 1 1.000 1.00000
5100094 503136 6 1 0.19702704 -0.4104611 0.0869569 -0.03952420 0.3057 1 2 CC 0 AA 2 0 0 0.000 0 2.000 4.00000
5100938 503615 6 1 0.00098838 0.0267176 0.0451301 0.04790277 -0.1743 2 1 CT 0 AA 1 0 1 0.000 0 1.000 1.00000
>
> write.table(data2[1:10,], row.names=F, col.names=F, quote=F)
3100674 303164 6 1 -0.11869 0.0073947 0.0084493 0.00012708 -0.132 1 0 TT 1 GA 0 0 2 0 0 0 0
3100765 303321 6 1 0.014344 -0.013654 -0.0017613 0.085027 1.0365 1 1 CT 1 GA 1 0 1 0 0 1 1
3101201 304352 6 1 -0.017105 -0.016957 0.032039 0.008849 0.4279 1 1 CT 2 GG 1 0 1 0 0 1 1
3101862 305250 6 1 -0.013283 0.010848 -0.017008 -0.036924 -0.447 1 0 TT 1 GA 0 0 2 0 1 1 1
3103579 305847 6 1 0.015939 0.0096043 -0.04379 -0.022247 -0.3365 1 0 TT 1 GA 0 0 2 0 0 0 0
3103645 305961 6 1 0.20441 -0.10901 0.27271 -0.2989 1.5818 1 2 CC 0 AA 2 0 0 0 0 2 4
3104098 308536 6 1 0.028421 0.056281 -0.071545 -0.11511 0.9974 1 0 TT 0 AA 0 1 1 0.944 0 0.944 0.89114
3104361 306928 6 1 -0.048404 0.026672 -0.054875 -0.036405 0.4499 1 0 TT 0 AA 0 0 2 0 1 1 1
5100094 503136 6 1 0.19703 -0.41046 0.086957 -0.039524 0.3057 1 2 CC 0 AA 2 0 0 0 0 2 4
5100938 503615 6 1 0.00098838 0.026718 0.04513 0.047903 -0.1743 2 1 CT 0 AA 1 0 1 0 0 1 1
format() with digits=5 is still showing 7 significant digits. Why?
signif() only shows 5. Another thing that is desirable about signif(), at
least for me, in the write.table() output is that a number like 1.00000 is
presented simply as 1. I think I would always want that.
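
The extra digits most likely come from how format() treats its digits argument. According to the format() documentation, digits is a suggestion: enough decimal places are used so that the smallest (in magnitude) number in the vector has that many significant digits, and every element then gets the same number of decimals. A small sketch with two values taken from the fifth column of the output above (the names x, formatted and rounded are made up for illustration):

```r
# Two values from the fifth column of the example output.
x <- c(-0.11869237, 0.00098838)

# format() chooses one common number of decimals for the whole vector.
# 0.00098838 needs 8 decimals to show 5 significant digits, so the
# larger value is written with 8 decimals as well.
formatted <- format(x, digits = 5, trim = TRUE)
print(formatted)

# signif() rounds each element on its own to 5 significant digits,
# so -0.11869237 becomes -0.11869 while 0.00098838 is left unchanged.
rounded <- signif(x, digits = 5)
print(as.character(rounded))
```

Because write.table() converts each number to text individually, the values rounded with signif() are written without padding digits or trailing zeros.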
I also think the signif() approach, if I replace some variables with
signif() versions of those variables, doesn't force me to make a really
huge additional data frame.
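
Replacing only some variables can also be done without listing column numbers by hand. The following is a minimal sketch, assuming a small made-up data frame (df, with hypothetical columns id, geno and x) in place of the real data; it selects the double-precision columns with is.double() and rounds only those:

```r
# Made-up stand-in for the real data: an integer ID, a genotype
# string, and a double column with a floating-point artifact.
df <- data.frame(
  id   = c(3100674L, 3100765L),
  geno = c("TT", "CT"),
  x    = c(0.00499999999999989, -0.11869237),
  stringsAsFactors = FALSE
)

# Logical vector marking the double-precision columns only;
# integer and character columns are left untouched.
dbl_cols <- vapply(df, is.double, logical(1))

# Round just those columns to 5 significant digits.
df[dbl_cols] <- lapply(df[dbl_cols], signif, digits = 5)

# Capture what write.table() would write, then show it.
out <- capture.output(
  write.table(df, row.names = FALSE, col.names = FALSE, quote = FALSE)
)
cat(out, sep = "\n")
```

Using is.double() rather than is.numeric() keeps integer ID columns out of the rounding step, which seems to match the intent of the explicit column list above.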

In R, if I do this...

data <- signif(data, digits=12)

...do I need to have enough memory to hold two copies of the data frame
called "data"? If the answer is "yes," then that is a problem.

I assume that "data" and "signif(data, digits=12)" use the same amount of
memory: 8 bytes per numeric value (double precision), and that is much
better than "format(data, digits=12)" because the numbers then must use
more than 12 bytes each.
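
As far as I understand R's copy semantics, signif() returns a new object, so the old and new versions both exist for a moment until the old one is garbage collected. One way to limit that temporary overhead is to round one column at a time. The sketch below is only an illustration on made-up data (df, x and the size_* names are hypothetical); it also compares sizes with object.size():

```r
set.seed(1)
# Made-up data frame standing in for the real one.
df <- data.frame(a = runif(1000), b = runif(1000))

# Round column by column, so at most one extra column-sized
# vector exists at a time instead of a second full data frame.
for (j in which(vapply(df, is.double, logical(1)))) {
  df[[j]] <- signif(df[[j]], digits = 12)
}

# Compare the memory footprint of one column in three forms.
x <- df$a
size_plain   <- as.numeric(object.size(x))
size_rounded <- as.numeric(object.size(signif(x, digits = 12)))
size_text    <- as.numeric(object.size(format(x, digits = 12)))

cat("double vector:   ", size_plain, "bytes\n")
cat("after signif():  ", size_rounded, "bytes\n")
cat("after format():  ", size_text, "bytes\n")
```

The rounded vector is still a double vector and so has the same size as the original, while the character vector returned by format() stores a pointer per element plus the strings themselves and comes out considerably larger.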
Mike