[R] strangely long floating point with write.table()

Duncan Murdoch murdoch.duncan at gmail.com
Mon Mar 17 23:38:47 CET 2014


On 14-03-17 6:22 PM, Mike Miller wrote:
> On Mon, 17 Mar 2014, Berend Hasselman wrote:
>
>> On 17-03-2014, at 21:03, Mike Miller <mbmiller+l at gmail.com> wrote:
>>
>>> …...
>>> data[,c(5:9,11,13,17:21)] <- signif(data[,c(5:9,11,13,17:21)], digits=5)
>>>
>>> Then write.table(data) does what I'd want.  It works better than format(). Example:
>>>
>>>> data2 <- data
>>>> data2[,c(5:9,11,13,17:21)] <- signif(data2[,c(5:9,11,13,17:21)], digits=5)
>>>>
>>>> write.table(format(data[1:10,], digits=5, trim=T), row.names=F, col.names=F, quote=F)
>>> 3100674 303164 6 1 -0.11869237 0.0073947 0.0084493 0.00012708 -0.1320 1 0 TT 1 GA 0 0 2 0.000 0 0.000 0.00000
>>> 3100765 303321 6 1 0.01434426 -0.0136545 -0.0017613 0.08502718 1.0365 1 1 CT 1 GA 1 0 1 0.000 0 1.000 1.00000
>>> 3101201 304352 6 1 -0.01710451 -0.0169568 0.0320392 0.00884896 0.4279 1 1 CT 2 GG 1 0 1 0.000 0 1.000 1.00000
>>> 3101862 305250 6 1 -0.01328316 0.0108479 -0.0170081 -0.03692398 -0.4470 1 0 TT 1 GA 0 0 2 0.000 1 1.000 1.00000
>>> 3103579 305847 6 1 0.01593935 0.0096043 -0.0437904 -0.02224669 -0.3365 1 0 TT 1 GA 0 0 2 0.000 0 0.000 0.00000
>>> 3103645 305961 6 1 0.20441289 -0.1090142 0.2727132 -0.29890268 1.5818 1 2 CC 0 AA 2 0 0 0.000 0 2.000 4.00000
>>> 3104098 308536 6 1 0.02842117 0.0562814 -0.0715448 -0.11510562 0.9974 1 0 TT 0 AA 0 1 1 0.944 0 0.944 0.89114
>>> 3104361 306928 6 1 -0.04840401 0.0266719 -0.0548747 -0.03640484 0.4499 1 0 TT 0 AA 0 0 2 0.000 1 1.000 1.00000
>>> 5100094 503136 6 1 0.19702704 -0.4104611 0.0869569 -0.03952420 0.3057 1 2 CC 0 AA 2 0 0 0.000 0 2.000 4.00000
>>> 5100938 503615 6 1 0.00098838 0.0267176 0.0451301 0.04790277 -0.1743 2 1 CT 0 AA 1 0 1 0.000 0 1.000 1.00000
>>>>
>>>> write.table(data2[1:10,], row.names=F, col.names=F, quote=F)
>>> 3100674 303164 6 1 -0.11869 0.0073947 0.0084493 0.00012708 -0.132 1 0 TT 1 GA 0 0 2 0 0 0 0
>>> 3100765 303321 6 1 0.014344 -0.013654 -0.0017613 0.085027 1.0365 1 1 CT 1 GA 1 0 1 0 0 1 1
>>> 3101201 304352 6 1 -0.017105 -0.016957 0.032039 0.008849 0.4279 1 1 CT 2 GG 1 0 1 0 0 1 1
>>> 3101862 305250 6 1 -0.013283 0.010848 -0.017008 -0.036924 -0.447 1 0 TT 1 GA 0 0 2 0 1 1 1
>>> 3103579 305847 6 1 0.015939 0.0096043 -0.04379 -0.022247 -0.3365 1 0 TT 1 GA 0 0 2 0 0 0 0
>>> 3103645 305961 6 1 0.20441 -0.10901 0.27271 -0.2989 1.5818 1 2 CC 0 AA 2 0 0 0 0 2 4
>>> 3104098 308536 6 1 0.028421 0.056281 -0.071545 -0.11511 0.9974 1 0 TT 0 AA 0 1 1 0.944 0 0.944 0.89114
>>> 3104361 306928 6 1 -0.048404 0.026672 -0.054875 -0.036405 0.4499 1 0 TT 0 AA 0 0 2 0 1 1 1
>>> 5100094 503136 6 1 0.19703 -0.41046 0.086957 -0.039524 0.3057 1 2 CC 0 AA 2 0 0 0 0 2 4
>>> 5100938 503615 6 1 0.00098838 0.026718 0.04513 0.047903 -0.1743 2 1 CT 0 AA 1 0 1 0 0 1 1
>>>
>>> format() with digits=5 is still showing 7 significant digits.  Why? signif() only shows 5.
>>
>>
>>  From the help of format:
>>
>> digits "how many significant digits are to be used for numeric and
>> complex x. The default, NULL, uses getOption("digits"). This is a
>> suggestion: enough decimal places will be used so that the smallest (in
>> magnitude) number has this many significant digits, and also to satisfy
>> nsmall. (For the interpretation for complex numbers see signif.)"
>>
>> So if I read this correctly, the smallest number will have 5 significant
>> digits. Larger numbers may get more, given the fixed width (see argument
>> trim).
>
>
> Thanks!  Another thing I've figured out:  Use of "drop0trailing=T" in
> format() fixes the .00000 stuff that I didn't like:
>
>> write.table(format(data[1:10,], digits=5, trim=T, drop0trailing=T), row.names=F, col.names=F, quote=F)
> 3100674 303164 6 1 -0.11869237 0.0073947 0.0084493 0.00012708 -0.132 1 0 TT 1 GA 0 0 2 0 0 0 0
> 3100765 303321 6 1 0.01434426 -0.0136545 -0.0017613 0.08502718 1.0365 1 1 CT 1 GA 1 0 1 0 0 1 1
> 3101201 304352 6 1 -0.01710451 -0.0169568 0.0320392 0.00884896 0.4279 1 1 CT 2 GG 1 0 1 0 0 1 1
> 3101862 305250 6 1 -0.01328316 0.0108479 -0.0170081 -0.03692398 -0.447 1 0 TT 1 GA 0 0 2 0 1 1 1
> 3103579 305847 6 1 0.01593935 0.0096043 -0.0437904 -0.02224669 -0.3365 1 0 TT 1 GA 0 0 2 0 0 0 0
> 3103645 305961 6 1 0.20441289 -0.1090142 0.2727132 -0.29890268 1.5818 1 2 CC 0 AA 2 0 0 0 0 2 4
> 3104098 308536 6 1 0.02842117 0.0562814 -0.0715448 -0.11510562 0.9974 1 0 TT 0 AA 0 1 1 0.944 0 0.944 0.89114
> 3104361 306928 6 1 -0.04840401 0.0266719 -0.0548747 -0.03640484 0.4499 1 0 TT 0 AA 0 0 2 0 1 1 1
> 5100094 503136 6 1 0.19702704 -0.4104611 0.0869569 -0.0395242 0.3057 1 2 CC 0 AA 2 0 0 0 0 2 4
> 5100938 503615 6 1 0.00098838 0.0267176 0.0451301 0.04790277 -0.1743 2 1 CT 0 AA 1 0 1 0 0 1 1
>
> That's pretty close to the signif() output I was getting (above) but with
> a few digits added because of the small numbers (as you explained).
>
> format() with trim=T seems to just delete the spaces that format() would
> have added for column alignment.  It doesn't seem to affect the number of
> digits displayed.
>
> I still have more to figure out, but for most smaller table-writing jobs,
> I think something like the last command above will be my usual approach.
> In real life, I would use a tab delimiter, though.
>
> I'm still unsure about the best way for dealing with very large data
> frames, though.  There's probably a good way to stream data into a file so
> that it doesn't have to be written as an additional large object in
> memory.  There must be a way to make a connection and then just pipe the
> formatted data into it.  Maybe something related to sprintf() will work.
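
A minimal sketch of the streaming idea in the paragraph above, assuming it
is acceptable to round with signif() one chunk at a time and append each
chunk to an open file connection. The helper name write_chunked and its
arguments are hypothetical, not part of base R. signif() is used instead
of format() because it rounds each value independently, so the result
does not depend on where the chunk boundaries fall.

```r
# Hypothetical helper: round numeric columns and write a data frame to a
# tab-delimited text file a few rows at a time, so that only one chunk
# is copied in memory at once.
write_chunked <- function(df, path, chunk_rows = 10000L, digits = 5) {
  con <- file(path, open = "wt")   # open once; later writes continue the file
  on.exit(close(con))
  n <- nrow(df)
  if (n == 0L) return(invisible(path))
  for (s in seq(1L, n, by = chunk_rows)) {
    e <- min(s + chunk_rows - 1L, n)
    chunk <- df[s:e, , drop = FALSE]
    # Round only the double columns; integer and character columns are untouched
    num <- vapply(chunk, is.double, logical(1))
    if (any(num)) chunk[num] <- lapply(chunk[num], signif, digits = digits)
    write.table(chunk, file = con, sep = "\t",
                row.names = FALSE, col.names = FALSE, quote = FALSE)
  }
  invisible(path)
}
```

Something like write_chunked(data, "out.txt") would then write the whole
data frame while only ever copying chunk_rows rows at a time.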

You've never explained why you want to write these gigantic text files.
Text is a lossy way to store numbers:  it takes 15 bytes to store
about 8 bytes of information, and you'll probably lose a few bits at the
end.  Why not write your files in binary, storing exactly what you have
in memory?  It'll be a lot faster to write and to read, you won't need
to duplicate the data before writing, etc.
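
A minimal sketch of the binary alternative, assuming base R's saveRDS()
and writeBin() are acceptable formats for whatever reads the file later.
The vector x and the temporary file names are made up for illustration.

```r
x <- c(0.1 + 0.2, 1/3, pi)   # illustrative values only

# Text round trip: write.table() writes up to 15 significant digits,
# so the values read back may not be bit-for-bit identical to x.
txt <- tempfile()
write.table(data.frame(x = x), txt, row.names = FALSE, col.names = FALSE)
x_txt <- scan(txt, quiet = TRUE)
same_after_text <- identical(x, x_txt)

# Binary round trip with saveRDS()/readRDS(): stores the R object as is.
rds <- tempfile(fileext = ".rds")
saveRDS(data.frame(x = x), rds)
x_rds <- readRDS(rds)$x

# Raw doubles with writeBin()/readBin(): 8 bytes per value.
bin <- tempfile()
con <- file(bin, open = "wb")
writeBin(x, con)
close(con)
con <- file(bin, open = "rb")
x_bin <- readBin(con, what = "double", n = length(x))
close(con)

identical(x, x_rds)   # exact round trip
identical(x, x_bin)   # exact round trip
```

saveRDS() keeps column names and types; writeBin() is more compact to
reason about but leaves it to the reader to know the layout of the file.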

Duncan Murdoch



