[R] rounding down with as.integer

Thu Jan 1 22:28:25 CET 2015

I've been followeing this little tour round the murkier bistros
in the back-streets of R with interest! Then it occurred to me:
What is wrong with [using example data]:

  x0 <- c(0,1,2,0.325,1.12,1.9,1.003)
  x1 <- as.integer(as.character(1000*x0))
  n1 <- c(0,1000,2000,325,1120,1900,1003)

  x1 - n1
  ## [1] 0 0 0 0 0 0 0

  ## But, of course:
  1000*x0 - n1
  ## [1]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
  ## [5]  0.000000e+00  0.000000e+00 -1.136868e-13

Or am I missing somthing else in what Mike Miller is seeking to do?
Ted.

On 01-Jan-2015 19:58:02 Mike Miller wrote:
> I'd have to say thanks, but no thanks, to that one!  ;-)  The problem is 
> that it will take a long time and it will give the same answer.
> 
> The first time I did this kind of thing, a year or two ago, I manipulated 
> the text data to produce integers before putting the data into R.  The 
> data were a little different -- already zero padded with three digits to 
> the right of the decimal and one to the left, so all I had to do was drop 
> the decimal point.  The as.integer(1000*x+.5) method is very fast and it 
> works great.
> 
> I could have done that this time, but I was also saving to other formats, 
> so I had the data already in the format I described.
> 
> Mike
> 
> 
> On Thu, 1 Jan 2015, Richard M. Heiberger wrote:
> 
>> Interesting.  Following someone on this list today the goal is input
>> the data correctly.
>> My inclination would be to read the file as text, pad each number to
>> the right, drop the decimal point,
>> and then read it as an integer.
>> 0 1 2 0.325 1.12 1.9
>> 0.000 1.000 2.000 0.325 1.120 1.900
>> 0000 1000 2000 0325 1120 1900
>>
>> The pad step is the interesting step.
>>
>> ## 0 1 2 0.325 1.12 1.9
>> ## 0.000 1.000 2.000 0.325 1.120 1.900
>> ## 0000 1000 2000 0325 1120 1900
>>
>> x.in <- scan(text="
>> 0 1 2 0.325 1.12 1.9 1.
>> ", what="")
>>
>> padding <- c(".000", "000", "00", "0", "")
>>
>> x.pad <- paste(x.in, padding[nchar(x.in)], sep="")
>>
>> x.nodot <- sub(".", "", x.pad, fixed=TRUE)
>>
>> x <- as.integer(x.nodot)
>>
>>
>> Rich
>>
>>
>> On Thu, Jan 1, 2015 at 1:21 PM, Mike Miller <mbmiller+l at gmail.com> wrote:
>>> On Thu, 1 Jan 2015, Duncan Murdoch wrote:
>>>
>>>> On 31/12/2014 8:44 PM, David Winsemius wrote:
>>>>>
>>>>>
>>>>> On Dec 31, 2014, at 3:24 PM, Mike Miller wrote:
>>>>>
>>>>>> This is probably a FAQ, and I don't really have a question about it, but
>>>>>> I just ran across this in something I was working on:
>>>>>>
>>>>>>> as.integer(1000*1.003)
>>>>>>
>>>>>> [1] 1002
>>>>>>
>>>>>> I didn't expect it, but maybe I should have.  I guess it's about the
>>>>>> machine precision added to the fact that as.integer always rounds down:
>>>>>>
>>>>>>
>>>>>>> as.integer(1000*1.003 + 255 * .Machine$double.eps)
>>>>>>
>>>>>> [1] 1002
>>>>>>
>>>>>>> as.integer(1000*1.003 + 256 * .Machine$double.eps)
>>>>>>
>>>>>> [1] 1003
>>>>>>
>>>>>>
>>>>>> This does it right...
>>>>>>
>>>>>>> as.integer( round( 1000*1.003 ) )
>>>>>>
>>>>>> [1] 1003
>>>>>>
>>>>>> ...but this seems to always give the same answer and it is a little
>>>>>> faster in my application:
>>>>>>
>>>>>>> as.integer( 1000*1.003 + .1 )
>>>>>>
>>>>>> [1] 1003
>>>>>>
>>>>>>
>>>>>> FYI - I'm reading in a long vector of numbers from a text file with no
>>>>>> more than three digits to the right of the decimal.  I'm converting them
>>>>>> to
>>>>>> integers and saving them in binary format.
>>>>>>
>>>>>
>>>>> So just add 0.0001 or even .0000001 to all of them and coerce to integer.
>>>>
>>>>
>>>> I don't think the original problem was stated clearly, so I'm not sure
>>>> whether this is a solution, but it looks wrong to me.  If you want to
>>>> round
>>>> to the nearest integer, why not use round() (without the as.integer
>>>> afterwards)?  Or if you really do want an integer, why add 0.1 or 0.0001,
>>>> why not add 0.5 before calling as.integer()?  This is the classical way to
>>>> implement round().
>>>>
>>>> To state the problem clearly, I'd like to know what result is expected for
>>>> any real number x.  Since R's numeric type only approximates the real
>>>> numbers we might not be able to get a perfect match, but at least we could
>>>> quantify how close we get.  Or is the input really character data?  The
>>>> original post mentioned reading numbers from a text file.
>>>
>>>
>>>
>>> Maybe you'd like to know what I'm really doing.  I have 1600 text files
>>> each
>>> with up to 16,000 lines with 3100 numbers per line, delimited by a single
>>> space.  The numbers are between 0 and 2, inclusive, and they have up to
>>> three digits to the right of the decimal.  Every possible value in that
>>> range will occur in the data.  Some examples numbers: 0 1 2 0.325 1.12 1.9.
>>> I want to multiply by 1000 and store them as 16-bit integers (uint16).
>>>
>>> I've been reading in the data like so:
>>>
>>>> data <- scan( file=FILE, what=double(), nmax=3100*16000)
>>>
>>>
>>> At first I tried making the integers like so:
>>>
>>>> ptm <- proc.time() ; ints <- as.integer( 1000 * data ) ; proc.time()-ptm
>>>
>>>    user  system elapsed
>>>   0.187   0.387   0.574
>>>
>>> I decided I should compare with the result I got using round():
>>>
>>>> ptm <- proc.time() ; ints2 <- as.integer( round( 1000 * data ) ) ;
>>>> proc.time()-ptm
>>>
>>>    user  system elapsed
>>>   1.595   0.757   2.352
>>>
>>> It is a curious fact that only a few of the values from 0 to 2000 disagree
>>> between the two methods:
>>>
>>>> table( ints2[ ints2 != ints ] )
>>>
>>>
>>>  1001  1003  1005  1007  1009  1011  1013  1015  1017  1019  1021  1023
>>> 35651 27020 15993 11505  8967  7549  6885  6064  5512  4828  4533  4112
>>>
>>> I understand that it's all about the problem of representing digital
>>> numbers
>>> in binary, but I still find some of the results a little surprising, like
>>> that list of numbers from the table() output.  For another example:
>>>
>>>> 1000+3 - 1000*(1+3/1000)
>>>
>>> [1] 1.136868e-13
>>>
>>>> 3 - 1000*(0+3/1000)
>>>
>>> [1] 0
>>>
>>>> 2000+3 - 1000*(2+3/1000)
>>>
>>> [1] 0
>>>
>>> See what I mean?  So there is something special about the numbers around
>>> 1000.
>>>
>>> Back to the quesion at hand:  I can avoid use of round() and speed things
>>> up
>>> a little bit by just adding a small number after multiplying by 1000:
>>>
>>>> ptm <- proc.time() ; R3 <- as.integer( 1000 * data + .1 ) ;
>>>> proc.time()-ptm
>>>
>>>    user  system elapsed
>>>   0.224   0.594   0.818
>>>
>>> You point out that adding .5 makes sense.  That is probably a better idea
>>> and I should take that approach under most conditions, but in this case we
>>> can add anything between 2e-13 and about 0.99999999999 and always get the
>>> same answer.  We also have to remember that if a number might be negative
>>> (not a problem for me in this application), we need to subtract 0.5 instead
>>> of adding it.
>>>
>>> Anyway, right now this is what I'm actually doing:
>>>
>>>> con <- file( paste0(FILE, ".uint16"), "wb" )
>>>> ptm <- proc.time() ; writeBin( as.integer( 1000 * scan( file=FILE,
>>>> what=double(), nmax=3100*16000 ) + .1 ), con, size=2 ) ; proc.time()-ptm
>>>
>>> Read 48013406 items
>>>    user  system elapsed
>>>  10.263   0.733  10.991
>>>>
>>>> close(con)
>>>
>>>
>>> By the way, writeBin() is something that I learned about here, from you,
>>> Duncan.  Thanks for that, too.
>>>
>>> Mike
>>>
>>> --
>>> Michael B. Miller, Ph.D.
>>> University of Minnesota
>>> http://scholar.google.com/citations?user=EV_phq4AAAAJ
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 01-Jan-2015  Time: 21:28:22
This message was sent by XFMail