[Rd] R support for 64 bit integers

Tue Aug 10 18:37:25 CEST 2010

On Tue, 10 Aug 2010, Martin Maechler wrote:

> {Hijacking the thread from R-help to R-devel -- as I am
> consciously shifting the focus away from the original question
> ...
> }
>
>>>>>> David Winsemius <dwinsemius at comcast.net>
>>>>>>     on Tue, 10 Aug 2010 08:42:12 -0400 writes:
>
>    > On Aug 9, 2010, at 2:45 PM, Theo Tannen wrote:
>
>    >> Are integers strictly a signed 32 bit number on R even if
>    >> I am running a 64 bit version of R on a x86_64 bit
>    >> machine?
>    >>
>    >> I ask because I have integers stored in a hdf5 file where
>    >> some of the data is 64 bit integers. When I read that
>    >> into R using the hdf5 library it seems any integer
>    >> greater than 2**31 returns NA.
>
>    > That's the limit. It's hard coded and not affected by the
>    > memory pointer size.
>
>    >>
>    >> Any solutions?
>
>    > I have heard of packages that handle "big numbers". A bit
>    > of searching produces suggestions to look at gmp on CRAN
>    > and Rmpfr on R-Forge.
>
> Note that Rmpfr has been on CRAN, too, for a while now.
> If you only need large integers (and rationals), 'gmp' is enough
> though.
>
> *However* note that the gmp or Rmpfr (or any other arbitrary
> precision) implementation will be considerably slower in usage
> than if there was native 64-bit integer support.
>
> Introducing 64-bit integers natively into "base R" is an
> "interesting" project, notably if we also allowed using them for
> indices, and changed the internal structures to use them instead
> of 32-bit.
> This would allow us to free ourselves from the increasingly
> relevant maximum-atomic-object-length = 2^31 problem.
> The latter is something we have planned to address, possibly for
> R 3.0.
> However, for that, using 64-bit integers is just one
> possibility, another being to use "double precision integers".
> Personally, I'd prefer the "long long" (64-bit) integers quite
> a bit, but there are other considerations, e.g.,
> one big challenge will be to go there in a way such that not
> all R packages using compiled code will have to be patched
> extensively...
> another aspect is how the BLAS / Lapack team will address the
> problem.

At the moment, all the following are the same type:
  length of an R vector
  R integer type
  C int type
  Fortran INTEGER type

The last two are fixed at 32 bits (in practice for C, by standard for Fortran), and we would like the first and perhaps the second to become 64-bit.
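To make the 32-bit limit concrete, here is a minimal C sketch of what happens when a 64-bit value has to be squeezed into a 32-bit R-style integer. The names narrow_to_int32 and NA_INT32 are made up for illustration, and the sketch assumes R's convention of reserving the most negative 32-bit value as the integer NA; it is not the code that the hdf5 package or R itself uses.

```c
#include <stdint.h>

/* Assumption: mirrors R's convention that the most negative 32-bit value
   is reserved as the integer NA marker. NA_INT32 is a made-up name. */
#define NA_INT32 INT32_MIN

/* Hypothetical helper: narrow a 64-bit value to a 32-bit R-style integer.
   Values outside the representable range (or equal to the NA marker)
   come back as NA instead of silently wrapping around. */
int32_t narrow_to_int32(int64_t v)
{
    if (v > INT32_MAX || v <= INT32_MIN)
        return NA_INT32;
    return (int32_t)v;
}
```

With this convention the largest value that survives is 2^31 - 1; anything from 2^31 upward comes back as NA, which matches the behaviour described in the original question.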

If both the R length type and the R integer type become the same 64-bit type and replace the current integer type, then every compiled package has to change to declare the arguments as int64 (or long, on most 64-bit systems) and INTEGER*8. That should be all that is needed for most code, since C compilers nowadays already complain if you do unclean things like stuffing an int into a pointer.

If the R length type changes to something /different/ from the integer type then any compiled code has to be checked to see whether its C int arguments are lengths or integers, which is more work and more error-prone.
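A rough sketch of what that audit amounts to, assuming the length type becomes 64-bit while the element type stays 32-bit. The typedef names vec_len_t and vec_int_t are hypothetical and are not R API; the point is only that each former int has to be assigned to one role or the other.

```c
#include <stdint.h>

typedef int64_t vec_len_t;  /* hypothetical 64-bit type for lengths and indices */
typedef int32_t vec_int_t;  /* hypothetical 32-bit type for integer data values */

/* Sum the elements of an integer vector.
   Before the split, x, n and the loop index would all have been plain int;
   afterwards n and i are lengths, x holds data, and each must be retyped
   by hand. Leaving i as int would overflow once n exceeds 2^31 - 1. */
int64_t sum_elements(const vec_int_t *x, vec_len_t n)
{
    int64_t total = 0;
    for (vec_len_t i = 0; i < n; i++)
        total += x[i];
    return total;
}
```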

On the other hand, changing the integer type to 64-bit will presumably make integer code run noticeably more slowly on 32-bit systems.

In both cases, the changes could be postponed by having an option to .C/.Call forcing lengths and integers to be passed as 32-bit. This would mean that the code couldn't use large integers or large vectors, but it would keep working indefinitely.
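One way such an option might behave, sketched in C under stated assumptions: the caller holds a 64-bit length, and the legacy routine still expects .C-style pointer arguments with a 32-bit int length. The names legacy_fn, call_with_32bit_length and double_in_place are invented for this example and do not correspond to anything in R's sources.

```c
#include <stdint.h>

/* .C-style legacy routine: every argument is a pointer, length is an int. */
typedef void (*legacy_fn)(int *x, int *n);

/* Hypothetical shim: refuse lengths that do not fit in 32 bits, otherwise
   pass the narrowed length through unchanged.
   Returns 0 on success, -1 if the vector is too long for the old interface.
   Assumes int is 32 bits wide, as it is in practice. */
int call_with_32bit_length(legacy_fn f, int *x, int64_t n)
{
    if (n < 0 || n > INT32_MAX)
        return -1;
    int n32 = (int)n;
    f(x, &n32);
    return 0;
}

/* Example legacy routine, unchanged from the 32-bit world:
   doubles each element in place. */
void double_in_place(int *x, int *n)
{
    for (int i = 0; i < *n; i++)
        x[i] *= 2;
}
```

The legacy routine needs no edits; it simply never sees a vector longer than 2^31 - 1, and the caller gets an error instead of a truncated length.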

     -thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle