[Rd] Large vector support in data.frames

Ivan Krylov ikrylov at disroot.org
Tue Jul 2 16:04:44 CEST 2024


On Wed, 19 Jun 2024 09:52:20 +0200
Jan van der Laan <rhelp using eoos.dds.nl> writes:

> What is the status of supporting long vectors in data.frames (e.g. 
> data.frames with more than 2^31 records)? Is this something that is 
> being worked on? Is there a time line for this? Is this something I
> can contribute to?

Apologies if you've already received a better answer off-list.

From my limited understanding, the problem with supporting
larger-than-(2^31-1) dimensions has multiple facets:

 - In many parts of R code, there's the assumption that dim() is of
   integer type (the first sketch after this list shows the limit this
   implies). That wouldn't be a problem by itself, except...

 - R currently lacks a native 64-bit integer type. About a year ago
   Gabe Becker mentioned that Luke Tierney has been considering
   improvements in this direction, but it's hard to introduce 64-bit
   integers without making the user worry even more about data types
   (numeric != integer != 64-bit integer) or introducing a lot of
   overhead (64-bit integers being twice as large as 32-bit ones and,
   depending on the workload, frequently redundant).

 - Two-dimensional objects eventually get transformed into matrices and
   handed to LAPACK for linear algebra operations. Currently, the
   interface used by R to talk to BLAS and LAPACK only supports 32-bit
   signed integers for lengths. 64-bit BLASes and LAPACKs do exist
   (e.g. OpenBLAS can be compiled with 64-bit lengths), but we haven't
   taught R to use them.

   (This isn't limited to array dimensions, by the way. If you try to
   svd() a 40000 by 40000 matrix, it will ask for temporary memory with
   a length that overflows a signed 32-bit integer, get a much shorter
   allocation instead, promptly overflow the buffer and crash the
   process; the second sketch after this list works through the
   arithmetic.)
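
To make the first two points concrete, here is a minimal sketch of the
integer limits involved (the long-vector allocation needs roughly 2 GiB
of RAM, so skip it if that's a concern; the last line is only an aside
about an add-on class, not something base R provides):

    .Machine$integer.max           # 2147483647, i.e. 2^31 - 1
    as.integer(2^31)               # NA plus a warning: outside the integer range
    typeof(dim(matrix(0, 2, 3)))   # "integer": dimensions are stored as ints
    x <- raw(2^31)                 # a long vector, one element past that limit
    typeof(length(x))              # "double": lengths beyond 2^31 - 1 fall back
                                   # to numeric, which is exact only up to 2^53
    if (requireNamespace("bit64", quietly = TRUE))
        bit64::as.integer64(2^40)  # 64-bit integers currently live in add-on classes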
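
And a rough sketch of the arithmetic behind the svd() crash in the last
point (R's svd() goes through LAPACK's dgesdd; the exact workspace
formula isn't reproduced here, only the assumption that it grows like a
small multiple of n^2):

    n <- 40000
    n * n                  # 1.6e9 doubles just to hold the matrix itself
    3 * n * n              # a few times n^2 of workspace is already more than...
    .Machine$integer.max   # ...2147483647, the largest count a signed 32-bit
                           # integer argument can carry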

As you can see, these issues are interconnected; work on any one of
them will involve the other two.

-- 
Best regards,
Ivan


