[Rd] Large vector support in data.frames
Jan van der Laan
rhelp using eoos.dds.nl
Thu Jul 4 08:38:01 CEST 2024
Ivan, Simon,
Thanks for the replies.
I can work around the limitation. I currently either divide the data
into shards or use a list with (long) vectors, depending on what I am
trying to do. But I have to transform between the two representations,
which takes time and memory and often requires more code than if I
could have used data.frames.
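
To make the two representations concrete, here is a toy sketch (tiny
sizes for readability; a real long vector would need more than
2^31 - 1 elements and tens of gigabytes of RAM):

n <- 10                     # stand-in for a length > 2^31 - 1
x <- runif(n)
y <- sample(letters, n, replace = TRUE)

# Representation 1: a plain list of (long) column vectors, no row limit.
d_list <- list(x = x, y = y)

# Representation 2: shards, i.e. a list of ordinary data.frames.
idx <- split(seq_len(n), ceiling(seq_len(n) / 5))
d_shards <- lapply(idx, function(i) data.frame(x = x[i], y = y[i]))

# Converting between the two copies every column at least once, which
# is exactly the time and memory cost mentioned above.
d_list2 <- list(
  x = unlist(lapply(d_shards, `[[`, "x"), use.names = FALSE),
  y = unlist(lapply(d_shards, `[[`, "y"), use.names = FALSE)
)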
Being able to create large (> 2^31 - 1 rows) data.frames and to do
some basic things like selecting rows and columns would already be
really nice. That would also allow package maintainers to start
supporting these data.frames. I imagine getting large data.frames to
work in functions like lm is not trivial, and lm might not support
this any time soon. However, a package like biglm might.
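
Those "basic things" on the list-of-columns workaround look roughly
like this (a minimal sketch; select_cols and select_rows are
hypothetical helpers, not an existing API):

select_cols <- function(d, cols) d[cols]
select_rows <- function(d, i) lapply(d, `[`, i)

d <- list(x = runif(10), y = sample(letters, 10, replace = TRUE))
d_sub <- select_rows(select_cols(d, c("x", "y")), c(2, 4, 6))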
But from what you are saying, I get the impression that this is not
something that is being actively worked on. I must say, my hands a kind
of itching to try.
Best,
Jan
On 03-07-2024 09:22, Simon Urbanek wrote:
> The second point is not really an issue - R already uses numerics for larger-than-32-bit indexing at R level, and it works just fine for objects up to ca. 72 petabytes.
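
A small illustration of the numeric indexing described above; the
long-vector call is left commented out because it would allocate
about 16 GB:

x <- runif(10)
i <- 7                 # stored as a double, not an integer
is.integer(i)          # FALSE
x[i] == x[7L]          # TRUE: numeric subscripts address the same element
# length(numeric(2^31)) would be 2147483648, returned as a double;
# doubles are exact up to 2^53, and 2^53 * 8 bytes is ca. 72 petabytes.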
>
> However, the first one is a bit more relevant than one would think. At one point I experimented with allowing data frames with more than 2^31 rows, but it breaks in many places - some quite unexpected. Besides dim(), there is also the issue with (non-expanded) row names. Overall, it is a lot more work - some would have to be done in R, but some would require changes to packages as well.
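
The (non-expanded) row names mentioned above can be inspected from R;
this sketch shows the compact internal form, which is a 32-bit integer:

df <- data.frame(x = 1:5)
.row_names_info(df, type = 0L)  # c(NA, -5L): the compact representation
.row_names_info(df, type = 1L)  # -5: negative signals compact row names
typeof(dim(df))                 # "integer": dim() is 32-bit as well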
>
> (In practice I use sharded data frames for large data which removes the limit and allows parallel processing - but requires support from the methods that will be applied to them).
>
> Cheers,
> Simon
>
>
>
>> On Jul 2, 2024, at 16:04, Ivan Krylov via R-devel <r-devel using r-project.org> wrote:
>>
>> On Wed, 19 Jun 2024 09:52:20 +0200,
>> Jan van der Laan <rhelp using eoos.dds.nl> wrote:
>>
>>> What is the status of supporting long vectors in data.frames (e.g.
>>> data.frames with more than 2^31 records)? Is this something that is
>>> being worked on? Is there a time line for this? Is this something I
>>> can contribute to?
>>
>> Apologies if you've already received a better answer off-list.
>>
>> From my limited understanding, the problem with supporting
>> larger-than-(2^31 - 1) dimensions has multiple facets (a small
>> R-level illustration follows the list):
>>
>> - In many parts of R code, there's the assumption that dim() is
>> of integer type. That wouldn't be a problem by itself, except...
>>
>> - R currently lacks a native 64-bit integer type. About a year ago
>> Gabe Becker mentioned that Luke Tierney has been considering
>> improvements in this direction, but it's hard to introduce 64-bit
>> integers without making the user worry even more about data types
>> (numeric != integer != 64-bit integer) or introducing a lot of
>> overhead (64-bit integers being twice as large as 32-bit ones and,
>> depending on the workload, frequently redundant).
>>
>> - Two-dimensional objects eventually get transformed into matrices and
>> handed to LAPACK for linear algebra operations. Currently, the
>> interface used by R to talk to BLAS and LAPACK only supports 32-bit
>> signed integers for lengths. 64-bit BLASes and LAPACKs do exist
>> (e.g. OpenBLAS can be compiled with 64-bit lengths), but we haven't
>> taught R to use them.
>>
>> (This isn't limited to array dimensions, by the way. If you try to
>> svd() a 40000 by 40000 matrix, it'll try to ask for temporary memory
>> with length that overflows a signed 32-bit integer, get a much
>> shorter allocation instead, promptly overflow the buffer and
>> crash the process.)
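
The 32-bit limits listed above are easy to see from the R prompt (a
minimal sketch; the reference to the bit64 package as a user-level
workaround is my addition, not part of the thread):

.Machine$integer.max                    # 2147483647, i.e. 2^31 - 1
typeof(dim(matrix(0, 2, 2)))            # "integer": dim() is 32-bit
50000 * 50000 > .Machine$integer.max    # TRUE: overflows at matrix scale
as.integer(50000 * 50000)               # NA, with a coercion warning
# Doubles (exact up to 2^53) and e.g. the bit64 package are the usual
# workarounds for 64-bit integer values at R level today.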
>>
>> As you see, it's interconnected; work on one thing will involve the
>> other two.
>>
>> --
>> Best regards,
>> Ivan