[Rd] Large vector support in data.frames

@vi@e@gross m@iii@g oii gm@ii@com
Thu Jul 4 15:35:15 CEST 2024


Unfortunately, as has been noted, some changes require many parties to change at once, and they can cause serious problems as soon as execution reaches a part that has not been updated. If integers have a fixed size, an implementation can be straightforward, and it can reuse existing, well-tested libraries and components, including ones written in languages like C.

Python is an example of a language that went another way: its built-in integer type has arbitrary precision. That can hurt efficiency, however, so the commonly used extensions that provide their own versions of a data frame often let you specify one of several fixed-length integer types instead.
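
As a small illustration of that trade-off, here is a minimal Python sketch that contrasts the built-in arbitrary-precision integer with a fixed 32-bit signed integer. It uses only the standard library module struct. The helper name fits_int32 is made up for this example, and 32 bits is chosen only because that is the width discussed in the quoted messages.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


def fits_int32(n):
    """Return True if n can be stored as a 32-bit signed integer.

    Hypothetical helper for illustration only.
    """
    try:
        struct.pack("<i", n)  # "<i" = little-endian 32-bit signed int
        return True
    except (struct.error, OverflowError):
        return False


# The built-in int simply grows as needed; no overflow occurs here.
big = INT32_MAX + 1
print(big)                    # 2147483648
print(fits_int32(INT32_MAX))  # True
print(fits_int32(big))        # False
```

The fixed-width representation is compact and maps directly onto C types, but anything past the limit has to be rejected or handled some other way.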

-----Original Message-----
From: R-devel <r-devel-bounces using r-project.org> On Behalf Of Jan van der Laan
Sent: Thursday, July 4, 2024 2:38 AM
To: r-devel using r-project.org
Subject: Re: [Rd] Large vector support in data.frames

Ivan, Simon,

Thanks for the replies.

I can work around the limitation. I currently either divide the data 
into shards or use a list with (long) vectors depending on what I am 
trying to do. But I have to transform between the two representations 
which takes time and memory and often needs more code than I would have 
if I could have used data.frames.
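
To make the sharding idea concrete, here is a minimal Python sketch of how a global row number can be mapped to a shard and a position inside that shard, so that no single index ever has to exceed the 32-bit limit. The function name locate and the shard size are assumptions chosen for illustration; this is not how any particular package does it.

```python
def locate(row, shard_size):
    """Map a zero-based global row index to (shard number, offset in shard).

    Hypothetical helper for illustration only.
    """
    if row < 0:
        raise IndexError("row index must be non-negative")
    return divmod(row, shard_size)


SHARD_SIZE = 10**9  # assumed shard size, well below 2**31 - 1

# Global row 2**31 does not fit in a 32-bit signed integer,
# but both the shard number and the offset do.
shard, offset = locate(2**31, SHARD_SIZE)
print(shard, offset)  # 2 147483648
```

The price is that every operation has to know about the shard layout, which is the extra code mentioned above.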

Being able to create large (> 2^31-1 rows) data.frames and do some 
basic things, like selecting rows and columns, would already be really 
nice. That would also allow package maintainers to start supporting 
these data.frames. I imagine getting large data.frames working in 
functions like lm is not trivial, and lm might not support this any time 
soon. However, a package like biglm might.

But from what you are saying, I get the impression that this is not 
something that is being actively worked on. I must say, my hands are kind 
of itching to try.

Best,
Jan



On 03-07-2024 09:22, Simon Urbanek wrote:
> The second point is not really an issue - R already uses numerics for larger-than-32-bit indexing at R level and it works just fine for objects up to ca. 72 petabytes.
> 
> However, the first one is a bit more relevant than one would think. At one point I experimented with allowing data frames with more than 2^31 rows, but it breaks in many places - some quite unexpected. Besides dim() there is also the issue with (non-expanded) row names. Overall, it is a lot more work - some would have to be done in R but some would require changes to packages as well.
> 
> (In practice I use sharded data frames for large data which removes the limit and allows parallel processing - but requires support from the methods that will be applied to them).
> 
> Cheers,
> Simon
> 
> 
> 
>> On Jul 2, 2024, at 16:04, Ivan Krylov via R-devel <r-devel using r-project.org> wrote:
>>
>> On Wed, 19 Jun 2024 09:52:20 +0200
>> Jan van der Laan <rhelp using eoos.dds.nl> wrote:
>>
>>> What is the status of supporting long vectors in data.frames (e.g.
>>> data.frames with more than 2^31 records)? Is this something that is
>>> being worked on? Is there a time line for this? Is this something I
>>> can contribute to?
>>
>> Apologies if you've already received a better answer off-list.
>>
>> From my limited understanding, the problem with supporting
>> larger-than-(2^31-1) dimensions has multiple facets:
>>
>> - In many parts of R code, there's the assumption that dim() is
>>    of integer type. That wouldn't be a problem by itself, except...
>>
>> - R currently lacks a native 64-bit integer type. About a year ago
>>    Gabe Becker mentioned that Luke Tierney has been considering
>>    improvements in this direction, but it's hard to introduce 64-bit
>>    integers without making the user worry even more about data types
>>    (numeric != integer != 64-bit integer) or introducing a lot of
>>    overhead (64-bit integers being twice as large as 32-bit ones and,
>>    depending on the workload, frequently redundant).
>>
>> - Two-dimensional objects eventually get transformed into matrices and
>>    handed to LAPACK for linear algebra operations. Currently, the
>>    interface used by R to talk to BLAS and LAPACK only supports 32-bit
>>    signed integers for lengths. 64-bit BLASes and LAPACKs do exist
>>    (e.g. OpenBLAS can be compiled with 64-bit lengths), but we haven't
>>    taught R to use them.
>>
>>    (This isn't limited to array dimensions, by the way. If you try to
>>    svd() a 40000 by 40000 matrix, it'll try to ask for temporary memory
>>    with length that overflows a signed 32-bit integer, get a much
>>    shorter allocation instead, promptly overflow the buffer and
>>    crash the process.)
>>
>> As you see, it's interconnected; work on one thing will involve the
>> other two.
>>
>> -- 
>> Best regards,
>> Ivan
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
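
The numbers in the quoted replies can be checked with a little arithmetic. The following Python sketch is only an illustration under stated assumptions. Doubles represent every integer exactly only up to 2^53, and 2^53 eight-byte elements come to about 72 petabytes; whether that is how the quoted figure was derived is an assumption. The workspace factor in the svd() part is purely hypothetical and is not taken from LAPACK; it only shows how a length that overflows a signed 32-bit integer can wrap to a much smaller positive number.

```python
import ctypes

INT32_MAX = 2**31 - 1

# Doubles hold every integer exactly only up to 2**53:
# adding 1 to 2**53 as a double no longer changes the value.
exact_limit = 2**53
print(float(exact_limit) + 1 == float(exact_limit))  # True

# Assumed derivation: 2**53 elements of 8 bytes each, in decimal petabytes.
print(exact_limit * 8 / 10**15)  # roughly 72.06

# A 40000 x 40000 matrix still has fewer than 2**31 - 1 elements ...
n = 40000
print(n * n <= INT32_MAX)  # True

# ... but a temporary buffer a few times that size does not.
WORK_FACTOR = 3  # hypothetical multiplier, NOT the real LAPACK formula
requested = WORK_FACTOR * n * n
wrapped = ctypes.c_int32(requested).value  # wraps like a C 32-bit int
print(requested, wrapped)  # 4800000000 505032704
```
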



