[Rd] R Bug: write.table for matrix of more than 2,147,483,648 elements

Thu Apr 19 12:15:46 CEST 2018

On 04/19/2018 11:47 AM, Serguei Sokol wrote:
> On 19/04/2018 at 09:30, Tomas Kalibera wrote:
>> On 04/19/2018 02:06 AM, Duncan Murdoch wrote:
>>> On 18/04/2018 5:08 PM, Tousey, Colton wrote:
>>>> Hello,
>>>>
>>>> I want to report a bug in R that is limiting my capabilities to 
>>>> export a matrix with write.csv or write.table with over 
>>>> 2,147,483,648 elements (C's int limit). I found that this bug has 
>>>> already been reported: 
>>>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17182. However, 
>>>> there appears to be no solution or fix in upcoming R version 
>>>> releases.
>>>>
>>>> The error message is coming from the writetable part of the utils 
>>>> package in the io.c source 
>>>> code(https://svn.r-project.org/R/trunk/src/library/utils/src/io.c):
>>>> /* quick integrity check */
>>>>                  if(XLENGTH(x) != (R_len_t)nr * nc)
>>>>                      error(_("corrupt matrix -- dims not not match length"));
>>>>
>>>> The issue is that nr*nc is an integer and the size of my matrix, 
>>>> 2.8 billion elements, exceeds C's limit, so the check forces the 
>>>> code to fail.
>>>
>>> Yes, looks like a typo:  R_len_t is an int, and that's how nr was 
>>> declared.  It should be R_xlen_t, which is bigger on machines that 
>>> support big vectors.
>>>
>>> I haven't tested the change; there may be something else in that 
>>> function that assumes short vectors.
>> Indeed, I think the function won't work for long vectors because of 
>> EncodeElement2 and EncodeElement0. EncodeElement2/0 would have to be 
>> changed, including their signatures.
>
> That would be a definitive fix, but before such a deep rewrite is 
> undertaken, maybe the following small fix (in addition to "(R_xlen_t)nr * 
> nc") would be sufficient for cases where nr and nc are in int range but 
> their product can reach the long vector limit:
>
> replace
>     tmp = EncodeElement2(x, i + j*nr, quote_col[j], qmethod,
>                     &strBuf, sdec);
> by
>     tmp = EncodeElement2(VECTOR_ELT(x, (R_xlen_t)i + j*nr), 0,
>                     quote_col[j], qmethod, &strBuf, sdec);

Unfortunately we can't do that: x is a matrix of an atomic vector type. 
VECTOR_ELT extracts elements of a generic vector, so it cannot be 
applied to "x". But even if we extracted a single element from "x" (e.g. 
via a type switch), we would not be able to pass it to EncodeElement0, 
which expects a full atomic vector (that is, including its header). 
Instead we would have to call functions like EncodeInteger, EncodeReal0, 
etc. on the individual elements, which amounts to changing 
EncodeElement0 or implementing a new version of it. This does not seem 
that hard to fix; it is just not as trivial as changing the cast.

Tomas
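
The per-element approach described above (a type switch that formats one element at a wide index, instead of handing a whole vector to an encoder) might look roughly like the following. This is only a self-contained sketch, not R's code: atomic_vec, elt_type, xlen_t and encode_at are hypothetical stand-ins invented for illustration, and snprintf takes the place of R's EncodeInteger/EncodeReal0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for R_xlen_t; assumed to be 64-bit here. */
typedef int64_t xlen_t;

/* Hypothetical type tag, standing in for R's atomic vector types. */
typedef enum { ELT_INT, ELT_REAL } elt_type;

/* Hypothetical atomic vector: a type tag plus a typed data pointer. */
typedef struct {
    elt_type type;
    const void *data;   /* int[] for ELT_INT, double[] for ELT_REAL */
} atomic_vec;

/* Format the element at wide index idx into buf.
 * Returns the number of characters snprintf would write,
 * or -1 if the type tag is not recognised. */
static int encode_at(const atomic_vec *v, xlen_t idx,
                     char *buf, size_t bufsize)
{
    assert(idx >= 0);
    switch (v->type) {
    case ELT_INT:
        return snprintf(buf, bufsize, "%d",
                        ((const int *)v->data)[idx]);
    case ELT_REAL:
        return snprintf(buf, bufsize, "%.15g",
                        ((const double *)v->data)[idx]);
    }
    return -1;
}
```

Because the index parameter is already wide, the caller never has to narrow an offset to int before the element is formatted.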

>
> Serguei
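
The integer arithmetic behind the thread can be sketched in standalone C. In an expression such as (R_xlen_t)nr * nc the cast binds to nr, so the multiplication is carried out in the wide type; in (R_xlen_t)i + j*nr, by contrast, j*nr is still an int-by-int product and can overflow before it is widened. The sketch below widens one operand of each product first. The xlen_t typedef is a stand-in for R_xlen_t (assumed 64-bit), the function names are hypothetical, and the 70000 x 40000 dimensions mentioned afterwards are an invented example that gives 2.8 billion elements.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Stand-in for R_xlen_t; assumed to be 64-bit here. */
typedef int64_t xlen_t;

/* Element count of an nr x nc matrix.
 * nr is widened BEFORE the multiplication, so the product
 * is computed in 64 bits and cannot overflow int. */
static xlen_t matrix_length(int nr, int nc)
{
    assert(nr >= 0 && nc >= 0);
    return (xlen_t)nr * nc;
}

/* Column-major offset of element (i, j).
 * j is widened before j * nr; writing (xlen_t)i + j*nr
 * would leave the product in int arithmetic. */
static xlen_t matrix_offset(int i, int j, int nr)
{
    assert(i >= 0 && j >= 0 && nr >= 0);
    return i + (xlen_t)j * nr;
}

/* Returns 1 when the element count does not fit in an int. */
static int exceeds_int(int nr, int nc)
{
    return matrix_length(nr, nc) > INT_MAX;
}
```

For example, matrix_length(70000, 40000) is 2800000000, which is above INT_MAX, and matrix_offset(69999, 39999, 70000) is 2799999999, the offset of the last element of such a matrix.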