[Rd] write.csv performance improvements?

Gabriel Becker g@bembecker @end|ng |rom gm@||@com
Fri Mar 31 00:50:35 CEST 2023


Hi Toby et al,



On Wed, Mar 29, 2023 at 10:24 PM Toby Hocking <tdhock5 using gmail.com> wrote:

> Dear R-devel,
> I did a systematic comparison of write.csv with similar functions, and
> observed two asymptotic inefficiencies that could be improved.
>
> 1. write.csv is quadratic time (N^2) in the number of columns N.
> Can write.csv be improved to use a linear time algorithm, so it can handle
> CSV files with larger numbers of columns?
>

Yes, I think there is a narrow fix and a wider discussion to be had.

I've posted a discussion and the narrow fix at:
https://bugs.r-project.org/show_bug.cgi?id=18500

For "normal data", i.e., data that doesn't have classed object columns, the
narrow change I propose in the patch gives us the performance we might expect
(see the attached, admittedly very ugly, plots).
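
A minimal sketch of how one might check the column scaling on a given R
build. This is not the benchmark behind the plots; the helper name
time_write and the sizes used are illustrative assumptions.

```r
# Time write.csv on data frames with a growing number of plain numeric columns.
# time_write is a hypothetical helper; the sizes are arbitrary choices.
time_write <- function(ncols, nrows = 10L) {
  df <- as.data.frame(matrix(0, nrow = nrows, ncol = ncols))
  out <- tempfile(fileext = ".csv")
  on.exit(unlink(out))
  # "elapsed" is the wall-clock time in seconds for this single call
  system.time(write.csv(df, out, row.names = FALSE))[["elapsed"]]
}

ncols <- c(1000L, 2000L, 4000L, 8000L)
timings <- data.frame(
  ncols = ncols,
  seconds = vapply(ncols, time_write, numeric(1))
)
print(timings)
```

With linear scaling, doubling ncols should roughly double the time; with
quadratic scaling it should roughly quadruple it. Single runs are noisy, so
repeated measurements would be needed before drawing conclusions.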

The fact remains, though, that even with the patch, write.table is still
quadratic in the number of *object-classed* columns.

It doesn't seem like it should be, but I haven't (yet) had a chance to dig
deeper to attack that.  Might be a good subject for the R developer sprint,
if R-core agrees.
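
A rough sketch for comparing plain and classed columns, assuming Date columns
are a fair stand-in for "object-classed" columns (that choice is mine, not
taken from the patch). The helpers make_df and time_df are hypothetical.

```r
# Build a data frame whose columns are all plain numeric or all Date.
# Date is used here only as an example of a classed (is.object) column.
make_df <- function(ncols, classed, nrows = 10L) {
  col <- if (classed) Sys.Date() + seq_len(nrows) else as.numeric(seq_len(nrows))
  cols <- rep(list(col), ncols)
  names(cols) <- paste0("V", seq_len(ncols))
  as.data.frame(cols)
}

# Wall-clock seconds for one write.csv call on df.
time_df <- function(df) {
  out <- tempfile(fileext = ".csv")
  on.exit(unlink(out))
  system.time(write.csv(df, out, row.names = FALSE))[["elapsed"]]
}

ncols <- c(250L, 500L, 1000L, 2000L)
timings <- data.frame(
  ncols = ncols,
  plain_seconds = vapply(ncols, function(n) time_df(make_df(n, FALSE)), numeric(1)),
  date_seconds = vapply(ncols, function(n) time_df(make_df(n, TRUE)), numeric(1))
)
print(timings)
```

If the classed case is still quadratic, date_seconds should grow by roughly
four times for each doubling of ncols, while plain_seconds should grow more
slowly on a patched build.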

~G

> For more details including figures and session info, please see
> https://github.com/tdhock/atime/issues/9
>
> 2. write.csv uses memory that is linear in the number of rows, whereas
> similar R functions for writing CSV use only constant memory. This is not
> as important of an issue to fix, because anyway linear memory is used to
> store the data in R. But since the other functions use constant memory,
> could write.csv also? Is there some copying happening that could be
> avoided? (this memory measurement uses bench::mark, which in turn uses
> utils::Rprofmem)
> https://github.com/tdhock/atime/issues/10
>
> Sincerely,
> Toby Dylan Hocking
>

-------------- next part --------------
Non-text attachments scrubbed from the archive (all image/png):
csv_ncols_time_4.2.2.png (19358 bytes): <https://stat.ethz.ch/pipermail/r-devel/attachments/20230330/957ed575/attachment.png>
csv_nobjcols_time_4.2.2.png (21430 bytes): <https://stat.ethz.ch/pipermail/r-devel/attachments/20230330/957ed575/attachment-0001.png>
csv_ncols_time.png (20187 bytes): <https://stat.ethz.ch/pipermail/r-devel/attachments/20230330/957ed575/attachment-0002.png>
csv_nobjcols_time.png (21411 bytes): <https://stat.ethz.ch/pipermail/r-devel/attachments/20230330/957ed575/attachment-0003.png>
