[R-pkg-devel] Re-building vignettes had CPU time 9.2 times elapsed time

Simon Urbanek simon.urbanek at R-project.org
Sat Aug 26 02:05:06 CEST 2023



> On Aug 26, 2023, at 11:01 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
> 
> 
> On 25 August 2023 at 18:45, Duncan Murdoch wrote:
> | The real problem is that there are two stubborn groups opposing each 
> | other:  the data.table developers and the CRAN maintainers.  The former 
> | think users should by default dedicate their whole machine to 
> | data.table.  The latter think users should opt in to do that.
> 
> No, it feels more like it is CRAN versus the rest of the world.
> 


In reality it's more like people running R on their laptops vs the rest of the world. Although people with laptops are the vast majority, they are also the least impacted by the decision going either way. I think Jeff summed up the core reasoning pretty well: harm is done by excessive use, not the other way around.

That said, I think this thread is really missing the key point: there is no central mechanism that governs the use of CPU resources. OMP_THREAD_LIMIT is just one of many ways, and even that is vastly insufficient for reasons already discussed (e.g., recursive use of processes). It is not CRAN's responsibility to figure out for each package what it needs to behave sanely - it has no way of knowing what type of parallelism is used, under which circumstances, and how to control it. Only the package author knows that (hopefully), which is why it's on them. So instead of complaining here, a better use of time would be to look at what's actually being used in packages and come up with a unified approach to monitoring core usage and a mechanism by which packages could self-govern to respect the desired limits. If there was one canonical place, it would also be easy for users to opt in/out as they desire - and I'd be happy to help if any components of it need to be in core R.
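To make the self-governing idea concrete, here is a minimal sketch of how a package could resolve its thread count from one canonical place. The option name "mypkg.threads" and the fallback order are assumptions for illustration only, not an existing convention:

  ## Honour an explicit user setting first, then common environment
  ## variables set by admins or check systems, then a safe default.
  .resolve_threads <- function(default = 2L) {
      opt <- getOption("mypkg.threads", NA_integer_)   # hypothetical option
      if (!is.na(opt)) return(as.integer(opt))
      for (var in c("OMP_THREAD_LIMIT", "OMP_NUM_THREADS")) {
          val <- suppressWarnings(as.integer(Sys.getenv(var, "")))
          if (!is.na(val) && val > 0L) return(val)
      }
      default   # conservative fallback when nothing is set
  }

A package would call this once (e.g. in .onLoad) instead of each parallel entry point inventing its own rules; that is the "one canonical place" that users could then opt in/out of.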



> Take but one example, and as I may have mentioned elsewhere, my day job consists of providing software so that (to take one recent example) bioinformatics specialists can slice huge amounts of genomics data.  When that happens on dedicated (expensive) hardware with dozens of cores, it would be wasteful to have an unconditional default of two threads. It would be the end of R among serious people, no more, no less. Can you imagine how the internet headlines would go: "R defaults to two threads". 
> 

If you run on such a machine then you or your admin certainly know how to set the desired limits. From experience, the problem is exactly the opposite - it's far more common for users not to know how to avoid overloading such a machine. As for internet headlines, they will always be saying blatantly false things like "R is not for large data" even though we have been using it to analyze terabytes of data per minute ...
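For users who do want to cap parallelism themselves, one way (of several) is to set the limits once at startup, e.g. in .Rprofile - the cap of 8 below is just an arbitrary example, and the OpenMP environment variables generally have to be set before the OpenMP runtime is initialized:

  Sys.setenv(OMP_NUM_THREADS = "8")    # raw OpenMP code
  Sys.setenv(OMP_THREAD_LIMIT = "8")   # hard OpenMP ceiling
  options(mc.cores = 8L)               # default for the parallel package
  if (requireNamespace("data.table", quietly = TRUE))
      data.table::setDTthreads(8L)     # data.table's own setter

Note how each mechanism needs its own call - exactly the fragmentation that a unified approach would remove.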

Cheers,
Simon



> And it is not just data.table: even in the long thread over in its repo we have people chiming in who use OpenMP in their own code (as data.table does, but which needs a different setter than data.table's thread count).
> 
> It is the CRAN servers which (rightly !!) want to impose constraints for when packages are tested.  Nobody objects to that.
> 
> But some of us wonder if setting these defaults for all R users, all the time, unconditionally is really the right thing to do.  Anyway, Uwe told me he will take it to an internal discussion, so let's hope sanity prevails.
> 


