[R-pkg-devel] Re-building vignettes had CPU time 9.2 times elapsed time
Greg Hunt
greg @end|ng |rom ||rm@n@y@h@com
Sun Aug 27 02:18:54 CEST 2023
Tim,
I think that things like data.table have a different set of problems
depending on the environment. Working out the right degree of
parallelism for an IO workload is a hard question that depends on the
characteristics of the IO subsystem, the characteristics of the dataset and
on what problem you really have (in practice, how much it's worth spending to
achieve an optimal answer). It would be interesting to see how well
data.table would do with several tens of threads on several tens of
processors reading a file. I suspect it might not be pretty (coordination
overheads could be large relative to the actual gains from IO
parallelism), but it's not a subject I've looked at. It would not
surprise me if the right answer was to cap the number of threads, but that
cap would probably still be higher than the usual number of processors in
the average physical or virtual box. This stuff is not easy and it's
saturated with "it depends" answers. The underlying problem here is that
to get optimal or optimal-enough behaviour, a 96-way or larger box
will require a different software configuration from an 8 or 16-way VM.
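
To make the shape of that concrete, here is a minimal sketch, in Python
purely for illustration. It is not data.table's actual logic: the function
name pick_thread_count, its parameter names and the cap of 32 are all
assumptions made up for the example, and the 50% fraction is just the
default figure quoted further down this thread.

```python
import math
from typing import Optional


def pick_thread_count(logical_cpus: int,
                      fraction: float = 0.5,
                      cap: Optional[int] = None) -> int:
    """Hypothetical heuristic: use a fraction of the CPUs, optionally capped."""
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Take the requested share of the machine, but never fewer than 1 thread.
    threads = max(1, math.floor(logical_cpus * fraction))
    # An optional hard cap only bites on large boxes.
    if cap is not None:
        threads = min(threads, max(1, cap))
    return threads


if __name__ == "__main__":
    # Compare an 8-way VM, a 16-way VM and a 96-way box,
    # without a cap and with an illustrative cap of 32.
    for cpus in (8, 16, 96):
        print(cpus, pick_thread_count(cpus), pick_thread_count(cpus, cap=32))
```

With these made-up numbers the cap changes nothing on the 8 and 16-way
machines and only takes effect on the 96-way one, which is the sense in
which a sensible cap could still sit above the size of the average box.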
Greg
On Sat, 26 Aug 2023 at 18:15, Tim Taylor <tim.taylor using hiddenelephants.co.uk>
wrote:
> I’m definitely sympathetic to both sides but have come around to the view
> of Greg, Dirk et al. It seems sensible to have a default that benefits the
> majority of “normal” users and require explicit action in shared
> environments not vice-versa.
>
> That is not to say that data.table could not do better with its
> heuristics (e.g. respecting CGroups settings as raised by Henrik in
> https://github.com/Rdatatable/data.table/issues/5620) but the current
> defaults (50%) seem reasonable for, dare I say, most users.
>
> Tim
>
> On 26 Aug 2023, at 03:20, Greg Hunt <greg using firmansyah.com> wrote:
>
> The question should be, in how many cases is the current behaviour a
> problem? In a shared environment, sure, you have to be more careful. I'd
> say don't let the teenagers in there. The CRAN build server does need to do
> something to protect itself and I don't greatly mind the 2 thread limit; I
> implemented it by hand in my examples and didn't think about it
> afterwards. On most 8, 16 or 32 way environments, dedicated or
> semi-dedicated to a particular workload, the defaults make some level of
> sense and they are probably most of the use cases. Protecting high
> processor count environments from people who don't know what they are doing
> would seem to be a mismatch between the people and the environment, not so
> much a matter of software.
>
> On Sat, 26 Aug 2023 at 11:49, Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
> wrote:
>
> You have a really bizarre way of twisting what others are saying, Dirk. I
> have seen no-one here saying 'limit R to 2 threads' except for you, as a
> way to paint opposing views to be absurd.
>
> What _is_ being said is that _users need to be in control_, but _the
> default needs to do least harm_ until those users take responsibility for
> that control. Do not turn the throttle up until the user is prepared for
> the consequences. Trying to subvert that responsibility into packages by
> default is going to make more trouble than giving the people using those
> packages simple examples of how to take that control.
>
> A similar problem happens when users discover .Rprofile and insert all
> those pesky library statements into it, making their scripts
> irreproducible. If data.table made a warp10() function that activated this
> current default performance setting then the user would be clearly at fault
> for using it in an inappropriate environment like a shared HPC or the CRAN
> servers. Don't put a brick on the accelerator of a teenager's car before
> they even figure out where the brakes are.
>
> On August 25, 2023 6:17:04 PM PDT, Dirk Eddelbuettel <edd using debian.org>
> wrote:
>
> On 26 August 2023 at 12:05, Simon Urbanek wrote:
> | In reality it's more people running R on their laptops vs the rest of the world.
>
> My point was that we also have 'single user on really Yuge workstation'.
>
> Plus we all know that those users are often not sysadmins, and do not have
> our levels of accumulated systems knowledge.
>
> So we should give _more_ power by default, not less.
>
> | [...] they will always be saying blatantly false things like "R is not for large data"
>
> By limiting R (and/or packages) to two threads we will only get more of
> these. Our collective call.
>
> This whole thread is pretty sad, actually.
>
> Dirk
>
> --
> Sent from my phone. Please excuse my brevity.
>
> ______________________________________________
> R-package-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel