[Rd] Speeding up build-from-source

Simon Urbanek simon.urbanek at r-project.org
Sun Apr 28 00:50:17 CEST 2013


On Apr 27, 2013, at 11:34 AM, Adam Seering wrote:

> 
> 
> On 04/27/2013 09:10 AM, Martin Morgan wrote:
>> On 04/26/2013 07:50 AM, Adam Seering wrote:
>>> Hi,
>>>     I've been playing around with the R source code a little; mostly
>>> just
>>> trying to familiarize myself.  I have access to some computers on a
>>> reservation
>>> system; so I've been reserving a computer, downloading and compiling
>>> R, and
>>> going from there.
>>> 
>>>     I'm finding that R takes a long time to build, though.  (Well,
>>> ok, maybe 5
>>> minutes -- I'm impatient :-) )  Most of that time, it's sitting there
>>> byte-compiling some internal package or another, which uses just one
>>> CPU core so
>>> leaves the system mostly idle.
>>> 
>>>     I'm just curious if anyone has thought about parallelizing that
>>> process?
>> 
>> Hi Adam -- parallel builds are supported by adding the '-j' flag when
>> you invoke make
>> 
>>   make -j
>> 
>> The packages are being built in parallel, in as much as this is possible
>> by their dependency structure. Also, you can configure without byte
>> compilation, see ~/src/R-devel/configure --help to make this part of the
>> build go more quickly. And after an initial build subsets of R, e.g.,
>> just the 'main' source or a single package like 'stats', can be built
>> with (assuming R's source, e.g., from svn, is in ~/src/R-devel, and
>> you're building R in ~/bin/R-devel):
>> 
>>   cd ~/bin/R-devel/src/main
>>   make -j
>>   cd ~/bin/R-devel/src/library/stats
>>   make -j
>> 
>> The definitive source for answers to questions like these is
>> 
>>   > RShowDoc("R-admin")
>> 
>> Martin
> 
> Hi Martin,
> 	Thanks for the reply -- but I'm afraid the question you've answered isn't the question that I intended to ask.
> 
> 	Based on your response, I think the answer to my question is likely "no."  But let me try rephrasing anyway, just in case:
> 
> 	I'm certainly quite aware of "-j" as a make argument; if I weren't, the bottleneck would not be the byte-compilation, and the build would take rather more than 5 minutes :-)  That was the very first thing I tried. I don't believe that parallel make is as parallel as it theoretically could be.  (In fact, I see almost no parallelism between libraries on my system; individual .c files are parallelized nicely but only one library at a time.  This mostly matters at the compiling-bytecode step, since that's the biggest serial operation per library.)  My question is, has anyone thought about what it would take to parallelize the build further?
> 

I think you may have failed to notice that installation of packages *is* parallelized. The *output* is shown only en bloc, to avoid mixing the outputs of the parallel installations. But there are dependencies among packages, so those that require most of the others have to be built last -- nonetheless, in current R you can install the 9 recommended packages in parallel.
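A toy illustration of the mechanism (hypothetical make targets, not the real src/library/Makefile rules): once inter-package dependencies are written as make prerequisites, `make -j` runs the independent targets concurrently and serializes only where a prerequisite forces it -- here 'Matrix' waits for 'lattice', while 'MASS' and 'lattice' can build side by side:

```shell
# Sketch only: stand-in "packages" that just touch marker files.
rm -f /tmp/MASS.done /tmp/lattice.done /tmp/Matrix.done
cat > /tmp/pkg-demo.mk <<'EOF'
all: MASS lattice Matrix

# No prerequisites: these two may run in parallel under -j
MASS: ; @touch /tmp/MASS.done
lattice: ; @touch /tmp/lattice.done

# Prerequisite forces ordering: Matrix starts only after lattice finishes
Matrix: lattice ; @touch /tmp/Matrix.done
EOF
make -j -f /tmp/pkg-demo.mk
```

The real install loop works the same way in spirit: the scheduler extracts as much parallelism as the dependency graph allows, which is why a package depended on by most others gates the rest.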


> 	I'm not sure that this can be done with just the makefiles.  But the following comment makes me at least a little suspicious:
> 
> """ src/library/Makefile
> ## FIXME: do some of this in parallel?
> """
> 
> 	Surely some of the 'for' loops there could be unwound into proper make targets with dependency information?  I'm not sure if the dependency information would effectively force a serial compilation anyway, though?...
> 
> 	Another approach, if the above is hard for some reason:  What I'm seeing is that the byte compilation is largely serial; but as you note, byte-compilation is optional.  Could the makefiles just defer it? Skip it up front, and then do all the byte-compilations for all of the packages concurrently?

The problem is, again, dependencies - you cannot defer the byte-compilation, since it would change a package *after* it has already been used by another package, which can cause inconsistencies (note that lazy loading is a red herring - it's used regardless of compilation). That said, you won't save a significant amount of time anyway (did you actually profile the time, or are you relying on your eyes to deceive you? ;)), so it's not worth the bother (try enabling LTO ;)).

Personally, I simply disable package byte-compilation for all development builds. You won't notice the difference for testing anyway. Moreover, you'll rarely be doing full builds repeatedly, so the 4 minutes it takes is certainly nothing compared to other projects of this size... It becomes more fun when you start building all CRAN packages ;).
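For the record, that setup looks roughly like the following (flag name as listed by `./configure --help`; verify against the R-admin manual for your checkout before relying on it):

```shell
# Development build of R without byte-compiling the base and
# recommended packages (the serial step discussed above):
./configure --disable-byte-compiled-packages
make -j8    # parallel build; package installs also run in parallel
```

The resulting R behaves identically for testing purposes; only the bytecode for base packages is absent, which matters mainly for benchmarking.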

Cheers,
Simon


>  From a very cursory read of the code, it looks like the relevant code is in src/library/tools/R/makeLazyLoad.R, and that file doesn't immediately look like it's doing anything that fundamentally couldn't be parallelized (i.e., running multiple R processes at once, one per library; at a glance the logic looks nicely per-library).
> 
> 	A third approach could be to try to parallelize the logic in makeLazyLoad.R.  I would expect that to be at best much more difficult, though.
> 
> 	Anyway, there are lots of things that look like they could in theory be done here.  And I know just enough at this point to be dangerous; not enough to contribute :-)  Hence my asking: has anyone thought about this?  If not, I assume the best thing for me to do would be to poke at it; try to figure out on my own how this works and what's most feasible.  But if anyone has any pointers, that would likely save me a bunch of time.  And if this is something that you prefer to keep serial for some reason, that would be good to know too, so I don't spend time on it.
> 
> Thanks,
> Adam
> 
> 
>> 
>> 
>>> 
>>> Thanks,
>>> Adam
>>> 
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>> 
> 
> 
> 


