[Rd] parallel PSOCK connection latency is greater on Linux?
Jeff
jeff at vtkellers.com
Mon Nov 2 14:28:55 CET 2020
Could TCP_NODELAY and TCP_QUICKACK be exposed to the R user so that
they might determine what is best for their potentially latency- or
throughput-sensitive application?
Best,
Jeff
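For context, both options boil down to plain setsockopt() calls on the connected descriptor, which is where R's C-level connection code would have to apply whatever the user chooses. A minimal sketch, assuming an already connected socket fd; the helper names are illustrative, not R's internals:

    #include <netinet/in.h>     /* IPPROTO_TCP */
    #include <netinet/tcp.h>    /* TCP_NODELAY, TCP_QUICKACK */
    #include <sys/socket.h>     /* setsockopt */

    /* Disable Nagle's algorithm: small writes go out immediately
       instead of being coalesced while earlier data is unacknowledged. */
    static int set_nodelay(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }

    #ifdef TCP_QUICKACK
    /* Ask the kernel to ACK incoming data immediately rather than
       delaying (Linux-specific; see the caveat further down). */
    static int set_quickack(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    }
    #endif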
On Mon, Nov 2, 2020 at 14:05, Iñaki Ucar <iucar using fedoraproject.org>
wrote:
> On Mon, 2 Nov 2020 at 02:22, Simon Urbanek
> <simon.urbanek using r-project.org> wrote:
>>
>> It looks like R sockets on Linux could do with TCP_NODELAY --
>> without (status quo):
>
> How many network packets are generated with and without it? If there
> are many small writes and thus setting TCP_NODELAY causes many small
> packets to be sent, it might make more sense to set TCP_QUICKACK
> instead.
>
> Iñaki
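One caveat with the TCP_QUICKACK suggestion: on Linux it is not a permanent setting. tcp(7) notes that it only switches quickack mode temporarily and the kernel can fall back to delayed ACKs on its own, so code that relies on it typically re-arms the flag around each read. A sketch under that assumption (fd, buf and the wrapper name are illustrative, not R's internals):

    #include <sys/types.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* TCP_QUICKACK only toggles quickack mode temporarily, so a common
       pattern is to set it again after every read from the socket. */
    static ssize_t read_quickack(int fd, void *buf, size_t len)
    {
        ssize_t n = read(fd, buf, len);
    #ifdef TCP_QUICKACK
        int one = 1;
        (void) setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    #endif
        return n;
    }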
>
>> Unit: microseconds
>>                    expr      min       lq     mean  median       uq      max neval
>>  clusterEvalQ(cl, iris) 1449.997 43991.99 43975.21 43997.1 44001.91 48027.83  1000
>>
>> exactly the same machine + R but with TCP_NODELAY enabled in
>> R_SockConnect():
>>
>> Unit: microseconds
>>                    expr     min     lq     mean  median      uq      max neval
>>  clusterEvalQ(cl, iris) 156.125 166.41 180.8806 170.247 174.298 5322.234  1000
>>
>> Cheers,
>> Simon
>>
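A hypothetical sketch of the shape of such a change -- not the actual R_SockConnect() patch, just where a TCP_NODELAY call would logically sit, with illustrative names:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Hypothetical shape of the change, not the real R_SockConnect() code:
       once the cluster socket is connected, disable Nagle so each small
       serialized message is written to the wire immediately instead of
       stalling behind the peer's delayed ACK. */
    static int connect_with_nodelay(int fd, const struct sockaddr *sa, socklen_t salen)
    {
        int one = 1;
        if (connect(fd, sa, salen) < 0)
            return -1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }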
>>
>> > On 2/11/2020, at 3:39 AM, Jeff <jeff using vtkellers.com> wrote:
>> >
>> > I'm exploring latency overhead of parallel PSOCK workers and
>> > noticed that serializing/unserializing data back to the main R
>> > session is significantly slower on Linux than it is on Windows/MacOS
>> > with similar hardware. Is there a reason for this difference and is
>> > there a way to avoid the apparent additional Linux overhead?
>> >
>> > I attempted to isolate the behavior with a test that simply
>> > returns an existing object from the worker back to the main R
>> > session.
>> >
>> > library(parallel)
>> > library(microbenchmark)
>> > gcinfo(TRUE)
>> > cl <- makeCluster(1)
>> > (x <- microbenchmark(clusterEvalQ(cl, iris), times = 1000, unit = "us"))
>> > plot(x$time, ylab = "microseconds")
>> > head(x$time, n = 10)
>> >
>> > On Windows/MacOS, the test runs in 300-500 microseconds depending
>> > on hardware. A few of the 1000 runs are an order of magnitude slower
>> > but this can probably be attributed to garbage collection on the
>> > worker.
>> >
>> > On Linux, the first 5 or so executions run at comparable speeds
>> > but all subsequent executions are two orders of magnitude slower
>> > (~40 milliseconds).
>> >
>> > I see this behavior across various platforms and hardware
>> > combinations:
>> >
>> > Ubuntu 18.04 (Intel Xeon Platinum 8259CL)
>> > Linux Mint 19.3 (AMD Ryzen 7 1800X)
>> > Linux Mint 20 (AMD Ryzen 7 3700X)
>> > Windows 10 (AMD Ryzen 7 4800H)
>> > MacOS 10.15.7 (Intel Core i7-8850H)
>> >
>
>
>
> --
> Iñaki Úcar