[Rd] parallel PSOCK connection latency is greater on Linux?

Jeff jeff using vtkellers.com
Tue Nov 2 02:45:28 CET 2021


Hi Gabriel,

Yes, 40 milliseconds (ms) == 40,000 microseconds (us). My benchmarking 
output is reporting the latter, and it is considerably higher than the 
40us you are seeing. If I benchmark just the serialization round trip 
as you did, I get comparable results: a 14us median on my Linux system. 
So at least on Linux, something else must account for the remaining 
39,986us. The conclusion from earlier in this thread was that the 
culprit is TCP behavior specific to the Linux network stack.
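
For anyone who wants to see the two numbers side by side on their own 
machine, something like the following should work (a sketch using the 
same parallel and microbenchmark packages as elsewhere in this thread; 
on Linux the PSOCK round trip should show the ~40ms floor, while on 
Windows/macOS it should stay in the hundreds of microseconds):

library(parallel)
library(microbenchmark)
cl <- makeCluster(1)
microbenchmark(
  serialize_only  = unserialize(serialize(iris, connection = NULL)),
  psock_roundtrip = clusterEvalQ(cl, iris),
  times = 100, unit = "us"
)
stopCluster(cl)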

Jeff

On Mon, Nov 1 2021 at 05:55:45 PM -0700, Gabriel Becker 
<gabembecker using gmail.com> wrote:
> Jeff,
> 
> Perhaps I'm just missing something here, but ms is generally 
> milliseconds, not microseconds (which are much smaller), right?
> 
> Also, this seems to just be how long it takes to round-trip serialize 
> iris (in R 4.1.0 on macOS, as that's what I have handy right this 
> moment):
> 
>> > microbenchmark({x <- unserialize(serialize(iris, connection = NULL))})
>> Unit: microseconds
>>                                                        expr    min      lq     mean  median     uq   max neval
>>  {    x <- unserialize(serialize(iris, connection = NULL)) } 35.378 36.0085 40.26888 36.4345 43.641 80.39   100
>> 
> 
>> > res <- system.time(replicate(10000, {x <- unserialize(serialize(iris, connection = NULL))}))
>> > res/10000
>>    user  system elapsed 
>> 4.58e-05 2.90e-06 4.88e-05
>> 
> 
> Thus the overhead appears to be extremely minimal in your results 
> above, right? In fact it seems comparable to, or lower than, the 
> per-iteration figure from the replicate() timing.
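> 
> For context, the per-iteration overhead of replicate() with a trivial 
> body gives a rough floor for that second measurement (a sketch; the 
> numbers will of course vary by machine):
> 
> res0 <- system.time(replicate(10000, NULL))
> res0/10000  # per-iteration cost of replicate() with a trivial body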
> 
> ~G
> 
> 
> 
> 
> 
> On Mon, Nov 1, 2021 at 5:20 PM Jeff Keller <jeff using vtkellers.com> wrote:
>> Hi Simon,
>> 
>>  I see there may have been some changes to address the TCP_NODELAY 
>> issue on Linux in 
>> <https://github.com/wch/r-source/commit/82369f73fc297981e64cac8c9a696d05116f0797>.
>> 
>>  I gave this a try with R 4.1.1, but I still see the ~40ms latency 
>> floor. Am I misunderstanding these changes, or how socketOptions is 
>> intended to be used?
>> 
>>  -Jeff
>> 
>>  library(parallel)
>>  library(microbenchmark)
>>  options(socketOptions = "no-delay")
>>  cl <- makeCluster(1)
>>  (x <- microbenchmark(clusterEvalQ(cl, iris), times = 100, unit = "us"))
>>  # Unit: microseconds
>>  #                    expr  min       lq     mean   median       uq     max neval
>>  #  clusterEvalQ(cl, iris) 96.9 43986.73 40535.93 43999.59 44012.79 48046.6   100
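>> 
>>  One way to check whether the floor is a fixed per-message delay rather 
>>  than something proportional to the payload is to vary the size of the 
>>  returned object (a sketch reusing the cluster created above; if delayed 
>>  ACKs are the cause, the median should sit near 40ms regardless of size):
>> 
>>  microbenchmark(
>>    tiny  = clusterEvalQ(cl, 1L),            # near-empty result
>>    iris  = clusterEvalQ(cl, iris),          # the object used above
>>    large = clusterEvalQ(cl, seq_len(1e5)),  # ~400 kB of integers
>>    times = 100, unit = "us"
>>  )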
>> 
>>  > On 11/04/2020 5:41 AM Iñaki Ucar <iucar using fedoraproject.org> wrote:
>>  >
>>  >
>>  > Please, check a tcpdump session on localhost while running the 
>> following script:
>>  >
>>  > library(parallel)
>>  > library(tictoc)
>>  > cl <- makeCluster(1)
>>  > Sys.sleep(1)
>>  >
>>  > for (i in 1:10) {
>>  >   tic()
>>  >   x <- clusterEvalQ(cl, iris)
>>  >   toc()
>>  > }
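>>  >
>>  > (One way to drive the capture and the script from a single R session, 
>>  > as a rough sketch: it assumes Linux, a tcpdump binary on the PATH, and 
>>  > sufficient privileges to capture on the loopback interface.)
>>  >
>>  > library(parallel)
>>  > library(tictoc)
>>  > cl <- makeCluster(1, port = 11999)  # pin the port so the filter matches
>>  > system2("tcpdump", c("-l", "-nn", "-i", "lo", "tcp", "port", "11999"),
>>  >         stdout = "psock_capture.txt", wait = FALSE)
>>  > Sys.sleep(1)
>>  > for (i in 1:10) { tic(); x <- clusterEvalQ(cl, iris); toc() }
>>  > Sys.sleep(1)
>>  > system("pkill -f 'tcpdump -l -nn -i lo'")  # crude, but stops the capture
>>  > stopCluster(cl)
>>  > # psock_capture.txt then contains the per-call packet trace discussed below.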
>>  >
>>  > The initialization phase comprises 7 packets. Then, the 1-second 
>> sleep
>>  > will help you see where the evaluation starts. Each clusterEvalQ
>>  > generates 6 packets:
>>  >
>>  > 1. main -> worker PSH, ACK 1026 bytes
>>  > 2. worker -> main ACK 66 bytes
>>  > 3. worker -> main PSH, ACK 3758 bytes
>>  > 4. main -> worker ACK 66 bytes
>>  > 5. worker -> main PSH, ACK 2484 bytes
>>  > 6. main -> worker ACK 66 bytes
>>  >
>>  > The first two are the command and its ACK, the following are the 
>> data
>>  > back and their ACKs. In the first 4-5 iterations, I see no delay 
>> at
>>  > all. Then, in the following iterations, a 40 ms delay starts to 
>> happen
>>  > between packets 3 and 4, that is: the main process delays the ACK 
>> to
>>  > the first packet of the incoming result.
>>  >
>>  > So I'd say Nagle is hardly to blame for this. It would be 
>> interesting
>>  > to see how many packets are generated with TCP_NODELAY on. If 
>> there
>>  > are still 6 packets, then we are fine. If we suddenly see a 
>> gazillion
>>  > packets, then TCP_NODELAY does more harm than good. On the other 
>> hand,
>>  > TCP_QUICKACK would surely solve the issue without any drawback. As
>>  > Nagle himself put it once, "set TCP_QUICKACK. If you find a case 
>> where
>>  > that makes things worse, let me know."
>>  >
>>  > Iñaki
>>  >
>>  > On Wed, 4 Nov 2020 at 04:34, Simon Urbanek <simon.urbanek using r-project.org> wrote:
>>  > >
>>  > > I'm not sure the user would know ;). This is a very 
>> system-specific issue just because the Linux network stack behaves 
>> so differently from other OSes (for purely historical reasons). That 
>> makes it hard to abstract as a "feature" for the R sockets that are 
>> supposed to be platform-independent. At least TCP_NODELAY is 
>> actually part of POSIX so it is on better footing, and disabling 
>> delayed ACK is practically only useful to work around the other side 
>> having Nagle on, so I would expect it to be rarely used.
>>  > >
>>  > > This is essentially an RFC since we don't have a mechanism for 
>> socket options (well, almost, there is timeout and blocking 
>> already...) and I don't think we want to expose low-level details so 
>> perhaps one idea would be to add something like delay=NA to 
>> socketConnection() in order to not touch (NA), enable (TRUE) or 
>> disable (FALSE) TCP_NODELAY. I wonder if there is any other way we 
>> could infer the intention of the user to try to choose the right 
>> approach...
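>>  > >
>>  > > (Purely to illustrate the proposed interface; the delay argument in 
>>  > > the call below is hypothetical, it does not exist in the released 
>>  > > socketConnection(), it is just the suggestion above written out.)
>>  > >
>>  > > con <- socketConnection("localhost", port = 11999, blocking = TRUE,
>>  > >                         open = "a+b", delay = FALSE)  # hypothetical argument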
>>  > >
>>  > > Cheers,
>>  > > Simon
>>  > >
>>  > >
>>  > > > On Nov 3, 2020, at 02:28, Jeff <jeff using vtkellers.com> wrote:
>>  > > >
>>  > > > Could TCP_NODELAY and TCP_QUICKACK be exposed to the R user 
>> so that they might determine what is best for their potentially 
>> latency- or throughput-sensitive application?
>>  > > >
>>  > > > Best,
>>  > > > Jeff
>>  > > >
>>  > > > On Mon, Nov 2, 2020 at 14:05, Iñaki Ucar <iucar using fedoraproject.org> wrote:
>>  > > >> On Mon, 2 Nov 2020 at 02:22, Simon Urbanek <simon.urbanek using r-project.org> wrote:
>>  > > >>> It looks like R sockets on Linux could do with TCP_NODELAY 
>> -- without (status quo):
>>  > > >> How many network packets are generated with and without it? 
>> If there
>>  > > >> are many small writes and thus setting TCP_NODELAY causes 
>> many small
>>  > > >> packets to be sent, it might make more sense to set 
>> TCP_QUICKACK
>>  > > >> instead.
>>  > > >> Iñaki
>>  > > >>> Unit: microseconds
>>  > > >>>                    expr      min       lq     mean  median       uq      max neval
>>  > > >>>  clusterEvalQ(cl, iris) 1449.997 43991.99 43975.21 43997.1 44001.91 48027.83  1000
>>  > > >>> exactly the same machine + R but with TCP_NODELAY enabled in R_SockConnect():
>>  > > >>> Unit: microseconds
>>  > > >>>                    expr     min     lq     mean  median      uq      max neval
>>  > > >>>  clusterEvalQ(cl, iris) 156.125 166.41 180.8806 170.247 174.298 5322.234  1000
>>  > > >>> Cheers,
>>  > > >>> Simon
>>  > > >>> > On 2/11/2020, at 3:39 AM, Jeff <jeff using vtkellers.com> wrote:
>>  > > >>> >
>>  > > >>> > I'm exploring latency overhead of parallel PSOCK workers 
>> and noticed that serializing/unserializing data back to the main R 
>> session is significantly slower on Linux than it is on Windows/MacOS 
>> with similar hardware. Is there a reason for this difference and is 
>> there a way to avoid the apparent additional Linux overhead?
>>  > > >>> >
>>  > > >>> > I attempted to isolate the behavior with a test that 
>> simply returns an existing object from the worker back to the main R 
>> session.
>>  > > >>> >
>>  > > >>> > library(parallel)
>>  > > >>> > library(microbenchmark)
>>  > > >>> > gcinfo(TRUE)
>>  > > >>> > cl <- makeCluster(1)
>>  > > >>> > (x <- microbenchmark(clusterEvalQ(cl, iris), times = 1000, unit = "us"))
>>  > > >>> > plot(x$time, ylab = "microseconds")
>>  > > >>> > head(x$time, n = 10)
>>  > > >>> >
>>  > > >>> > On Windows/MacOS, the test runs in 300-500 microseconds 
>> depending on hardware. A few of the 1000 runs are an order of 
>> magnitude slower but this can probably be attributed to garbage 
>> collection on the worker.
>>  > > >>> >
>>  > > >>> > On Linux, the first 5 or so executions run at comparable 
>> speeds but all subsequent executions are two orders of magnitude 
>> slower (~40 milliseconds).
>>  > > >>> >
>>  > > >>> > I see this behavior across various platforms and hardware 
>> combinations:
>>  > > >>> >
>>  > > >>> > Ubuntu 18.04 (Intel Xeon Platinum 8259CL)
>>  > > >>> > Linux Mint 19.3 (AMD Ryzen 7 1800X)
>>  > > >>> > Linux Mint 20 (AMD Ryzen 7 3700X)
>>  > > >>> > Windows 10 (AMD Ryzen 7 4800H)
>>  > > >>> > MacOS 10.15.7 (Intel Core i7-8850H)
>>  > > >>> >
>>  > > >> --
>>  > > >> Iñaki Úcar
>>  > > >
>>  > >
>>  >
>>  >
>>  > --
>>  > Iñaki Úcar
>> 

