[Rd] parallel PSOCK connection latency is greater on Linux?

Gabriel Becker gabembecker using gmail.com
Tue Nov 2 03:07:56 CET 2021


Hi all,

Please disregard my previous email as I misread the pasted output. Sorry
for the noise.

Best,
~G

On Mon, Nov 1, 2021 at 6:45 PM Jeff <jeff using vtkellers.com> wrote:

> Hi Gabriel,
>
> Yes, 40 milliseconds (ms) == 40,000 microseconds (us). My benchmarking
> output is reporting the latter, which is considerably higher than the 40us
> you are seeing. If I benchmark just the serialization round trip as you
> did, I get comparable results: 14us median on my Linux system. So at least
> on Linux, there is something else contributing to the remaining 39,986us. The
> conclusion from earlier in this thread was that the culprit was TCP
> behavior unique to the Linux network stack.
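>
> In case it is useful, here is a minimal sketch that separates the two costs
> on one machine (the object and the iteration count are arbitrary):
>
> library(parallel)
> library(microbenchmark)
> cl <- makeCluster(1)
> # serialization/unserialization alone, no network involved
> microbenchmark(unserialize(serialize(iris, connection = NULL)),
>                times = 100, unit = "us")
> # full PSOCK round trip to one worker, network included
> microbenchmark(clusterEvalQ(cl, iris), times = 100, unit = "us")
> stopCluster(cl)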
>
> Jeff
>
> On Mon, Nov 1 2021 at 05:55:45 PM -0700, Gabriel Becker <
> gabembecker using gmail.com> wrote:
>
> Jeff,
>
> Perhaps I'm just missing something here, but ms is generally milliseconds,
> not microseconds (which are much smaller), right?
>
> Also, this seems to just be how long it takes to round-trip serialize iris
> (in R 4.1.0 on macOS, as that's what I have handy right this moment):
>
> > microbenchmark({x <- unserialize(serialize(iris, connection = NULL))})
>
> Unit: microseconds
>                                                          expr    min      lq     mean  median     uq   max neval
>  {     x <- unserialize(serialize(iris, connection = NULL)) } 35.378 36.0085 40.26888 36.4345 43.641 80.39   100
>
>
>
> > res <- system.time(replicate(10000, {x <- unserialize(serialize(iris, connection = NULL))}))
>
> > res/10000
>
>     user   system  elapsed
>
> 4.58e-05 2.90e-06 4.88e-05
>
>
> Thus the overhead appears to be extremely minimal in your results above,
> right? In fact it seems to be comparable to, or lower than, the overhead of
> replicate() itself.
>
> ~G
>
>
>
>
>
> On Mon, Nov 1, 2021 at 5:20 PM Jeff Keller <jeff using vtkellers.com> wrote:
>
>> Hi Simon,
>>
>> I see there may have been some changes to address the TCP_NODELAY issue
>> on Linux in
>> https://github.com/wch/r-source/commit/82369f73fc297981e64cac8c9a696d05116f0797
>> .
>>
>> I gave this a try with R 4.1.1, but I still see a 40 ms latency floor. Am
>> I misunderstanding these changes or how socketOptions is intended to be
>> used?
>>
>> -Jeff
>>
>> library(parallel)
>> library(microbenchmark)
>> options(socketOptions = "no-delay")
>> cl <- makeCluster(1)
>> (x <- microbenchmark(clusterEvalQ(cl, iris), times = 100, unit = "us"))
>> # Unit: microseconds
>> #                   expr  min       lq     mean   median       uq     max neval
>> # clusterEvalQ(cl, iris) 96.9 43986.73 40535.93 43999.59 44012.79 48046.6   100
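>>
>> One sanity check worth trying (this is a guess on my part, not a confirmed
>> diagnosis): the worker is a fresh Rscript process, so it does not inherit
>> options() set in the master session, and the option may therefore only be
>> taking effect on one end of the connection. Something like this shows what
>> each side sees:
>>
>> library(parallel)
>> options(socketOptions = "no-delay")             # set before makeCluster()
>> cl <- makeCluster(1)
>> getOption("socketOptions")                      # master side: "no-delay"
>> clusterEvalQ(cl, getOption("socketOptions"))    # worker side: possibly NULL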
>>
>> > On 11/04/2020 5:41 AM Iñaki Ucar <iucar using fedoraproject.org> wrote:
>> >
>> >
>> > Please check a tcpdump session on localhost while running the
>> following script:
>> >
>> > library(parallel)
>> > library(tictoc)
>> > cl <- makeCluster(1)
>> > Sys.sleep(1)
>> >
>> > for (i in 1:10) {
>> >   tic()
>> >   x <- clusterEvalQ(cl, iris)
>> >   toc()
>> > }
>> >
>> > The initialization phase comprises 7 packets. Then, the 1-second sleep
>> > will help you see where the evaluation starts. Each clusterEvalQ
>> > generates 6 packets:
>> >
>> > 1. main -> worker PSH, ACK 1026 bytes
>> > 2. worker -> main ACK 66 bytes
>> > 3. worker -> main PSH, ACK 3758 bytes
>> > 4. main -> worker ACK 66 bytes
>> > 5. worker -> main PSH, ACK 2484 bytes
>> > 6. main -> worker ACK 66 bytes
>> >
>> > The first two are the command and its ACK; the remaining four are the data
>> > coming back and their ACKs. In the first 4-5 iterations, I see no delay at
>> > all. Then, in the following iterations, a 40 ms delay starts to happen
>> > between packets 3 and 4, that is: the main process delays the ACK to
>> > the first packet of the incoming result.
>> >
>> > So I'd say Nagle is hardly to blame for this. It would be interesting
>> > to see how many packets are generated with TCP_NODELAY on. If there
>> > are still 6 packets, then we are fine. If we suddenly see a gazillion
>> > packets, then TCP_NODELAY does more harm than good. On the other hand,
>> > TCP_QUICKACK would surely solve the issue without any drawback. As
>> > Nagle himself put it once, "set TCP_QUICKACK. If you find a case where
>> > that makes things worse, let me know."
>> >
>> > Iñaki
>> >
>> > On Wed, 4 Nov 2020 at 04:34, Simon Urbanek <simon.urbanek using r-project.org>
>> wrote:
>> > >
>> > > I'm not sure the user would know ;). This is a very system-specific
>> issue just because the Linux network stack behaves so differently from
>> other OSes (for purely historical reasons). That makes it hard to abstract
>> as a "feature" for the R sockets that are supposed to be
>> platform-independent. At least TCP_NODELAY is actually part of POSIX so it
>> is on better footing, and disabling delayed ACK is practically only useful
>> to work around the other side having Nagle on, so I would expect it to be
>> rarely used.
>> > >
>> > > This is essentially an RFC since we don't have a mechanism for socket
>> options (well, almost, there is timeout and blocking already...) and I
>> don't think we want to expose low-level details so perhaps one idea would
>> be to add something like delay=NA to socketConnection() in order to not
>> touch (NA), enable (TRUE) or disable (FALSE) TCP_NODELAY. I wonder if there
>> is any other way we could infer the intention of the user to try to choose
>> the right approach...
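>> > >
>> > > Just to make that concrete (the delay= argument below is hypothetical, it
>> > > does not exist today), the call might look like:
>> > >
>> > > con <- socketConnection(host = "localhost", port = 11000, server = TRUE,
>> > >                         blocking = TRUE, open = "a+b",
>> > >                         delay = FALSE)  # hypothetical: FALSE sets TCP_NODELAY
>> > >
>> > > with delay = NA as the default, leaving the socket untouched.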
>> > >
>> > > Cheers,
>> > > Simon
>> > >
>> > >
>> > > > On Nov 3, 2020, at 02:28, Jeff <jeff using vtkellers.com> wrote:
>> > > >
>> > > > Could TCP_NODELAY and TCP_QUICKACK be exposed to the R user so that
>> they might determine what is best for their potentially latency- or
>> throughput-sensitive application?
>> > > >
>> > > > Best,
>> > > > Jeff
>> > > >
>> > > > On Mon, Nov 2, 2020 at 14:05, Iñaki Ucar <iucar using fedoraproject.org>
>> wrote:
>> > > >> On Mon, 2 Nov 2020 at 02:22, Simon Urbanek <
>> simon.urbanek using r-project.org> wrote:
>> > > >>> It looks like R sockets on Linux could do with TCP_NODELAY --
>> without (status quo):
>> > > >> How many network packets are generated with and without it? If
>> there
>> > > >> are many small writes and thus setting TCP_NODELAY causes many
>> small
>> > > >> packets to be sent, it might make more sense to set TCP_QUICKACK
>> > > >> instead.
>> > > >> Iñaki
>> > > >>> Unit: microseconds
>> > > >>>                    expr      min       lq     mean  median       uq      max neval
>> > > >>>  clusterEvalQ(cl, iris) 1449.997 43991.99 43975.21 43997.1 44001.91 48027.83  1000
>> > > >>> exactly the same machine + R but with TCP_NODELAY enabled in
>> R_SockConnect():
>> > > >>> Unit: microseconds
>> > > >>>                    expr     min     lq     mean  median      uq      max neval
>> > > >>>  clusterEvalQ(cl, iris) 156.125 166.41 180.8806 170.247 174.298 5322.234  1000
>> > > >>> Cheers,
>> > > >>> Simon
>> > > >>> > On 2/11/2020, at 3:39 AM, Jeff <jeff using vtkellers.com> wrote:
>> > > >>> >
>> > > >>> > I'm exploring latency overhead of parallel PSOCK workers and
>> noticed that serializing/unserializing data back to the main R session is
>> significantly slower on Linux than it is on Windows/MacOS with similar
>> hardware. Is there a reason for this difference and is there a way to avoid
>> the apparent additional Linux overhead?
>> > > >>> >
>> > > >>> > I attempted to isolate the behavior with a test that simply
>> returns an existing object from the worker back to the main R session.
>> > > >>> >
>> > > >>> > library(parallel)
>> > > >>> > library(microbenchmark)
>> > > >>> > gcinfo(TRUE)
>> > > >>> > cl <- makeCluster(1)
>> > > >>> > (x <- microbenchmark(clusterEvalQ(cl, iris), times = 1000, unit = "us"))
>> > > >>> > plot(x$time, ylab = "microseconds")
>> > > >>> > head(x$time, n = 10)
>> > > >>> >
>> > > >>> > On Windows/MacOS, the test runs in 300-500 microseconds
>> depending on hardware. A few of the 1000 runs are an order of magnitude
>> slower but this can probably be attributed to garbage collection on the
>> worker.
>> > > >>> >
>> > > >>> > On Linux, the first 5 or so executions run at comparable speeds
>> but all subsequent executions are two orders of magnitude slower (~40
>> milliseconds).
>> > > >>> >
>> > > >>> > I see this behavior across various platforms and hardware
>> combinations:
>> > > >>> >
>> > > >>> > Ubuntu 18.04 (Intel Xeon Platinum 8259CL)
>> > > >>> > Linux Mint 19.3 (AMD Ryzen 7 1800X)
>> > > >>> > Linux Mint 20 (AMD Ryzen 7 3700X)
>> > > >>> > Windows 10 (AMD Ryzen 7 4800H)
>> > > >>> > MacOS 10.15.7 (Intel Core i7-8850H)
>> > > >>> >
>> > > >> --
>> > > >> Iñaki Úcar
>> > > >
>> > >
>> >
>> >
>> > --
>> > Iñaki Úcar
>>
>
