WAD -- TCP tuning daemon

As part of our Net100 efforts to improve bulk transfers over high-speed, high-latency networks, we have developed a TCP tuning daemon, WAD (a workaround daemon), based on a Web100-modified Linux kernel. The WAD can auto-tune various TCP parameters of designated network flows. Our hope is to work around various application, kernel, and protocol bottlenecks by

accelerating data transfers (larger socket buffer space, disabled delayed ACKs, faster slow start via initial segments and slow-start increment)
avoiding packet loss (modified slow start, reorder threshold, Vegas-like controls, burst reduction)
speeding recovery (virtual MSS, modified AIMD)

Our version 1 WAD uses a static configuration file, wad.conf. An entry in the WAD config file looks like

    [net100.lbl.gov]
    src_addr: 0.0.0.0
    src_port: 0
    dst_addr: 131.243.2.93
    dst_port: 0
    mode: 1
    sndbuf: 4000000
    rcvbuf: 4052159
    wadai: 6
    wadmd: .3
    maxssth: 100
    divide: 0
    floyd: 1

If "mode" is 1, the WAD will tune the flow even if the application has done its own setsockopt() on the RCV/SNDBUFs. If "mode" is 2, the WAD will use NTAF data for the buffer sizes.

If "floyd" is 2, the WAD will dynamically update (every 0.1 seconds) the AIMD for the flow using Floyd's AIMD tables. If "floyd" is 1, the WAD will enable the kernel version of Floyd's AIMD tuning (continuous).

The "wadai" field modifies TCP's additive increase for the flow. The "wadmd" field modifies TCP's multiplicative decrease. The "maxssth" field enables Floyd's modified slow start (you need the event-driven WAD to tune slow start; polling may be too late). Here are some early results using WAD to enable Floyd's slow-start for designated flows.

If "divide" is 1, the WAD will dynamically reallocate the buffer size among concurrent flows; otherwise each flow always gets the full buffer size.
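The sample wad.conf entry above is close to INI syntax, so reading it can be sketched with Python's standard configparser. This is only an illustration, not the WAD's actual parser: the helper names parse_wad_conf and matches are hypothetical, and treating 0.0.0.0 and port 0 as wildcards is an assumption inferred from the sample entry.

```python
import configparser

SAMPLE = """\
[net100.lbl.gov]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 131.243.2.93
dst_port: 0
mode: 1
sndbuf: 4000000
rcvbuf: 4052159
wadai: 6
wadmd: .3
maxssth: 100
divide: 0
floyd: 1
"""

# Fields assumed to hold integers; wadmd is treated as a float.
INT_KEYS = {"src_port", "dst_port", "mode", "sndbuf", "rcvbuf",
            "wadai", "maxssth", "divide", "floyd"}


def parse_wad_conf(text):
    """Return {entry_name: {key: value}} with numeric fields converted."""
    parser = configparser.ConfigParser(delimiters=(":",), interpolation=None)
    parser.read_string(text)
    entries = {}
    for name in parser.sections():
        entry = {}
        for key, value in parser.items(name):
            if key in INT_KEYS:
                entry[key] = int(value)
            elif key == "wadmd":
                entry[key] = float(value)
            else:
                entry[key] = value
        entries[name] = entry
    return entries


def matches(entry, src_addr, src_port, dst_addr, dst_port):
    """Assumed semantics: 0.0.0.0 and port 0 match any address or port."""
    def addr_ok(want, got):
        return want == "0.0.0.0" or want == got

    def port_ok(want, got):
        return want == 0 or want == got

    return (addr_ok(entry["src_addr"], src_addr)
            and port_ok(entry["src_port"], src_port)
            and addr_ok(entry["dst_addr"], dst_addr)
            and port_ok(entry["dst_port"], dst_port))
```

With this sketch, parse_wad_conf(SAMPLE)["net100.lbl.gov"] yields a dictionary whose sndbuf is 4000000, and matches() accepts any flow whose destination address is 131.243.2.93.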

The current version of the WAD can either poll for new connections or the kernel can notify the WAD via a netlink socket that a new connection has been established (see info on event notification). When the WAD identifies a new connection, it checks the configuration file to see if the flow should be tuned.
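The polling mode can be pictured as comparing successive snapshots of the connection table. The sketch below is a simplified illustration, not the WAD's code: connections are represented as (src_addr, src_port, dst_addr, dst_port) tuples, and snapshot is a hypothetical callable standing in for whatever reads the kernel's connection list.

```python
def new_connections(previous, current):
    """Return connections present in `current` but not in `previous`."""
    return set(current) - set(previous)


def poll_once(snapshot, seen, on_new):
    """Take one snapshot, report each new flow, and return the new 'seen' set.

    snapshot: hypothetical callable returning an iterable of 4-tuples.
    seen:     set of connections observed at the previous poll.
    on_new:   callback invoked once per newly observed connection.
    """
    current = set(snapshot())
    for conn in sorted(new_connections(seen, current)):
        on_new(conn)
    # Closed connections drop out because only the current set is kept.
    return current
```

A flow that opens and finishes its slow start between two polls is only seen after the fact, which is one reason the event-driven netlink notification is preferable for slow-start tuning.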

We are testing the WAD over various high speed, high delay links. Here are some preliminary tuning results using the WAD. The following figure shows the bandwidth for a 10 second netperf transfer from ORNL to PSC.

The receiver at PSC (80 ms RTT, OC12) advertises a 2 MB window, and the plot shows the throughput when the transmitter uses a 64 KB send buffer (typical default) or a WAD-tuned 2 MB buffer. The data for this graph were collected dynamically at the sender from the Web100 MIB variables using ORNL's Web100 tracer daemon, a variation of LBL's Python WAD daemon. WAD/Web100 can only tune up to the window-scale factor used by the application in the initial SYN packet. Web100 provides a sysctl variable to set the initial scale factor. We are investigating other TCP parameters that we might "tune", such as a virtual MSS, AIMD parameters, dup threshold, etc.

We have also deployed WAD on both ends of the connection. In a 10 second iperf test, the WAD-tuned flow (1 MB buffer) gets 57 Mb/s vs 6 Mb/s with the 16K default buffers. (Using 1 MB buffers on both ends, the iperf gives 81 Mb/s. Using a 1 MB buffer on the iperf server and letting Linux 2.4 autotune the client achieves 77 Mb/s. Linux autotuning will not tune the receiver, so the receiver must advertise a "big enough" window.) We have an auto-tuning summary that describes other approaches to dynamically tuning TCP.
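The buffer sizes above follow from simple window arithmetic: a flow can send at most one window per round trip, and the window that can be advertised is capped by the scale factor negotiated in the SYN. A rough sketch of that arithmetic follows; the function names are hypothetical, and the OC12 line rate of 622.08 Mb/s ignores framing overhead.

```python
def window_limited_bps(window_bytes, rtt_s):
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


def bdp_bytes(link_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return link_bps * rtt_s / 8


def min_window_scale(window_bytes):
    """Smallest TCP window-scale shift whose maximum window covers window_bytes.

    The unscaled window field is 16 bits (65535); the shift is capped at 14.
    """
    scale = 0
    while (65535 << scale) < window_bytes and scale < 14:
        scale += 1
    return scale


if __name__ == "__main__":
    rtt = 0.080  # 80 ms ORNL to PSC
    print(window_limited_bps(64 * 1024, rtt) / 1e6, "Mb/s with a 64 KB buffer")
    print(bdp_bytes(622.08e6, rtt) / 1e6, "MB to fill an OC12 at 80 ms")
    print(min_window_scale(2_000_000), "= scale needed for a 2,000,000 byte window")
```

At 80 ms a 64 KB buffer caps the sender at roughly 6.5 Mb/s regardless of link speed, which is why the untuned flow in the plot stays far below the tuned one.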

A bigger MSS should help both network and operating system performance. We have modified the Linux kernel so that the WAD can use a "virtual MSS" for designated flows. The "virtual MSS" is implemented by adding one segment to cwnd a constant K times per RTT during congestion avoidance. The virtual MSS does not cause IP fragmentation or reduce the interrupt overhead. The effect of the virtual MSS is best illustrated when there is packet loss. The following plot illustrates two transfers from ORNL to NERSC with packet loss during slow start. Both flows use the same TCP buffer sizes, but one flow is dynamically tuned by the WAD to use a virtual MSS of 6 segments.
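One way to picture the virtual MSS: in congestion avoidance, standard TCP adds about 1/cwnd segments per ACK, which sums to one segment per RTT; a virtual MSS of K scales the per-ACK increment to K/cwnd. The sketch below works under that assumption, with one ACK per segment and the increment fixed at the start of each round trip. It is not the kernel code, and the function name is hypothetical.

```python
def congestion_avoidance_rtt(cwnd, k=1):
    """Advance cwnd (in segments) through one RTT of congestion avoidance.

    k is the virtual MSS factor: cwnd grows by about k segments per RTT.
    Simplification: the per-ACK increment is computed once per RTT.
    """
    acks = int(cwnd)       # assume one ACK per in-flight segment
    increment = k / cwnd   # per-ACK growth, in segments
    for _ in range(acks):
        cwnd += increment
    return cwnd


if __name__ == "__main__":
    print(congestion_avoidance_rtt(100.0))        # about 101 segments
    print(congestion_avoidance_rtt(100.0, k=6))   # about 106 segments
```

Because only the window growth changes, packets on the wire keep the real MSS, which is consistent with the virtual MSS causing no IP fragmentation.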

Our WAD can also further improve recovery after a loss, and hence throughput, by altering TCP's multiplicative decrease and additive increase. Normally, TCP halves cwnd after a loss (a multiplicative decrease of 0.5) and increases cwnd by 1 segment per round-trip time. In the following graph we plot two different tests between ORNL and NERSC, one with standard TCP and the other with the WAD tuning the multiplicative decrease to only 0.3 and the additive increase to 6. This example also illustrates the typical packet loss during slow start.
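The benefit of the tuned AIMD values can be estimated with a little arithmetic. The sketch below assumes the multiplicative decrease is the fraction of cwnd removed after a loss (so 0.5 halves the window and 0.3 leaves 70% of it) and that the additive increase is in segments per RTT. The function name is hypothetical and the model ignores further losses during recovery.

```python
import math


def rtts_to_recover(cwnd_segments, md=0.5, ai=1):
    """Round trips needed to regain the pre-loss cwnd after one loss.

    md: fraction of cwnd removed at the loss (assumed interpretation).
    ai: segments added to cwnd per RTT during congestion avoidance.
    """
    lost = cwnd_segments * md
    return math.ceil(lost / ai)


if __name__ == "__main__":
    window = 1000  # segments, an illustrative value
    print("standard TCP:", rtts_to_recover(window), "RTTs")
    print("WAD tuned   :", rtts_to_recover(window, md=0.3, ai=6), "RTTs")
```

For a 1000-segment window, standard TCP needs 500 round trips to recover, while the tuned flow needs about a tenth of that.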

We have recently installed Sally Floyd's AIMD mods in the Linux kernel, and our WAD has an option to periodically (every 0.1 seconds) tune AIMD values for designated flows using Floyd's table. In the following plot, one can see the slope of the recovery increasing as cwnd increases, and one can see that the multiplicative decrease is no longer 0.5 for the WAD/Floyd tuned flow. A kernel implementation would continuously update the AIMD values. Two tests are illustrated using 2 MB buffers for a 60 second transfer from ORNL to LBNL (OC12, 80 ms RTT). (The better slow start of the Floyd flow is just the luck of that particular test.)
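To show how the AIMD values depend on cwnd, the sketch below computes them from the HighSpeed TCP formulas as later published in RFC 3649, using that document's default constants (Low_Window 38, High_Window 83000, High_Decrease 0.1). Whether the WAD's table uses exactly these constants is an assumption; the function names are hypothetical.

```python
import math

LOW_WINDOW = 38        # below this, behave like standard TCP
HIGH_WINDOW = 83000
HIGH_DECREASE = 0.1    # decrease factor at HIGH_WINDOW


def hstcp_b(w):
    """Multiplicative decrease b(w): fraction of cwnd removed after a loss."""
    if w <= LOW_WINDOW:
        return 0.5
    frac = ((math.log(w) - math.log(LOW_WINDOW))
            / (math.log(HIGH_WINDOW) - math.log(LOW_WINDOW)))
    return (HIGH_DECREASE - 0.5) * frac + 0.5


def hstcp_p(w):
    """Loss rate that sustains an average window of w segments."""
    return 0.078 / w ** 1.2


def hstcp_a(w):
    """Additive increase a(w), in segments per RTT (unrounded)."""
    if w <= LOW_WINDOW:
        return 1.0
    b = hstcp_b(w)
    return w * w * hstcp_p(w) * 2.0 * b / (2.0 - b)


if __name__ == "__main__":
    for w in (38, 100, 1000, 10000, 83000):
        print(w, round(hstcp_a(w), 2), round(hstcp_b(w), 3))
```

The increase grows and the decrease shrinks as cwnd grows, which matches the steepening recovery slope and the smaller-than-0.5 backoff described for the plot. A daemon polling every 0.1 seconds would look these values up from the current cwnd and write them to the flow.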

(Also see our WAD Floyd slow-start results.)

Tierney of LBNL did more systematic testing of Floyd's HSTCP in October 2002. Here are some early results for testing HSTCP in the net100 kernel (2.4.19). These results are averages of 6 to 30 iperf tests, each 30 seconds long, for each path.

The following two graphs show a series of GridFTP tests transferring a 200 MB file from ORNL to LBNL (64K IO buffer, 4 MB TCP buffer) for an untuned stream, 4 parallel streams, and a WAD-tuned AIMD (.125,4) stream. The single tuned stream is configured to perform like the 4 parallel streams (see multcp). (A fully untuned stream, 64K TCP buffer, takes 200 seconds at 8 Mb/s.) In one series the tuned single stream outperforms the parallel streams. In the second series of tests, they perform about equally. The tuned stream in the second plot also includes Floyd's modified slow start (max_ssthresh 100).
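The AIMD pair (.125,4) is what the MulTCP idea gives for 4 streams: a single flow emulating N flows adds N segments per RTT and, on a loss, gives up only 1/(2N) of its window, since only one of the N virtual flows would have halved. A small sketch of that mapping and of the transfer-time arithmetic follows; the function names are hypothetical.

```python
def multcp_aimd(n_streams):
    """Return (multiplicative_decrease, additive_increase) emulating n streams."""
    return 1.0 / (2 * n_streams), n_streams


def transfer_seconds(size_bytes, rate_bps):
    """Time to move size_bytes at a steady rate_bps."""
    return size_bytes * 8 / rate_bps


if __name__ == "__main__":
    print(multcp_aimd(4))                  # decrease and increase for 4 streams
    print(transfer_seconds(200e6, 8e6))    # 200 MB at 8 Mb/s, in seconds
```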

We have no conclusive results yet on tuning the buffer sizes of parallel flows. Parallel flows have an advantage in slow start, because the WAD cannot set the number of segments initially sent at the beginning of slow start, though doubling the number of initial segments only reduces the slow-start duration by one RTT. The WAD can tune the slow-start increment to make slow start more aggressive; we are still experimenting with this. With an increment of K, slow-start time is reduced by a factor of log(K)/log(2). For other parallel results see here.
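Both slow-start claims can be checked with a toy model in which cwnd is multiplied by a fixed factor every RTT until it reaches a target. The sketch assumes no losses and reads K as the per-RTT growth factor (2 for standard slow start); that reading of the increment, and the function name, are assumptions.

```python
def slow_start_rtts(target_cwnd, initial_cwnd=1, growth=2):
    """Round trips for cwnd to reach target_cwnd, multiplying by growth per RTT."""
    if growth <= 1:
        raise ValueError("growth must be greater than 1")
    rtts = 0
    cwnd = initial_cwnd
    while cwnd < target_cwnd:
        cwnd *= growth
        rtts += 1
    return rtts


if __name__ == "__main__":
    print(slow_start_rtts(1024, initial_cwnd=1))   # standard slow start
    print(slow_start_rtts(1024, initial_cwnd=2))   # doubled initial window
    print(slow_start_rtts(4096, growth=4))         # more aggressive growth
```

Doubling the initial window saves exactly one round trip in this model, while raising the growth factor from 2 to 4 halves the number of round trips, in line with log(4)/log(2) = 2.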

For future versions of the WAD, we are considering having the end-point WADs exchange tuning information on an active flow. We are considering a number of additional TCP variables that might be tuned. The current list is as follows:

Work Around Daemon
Some Possible WAD algorithms (Mathis 11/27/01)

Notes  Description
-----  -----------
       DUP threshold (out-of-order resilience)
       Initial receive window/scale factor
       cwnd high limit
D      ssthresh low limit
d      Set initial cwnd
d      Set restart window (e.g. after a timeout)
       Set initial ssthresh
D      AIMD constants (also virtual MSS)
d      Force MTU down (pseudo pacing)
K      Spin pacing
KKn    True pacing (open research question)
       Explicit burst size limits
KD     Force MTU up (and IP fragmentation)
       Force client "from" address
       Force per connection routing (pseudo redirects)
       Force per connection source route option (LSRR)
a      Use multiple connections
a      Negotiate application copy buffer size
       Set socket "ready" thresholds
K      Interrupt coalescing parameters
K      ACK frequency/(long) delayed ack/delayed window parameters
d      Nagle parameters
K      In situ path diagnostics to run under any app (diagnostic anti-tune)

d = dangerous to the net or other applications
D = very dangerous to the net or other applications
    (All may have adverse effects on the local host or application)
K = requires non-trivial kernel support
a = requires WAD to Application API (all others are WAD to kernel)
    (all assume some kernel support)
n = not really a workaround (e.g. should be in TCP)

For some preliminary studies of the effect of delayed ACKs and altering the reorder threshold, see our TCP-over-UDP page.

Cryptic summary of WAD implementation here 8/11/03.

Links

SC02 paper A TCP Tuning Daemon
our Web100 page
the Net100 project PSC/ORNL/LBL
summary page on our bulk transfer work
info on our almost TCP over UDP test harness
Visit the network performance links page for RFC's and papers.

Last Modified thd@ornl.gov
back to Tom Dunigan's page or the Net100 page