Sally Floyd, as part of her research on
high speed TCP,
has suggested
modifications to TCP slow start and to
TCP's AIMD for high-delay, high-bandwidth networks.
As part of our Net100 research,
we have modified the Linux 2.4.16 Web100
kernel with Sally's slow-start
mods.
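In outline, the slow-start mod (Floyd's "limited slow-start") caps how fast cwnd can open once it passes a configured threshold, max_ssthresh. The following per-ACK update, in byte units, is our paraphrase of Sally's draft, not the actual kernel patch:

    /* Limited slow-start per-ACK cwnd update, byte units, after
     * Floyd's limited slow-start draft (our paraphrase, not the
     * Net100 kernel code).  Below max_ssthresh the window grows by
     * one MSS per ACK as in standard slow-start; above it the
     * increase is MSS/K with K = cwnd/(max_ssthresh/2), so cwnd
     * opens by at most max_ssthresh/2 segments per RTT.
     */
    unsigned int limited_slow_start(unsigned int cwnd, unsigned int mss,
                                    unsigned int max_ssthresh)
    {
        if (max_ssthresh == 0 || cwnd <= max_ssthresh)
            return cwnd + mss;                 /* standard slow-start */
        unsigned int k = cwnd / (max_ssthresh / 2);
        return cwnd + (k ? mss / k : mss);     /* capped growth */
    }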
The following graph shows two TCP transfers from ORNL to NERSC (80ms RTT)
between two Linux boxes.
The NERSC box (receiver) is advertising a window of 2 MB.
The ORNL sender uses a window of 1 MB.
One test was done with no tuning.
The other test used our Net100 tuning daemon
(WAD) to enable Sally's slow start
mods for the flow.
The test using standard slow-start experienced an early loss event in
slow start.
The flow using Sally's slow-start made it much further, but still
had a loss in the modified slow start (plus some other losses and a timeout --
such is testing on the real Internet...).
The WAD-tuned additive increase is 6x the standard one segment per RTT.
The cwnd data was collected by a web100 tracer daemon (i.e., kernel
values of cwnd, not values inferred from tcpdump/tcptrace).
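The tracer just polls the kernel's per-connection Web100 variables at a fixed interval. A toy version of the polling loop is sketched below; the /proc path and connection id are placeholders, as the real Web100 file layout and libweb100 API differ:

    /* Toy cwnd tracer: poll a per-connection Web100 variable and log
     * (time, cwnd) pairs.  The path below is hypothetical; the real
     * Web100 /proc layout and access API differ.
     */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            FILE *f = fopen("/proc/web100/0001/CurCwnd", "r"); /* hypothetical */
            if (f) {
                unsigned long cwnd;
                if (fscanf(f, "%lu", &cwnd) == 1)
                    printf("%ld %lu\n", (long)time(NULL), cwnd);
                fclose(f);
            }
            usleep(10000);          /* 10 ms sampling interval */
        }
    }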
The following graph shows a WAD-tuned
TCP iperf transfer from ORNL to LBNL (80 ms RTT)
using Sally's slow-start with two different settings for max_ssthresh.
We also added Sally's slow start and her AIMD mods
to our TCP-over-UDP test harness, atou.
The following graph compares slow-start
for four TCP-over-UDP transfers from ORNL to NERSC
(OC12, RTT 80 ms)
with a max window size of 1000 segments.
The graph illustrates the stretching of slow-start with Floyd's max_ssthresh
set to 100 both with and without delayed ACKs.
For comparison, the following plot shows an ns-2 simulation of Floyd's
slow-start and standard slow-start for a simulated network path
like the ORNL-NERSC path (300 Mbs, 80 ms RTT, max window 1000, delayed ACKs).
In another test from PSC to LBL, we were getting losses in slow-start about 0.7 seconds into the transfer and the throughput was only 98 Mbs for the 10 second transfer (2000 segment window, 75 ms RTT, no delayed ACK). The congestion window at the time of the loss was 367 segments. Using Floyd's modified slow-start we eliminated losses and got 192 Mbs (max_ssthresh=100) and 204 Mbs (max_ssthresh=200).
The following image shows the stretched slow-start for a TCP-over-UDP transfer
from ORNL to CERN (RTT 190 ms) with the max window set to 1000 segments;
the receiver is not doing delayed ACKs.
The following image compares three TCP transfers from ORNL to CERN.
One uses 4 parallel streams; the other two are single flows WAD-tuned with
AIMD (0.125,4) to try to approximate the parallel-stream performance
(the decrement and increment scaled roughly as if four streams were in use).
One of the tuned flows uses Floyd's modified slow start and, in this particular
test, attains only 29 Mbs compared with 38 Mbs for the other tuned flow.
The aggregate of the parallel streams achieved 82 Mbs.
We have experimented with using the Vegas gamma threshold during slow-start
to initiate Floyd's slow start (rather than max_ssthresh).
The figure below shows two transfers using a Vegas implementation
over our TCP-over-UDP harness from PSC to ORNL (OC12, 77 ms RTT).
Standard Vegas exits slow-start and enters the congestion-avoidance phase
when the Vegas delta grows larger than gamma.
The standard Vegas flow (actually using the Floyd AIMD mods) slowly increases
the window in the congestion-avoidance phase, further dampened by the Vegas
alpha/beta (1/3 in this test), and reaches only 36 Mbs.
The second transfer switches to Floyd's slow-start when gamma is exceeded.
Both transfers exceed gamma about 0.5 seconds into the transfer.
The Floyd version climbs quickly to the limiting window of 2000 segments
and reaches an average throughput of 232 Mbs -- nearly a factor of 8
improvement.
Interestingly, for these tests,
the cwnd value when gamma (set to 1) is exceeded is around 120 segments,
close to Floyd's suggested max_ssthresh of 100.
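The trigger itself is just the usual Vegas queue estimate compared against gamma once per RTT. A sketch, with variable names ours and units in segments and seconds:

    /* Once-per-RTT check during slow-start (segment units; names
     * ours).  delta estimates the number of segments queued in the
     * network: expected rate minus actual rate, scaled by the base
     * RTT.  base_rtt and last_rtt are in seconds.
     */
    static int vegas_gamma_exceeded(double cwnd, double base_rtt,
                                    double last_rtt, double gamma)
    {
        double expected = cwnd / base_rtt;      /* no-queueing rate */
        double actual   = cwnd / last_rtt;      /* measured rate    */
        double delta    = (expected - actual) * base_rtt;
        return delta > gamma;
    }

When this returns true, standard Vegas drops into congestion avoidance; the modified flow instead continues under Floyd's limited slow-start, the cwnd at the crossing point in effect taking the place of max_ssthresh (one way to read the tests above).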
The figure below illustrates three different TCP-over-UDP
transfers from PSC to ORNL.
The path was exhibiting packet drops during slow-start.
The standard slow-start algorithm suffers a loss and recovers using
Floyd's AIMD and reaches an average throughput of 198 Mbs after 10 seconds.
By comparison, the Vegas-controlled transfers exit standard slow-start
when gamma is exceeded, switch to Floyd's modified slow-start,
and suffer no packet losses.
The Vegas flow with a gamma of 2 reaches 219 Mbs;
the flow with a gamma of 60 reaches 229 Mbs.
Floyd's modified AIMD
We have installed Sally's AIMD mods in the Linux kernel,
and our WAD
has an option to periodically (every 0.1 seconds)
tune AIMD values for designated flows using Sally's table.
The following plot compares two transfers from ORNL to CERN (160 ms RTT).
One transfer uses standard TCP AIMD (0.5,1), and the second
uses the kernel version of Floyd's modified AIMD.
The effect
of the altered decrement and increment is apparent.
As another example,
the following plot compares three data transfers from ORNL to NERSC (OC12,
80 ms RTT).
Two transfers use standard TCP AIMD (0.5,1); in one flow, slow-start is
prematurely terminated by a Linux "sendstall" (NIC send buffer overrun).
In the second flow, we used the WAD to disable sendstalls.
The third flow uses Floyd's modified AIMD.
(As always with tests on the "live" Internet, results vary.)
In October 2002, Tierney of LBNL did more systematic testing of Floyd's HSTCP. Here are some early results for HSTCP in the Net100 kernel (2.4.19). These results are averages of 6 to 30 thirty-second iperf tests for each path.
In the following plot, we have used the WAD at 0.1 second intervals
to tune the flows.
(The kernel implementation continuously updates the AIMD values.)
One can see the slope of the recovery increasing
as cwnd increases, and one can see that the multiplicative decrease
is no longer 0.5 for the WAD/Floyd tuned flow.
Two tests are illustrated using 2 MB buffers for a 60 second
transfer from ORNL to LBNL (OC12, 80 ms RTT).
(The better slow-start of the Floyd flow is just the luck of that
particular test.)
The following graph shows a TCP-over-UDP
transfer from ORNL to NERSC
(OC12, RTT 87 ms)
with a max window size of 1000 segments.
This path often experiences loss during slow start.
The receiver is not doing
delayed ACKs. Floyd's max_ssthresh is set to 100 segments,
which stretches the
slow start by about 1.8 seconds.
There is a single loss at 4.73 seconds.
Floyd's modified AIMD for a cwnd of 1000 segments reduces cwnd
by a factor of 0.329878 (rather than by 0.5 as in standard TCP) and does
linear recovery starting at 7 segments per RTT and rising to 8 segments per RTT
(rather than 1 per RTT for standard TCP).
The
resulting bandwidth for this 10 second transfer is 100 Mbs.
We also plot a line indicating what the cwnd recovery would have looked
like for standard TCP.
For this path, standard TCP would have taken over 45 seconds to reopen
cwnd to 1000 segments.
Additional plots related to this transfer are here.
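For concreteness, the increase/decrease pairs in Sally's table come from the HSTCP response function. The following sketch evaluates the equations as we read the draft (constants Low_Window 38, High_Window 83000, High_Decrease 0.1); for a cwnd of 1000 segments it reproduces the 0.33 decrement and the increment of about 8 quoted above:

    /* HSTCP a(w)/b(w) per Floyd's draft, as we read it (not the
     * kernel code).  w is cwnd in segments.
     */
    #include <math.h>
    #include <stdio.h>

    #define LOW_WINDOW    38.0
    #define HIGH_WINDOW   83000.0
    #define HIGH_DECREASE 0.1

    static double hstcp_b(double w)   /* multiplicative decrease */
    {
        if (w <= LOW_WINDOW)
            return 0.5;               /* standard TCP below Low_Window */
        return 0.5 + (HIGH_DECREASE - 0.5) *
               (log(w) - log(LOW_WINDOW)) /
               (log(HIGH_WINDOW) - log(LOW_WINDOW));
    }

    static double hstcp_a(double w)   /* additive increase, segs/RTT */
    {
        if (w <= LOW_WINDOW)
            return 1.0;
        double p = 0.078 / pow(w, 1.2);   /* HSTCP response function */
        double b = hstcp_b(w);
        return w * w * p * 2.0 * b / (2.0 - b);
    }

    int main(void)
    {
        /* Prints roughly b = 0.33 and a = 8 for w = 1000. */
        printf("b(1000) = %f, a(1000) = %f\n", hstcp_b(1000), hstcp_a(1000));
        return 0;
    }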
For comparison, we set up a simulation on ns-2 to compare Floyd's AIMD
and the standard AIMD. (We modified ns to use the modified multiplier
on the first loss.)
The ns-2 version uses a low-window of 31 (rather than the 38 in Sally's
draft); for a 1000 segment cwnd, the decrement is 0.323949
and the additive increase is 8.6.
The following graph simulates the ORNL-NERSC path with a single drop
and no delayed ACKs.
Our TCP tuning daemon, WAD, also does
AIMD pre-tuning.
The WAD can be preconfigured with AIMD values for a path, or can
use Floyd's equations to calculate an AIMD based on the buffer sizes
configured for the path.
The AIMD modifications are done just once, as compared with Floyd's
continuous update model.
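With the hstcp_a()/hstcp_b() helpers sketched earlier, the pre-tuning amounts to evaluating the pair once at the window implied by the configured buffers (a sketch of the idea; the WAD's actual bookkeeping differs):

    /* One-shot AIMD pre-tuning from the configured buffer size
     * (sketch; uses hstcp_a()/hstcp_b() from the previous listing).
     */
    void wad_pretune(unsigned int buf_bytes, unsigned int mss,
                     double *ai, double *md)
    {
        double w = (double)buf_bytes / mss;   /* target window, segments */
        *ai = hstcp_a(w);
        *md = hstcp_b(w);
    }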
The following cwnd graph of an ns-2 simulation shows a packet loss early
in slow start for a 500 Mbs path with 80 ms RTT and the maximum window set
at 3500 segments.
Standard AIMD is compared with Floyd's AIMD and a static pre-tuned AIMD (0.3,15).
Also compared are the algorithms with and without delayed ACKs.
Experiencing an early loss on this high delay, high bandwidth path
results in terrible performance for standard TCP AIMD.
This is a worst-case scenario, but packet loss in initial slow start
is all too common.
The number of seconds before the transmission rate reaches full bandwidth
after the early packet loss is summarized in the following table.
Using jumbo frames (a 9000 byte MTU rather than 1500), performance
improves by a factor of 6 for all algorithms.
Tom Kelly's Scalable TCP
uses a fixed AI recovery method with an MD of 0.125.
The following plot compares the performance of Kelly's TCP with Floyd's
and with standard TCP. The test is from California to CERN. Each flow
was run independently, with a UDP blast injected around 3 seconds
into each flow.
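Kelly's update rule is simpler than Floyd's table: a fixed gain per ACK and a fixed decrease on loss, so recovery time after a loss does not grow with the window. A sketch in segment units, using the constants from Kelly's proposal as we understand it:

    /* Scalable TCP updates (segment units), after Kelly's proposal
     * as we understand it: a fixed gain of 0.01 per ACK and a fixed
     * 1/8 decrease on loss, giving a loss-recovery time that is
     * independent of the window size.
     */
    double stcp_on_ack(double cwnd)  { return cwnd + 0.01; }
    double stcp_on_loss(double cwnd) { return cwnd - 0.125 * cwnd; }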