HPSS migration/staging performance ORNL/NERSC

Dan Million migrated a 1 GB file from ORNL (stingray) to NERSC (raven), and then staged the file back to ORNL. stingray has a GigE interface; raven has a 100T interface. This test was done on August 8, 2001, when the ESnet RTT was about 102 ms. TCP throughput tests between ORNL and NERSC using two GigE workstations (stingray and swift) reached 300 Mbs over the ESnet OC12. The file staging operation from NERSC to ORNL looks good, achieving a high percentage of raven's 100 Mbs interface speed: the transfer reported 8.9 MB/sec (71 Mbs). TCP buffer sizes were set to 2 MB. The following figure shows the observed data rate each second during the transfer. The application appears to have "pauses", but the NIC speed mismatch does not cause a problem in this direction.
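As a sanity check (my arithmetic, not from the original report), the bandwidth-delay product of the path confirms that 2 MB buffers are ample for a 100 Mbs receiver:

```python
# Bandwidth-delay product (BDP) for the NERSC->ORNL staging direction.
# Values are taken from the text above; the calculation is illustrative.
rtt = 0.102            # ESnet round-trip time, seconds
link_bps = 100e6       # raven's 100T interface, bits/sec
bdp_bytes = link_bps * rtt / 8
print(f"BDP: {bdp_bytes / 1e6:.1f} MB")           # ~1.3 MB, well under the 2 MB TCP buffers

# Observed rate: 8.9 MB/sec reported by the transfer
observed_bps = 8.9 * 1e6 * 8
print(f"observed: {observed_bps / 1e6:.1f} Mbs")  # 71.2 Mbs of the 100 Mbs interface
```

Since the 2 MB window exceeds the path's ~1.3 MB BDP, the window is not the limit in this direction; the transfer is bounded by the 100T interface itself.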


Figure 1. NERSC to ORNL HPSS file staging data rate (Mbs).

The initial migration from ORNL to NERSC achieved only 813 KB/sec (6.5 Mbs). The TCP window size was only 256 KB, limiting bandwidth to roughly 20 Mbs over the 102 ms path. The following figure shows the sawtooth behavior of the TCP transfer.
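The 20 Mbs figure follows from dividing the window by the round-trip time. A quick sketch, using the values from the text:

```python
# TCP throughput ceiling imposed by a fixed window: rate <= window / RTT.
window = 256 * 1024     # bytes (256 KB window on the migration)
rtt = 0.102             # seconds
ceiling_bps = window * 8 / rtt
print(f"window-limited ceiling: {ceiling_bps / 1e6:.1f} Mbs")  # ~20.6 Mbs

# The transfer actually achieved far less, so loss recovery (the sawtooth),
# not the window alone, is the dominant limit.
achieved_bps = 813e3 * 8   # 813 KB/sec reported
print(f"achieved: {achieved_bps / 1e6:.1f} Mbs")               # ~6.5 Mbs
```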


Figure 2. ORNL to NERSC HPSS file migration data rate (Mbs).

We collected a tcpdump of this transfer in order to more closely observe the behavior of TCP. A tcptrace analysis of the tcpdump trace reported 2,233 packet retransmissions in 33 recovery events. The following xplot figure shows a closer look at the sawtooth bandwidth.


Figure 3. ORNL to NERSC HPSS file migration, transmitted packet numbers over time.

It appears that the application sends about 30 MB of data between loss events; the HPSS mover uses 32 MB buffers.
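A rough consistency check (my arithmetic, not from the report): if one loss event occurs per mover-buffer pause, the block size and file size should predict the number of recovery events tcptrace saw.

```python
# One loss event per 32 MB mover buffer over a 1 GB file should give
# about 1024/32 = 32 loss events, close to the 33 recovery events
# reported by tcptrace.
file_mb = 1024          # 1 GB file
block_mb = 32           # HPSS mover buffer size
print(file_mb // block_mb)   # 32
```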

The following xplot zooms in on one of the loss events. The white x's are packet transmissions; the purple marks are SACKs (selective acknowledgments indicating loss). There is a pause in the transmissions (200+ ms), then a burst of transmissions just after the 77.1 second mark. Most of the packets in this burst are lost.


Figure 4. ORNL to NERSC HPSS file migration TCP loss event.

It appears the application pauses, and then sends a 200KB burst of data. From the following figure, we can see that the ORNL side pauses awaiting data from the NERSC side at the end of each 32 MB block.


Figure 5. ORNL to NERSC HPSS file migration, data from NERSC.

The burst results in many packet losses and retransmissions; it appears that the GigE burst is more than the 100T router can buffer. We saw similar problems with the sender NIC being too fast last year, when stingray had a direct GigE connection to the ORNL ESnet router but the path was restricted by FDDI at NERSC (100 Mbs). Netperf tests to raven from an ORNL 100T host show better throughput than from a GigE host. RFC 2861 says TCP should go to slow-start after an idle period, but AIX will not go to slow-start if the other side sends data (typical of http applications), so AIX blasts a full window of data. However, in this case even slow-start would cause a loss, as there is a loss during slow-start at the start of the session. The application needs to be modified to prevent this pause, so TCP does not lose its "ACK pacing".
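The claim that the burst overruns the 100T router can be checked with simple arithmetic (a sketch; the router's actual buffer size is not given in the report). A 200 KB burst arriving at GigE rate drains at only 100 Mbs, so nearly the whole burst must be queued:

```python
# How much of a 200 KB GigE burst must the 100T router queue?
# The router absorbs the difference between arrival rate and drain rate.
burst = 200 * 1024      # bytes in the application's burst
in_bps = 1e9            # GigE arrival rate
out_bps = 100e6         # 100T drain rate
burst_time = burst * 8 / in_bps      # time for the burst to arrive (~1.6 ms)
drained = out_bps / 8 * burst_time   # bytes forwarded while the burst arrives
queued = burst - drained
print(f"queue needed: {queued / 1024:.0f} KB")   # ~180 KB
```

If the router's output buffer is smaller than ~180 KB, the tail of each burst is dropped, which matches the many lost packets seen in the xplot above.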

Similar results were reported by King.


Last Modified thd@ornl.gov