Faster Bulk Transfer Starring: *UDP*
Study and Evaluation of 3 UDP protocols
3.2. TSUNAMI
Tsunami is another protocol born out of desperation with TCP. The authors
were working toward the launch of the Global Terabit Research Network.
There was a launch demonstration at a meeting in Brussels and they
wanted to do something flashy and memorable. They had demonstrated wire-rate
gigabit Ethernet transfers in their lab using normal Ethernet MTUs
and were confident they could easily achieve more than 500 Mb/s. One
PC was shipped to Belgium and one to Seattle. Once they were set
up, the testing began. Because of a 3% packet loss, the rates varied
from a few tens of Mb/s to a very few hundreds of Mb/s. Less than one
week before the demo, the Lab decided they were going to have to
design their own protocol. Less than 3 days after a whiteboard
diagram, there was a working prototype and a few days later the
demo managed to average over 800 Mb/s for 17 hours and 40 minutes.
From the general comments and the amount of time involved, this was
probably a UDP blast with a minimum of other features.
Tsunami is evolving, however, to include some features intended to
make it more network-friendly such as:
- A measured inter-datagram/block delay has been incorporated into
the Tsunami sender
- Improvements have been made to the user interface allowing the
user to regulate up to 12 parameters including sending rate and
maximum tolerated error rate
- The REQUEST_ERROR_RATE packet contains an error rate for the
sender to use in adjusting the IBD (inter-block delay) during the
transfer
- MD5-based authentication has been added
- There is a plan for UDP packets to be marked for "Scavenger" or
less-than-best-effort service on networks supporting class of service
Tsunami is an application library implemented in (well commented!) C.
1. use of TCP/UDP channel(s)
Tsunami uses a TCP channel for control packets and a UDP channel for
data packets sent from server to client. SO_SNDBUF
for the sender and SO_RCVBUF for the receiver are set for UDP with
setsockopt(). TCP_NODELAY and SO_REUSEADDR are set on both sides
for TCP.
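The socket setup described above can be sketched as follows. This is a minimal illustration, not Tsunami's own code: the helper names make_udp_socket() and make_tcp_socket() are hypothetical, and only the option names (SO_SNDBUF, SO_RCVBUF, TCP_NODELAY, SO_REUSEADDR) come from the text.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create the UDP data socket and request a large
 * send buffer (sender) or receive buffer (receiver).
 * Returns the descriptor, or -1 on error. */
int make_udp_socket(int is_sender, int buffer_bytes)
{
    assert(buffer_bytes > 0);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    int opt = is_sender ? SO_SNDBUF : SO_RCVBUF;
    if (setsockopt(fd, SOL_SOCKET, opt, &buffer_bytes, sizeof buffer_bytes) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: create the TCP control socket with address
 * reuse enabled and Nagle's algorithm disabled.
 * Returns the descriptor, or -1 on error. */
int make_tcp_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    int yes = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &yes, sizeof yes) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Note that the kernel may grant less buffer space than requested (many systems cap it at a configured maximum), so a 20MB request is a hint rather than a guarantee.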
tsunamid listens on a TCP port for client requests
and, upon receiving one, forks a child process to deal with the
connection. Command line options (shown with their defaults) for
tsunamid include: verbose (yes), transcript (no), ipv6 (no),
TCP port (46224), shared secret, size of datagram/block in
bytes (32768), and UDP buffer size (20MB).
tsunami, the client, connects to the server on the TCP
port (46224). The client calls fdopen() to convert its TCP
channel to a standard I/O stream and uses the standard fread
and fwrite calls to send and receive control information. The server
uses fcntl() to keep its TCP channel blocking during the up-front
negotiations for the file transfer, then switches it to non-blocking
so that it can be checked after each block is sent. The UDP blast is
interrupted only when a control message from the client is waiting.
After a TCP connection is established, the client acts on commands
typed in at the prompt, exchanging the needed information with the
server.
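The blocking/non-blocking switch on the control channel can be sketched as below. This is an illustration under assumptions, not the Tsunami source: set_nonblocking() and poll_control_byte() are hypothetical names, and the one-byte read stands in for reading a whole control packet.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch a descriptor between blocking and
 * non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd, int enable)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    flags = enable ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}

/* Hypothetical helper: check a non-blocking control channel.
 * Returns 1 if a byte was read into *out, 0 if nothing is waiting,
 * and -1 on error or end of stream. */
int poll_control_byte(int fd, unsigned char *out)
{
    ssize_t n = read(fd, out, 1);
    if (n == 1)
        return 1;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return -1;
}
```

A sender loop built this way would call set_nonblocking(fd, 1) once negotiation is finished and then call the polling helper between blocks, so an idle control channel costs one failed read() per block rather than a stall.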
The Tsunami user interface seems to be modeled on ftp with the
possible client commands being: connect, get, close, help, quit, and set.
The "set" command gives an opportunity to affect 12 parameters
(shown with the default values)--
server = localhost
port = 46224
buffer = 20000000
verbose = yes
transcript = no
ip = v4
output = line
rate = 1000000000
error = 7.50%
slowdown = 25/24
speedup = 5/6
history = 25%
datagram = 32768
Upon receiving a "get ", the client obtains a UDP socket and
sends the port number to the sender. Once the file is verified,
all data is sent through the UDP socket from sender to client.
The UDP checksum is used to ensure a block is transferred correctly.
2. rate-control algorithm
The beginning rate is set to DEFAULT_TARGET_RATE (1000000000, which
the calculation below treats as bits per second) unless the user
changes it with the SET command. A timed select() call is used to
implement the IBD (inter-block delay), which is expressed in
microseconds. The beginning IBD is calculated as:
    param->ipd_time = (u_int32_t) ((1000000LL * 8 * param->block_size) / param->target_rate);
    xfer->ipd_current = param->ipd_time * 3;
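As a worked example of the calculation above, the following sketch reproduces it with standalone functions. The function names are hypothetical; the arithmetic follows the two lines quoted from the source, with the rate taken as bits per second and the result in microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds needed to put one block on the wire at target_rate
 * (bits per second). Integer division truncates, as in the original. */
uint32_t ibd_for_rate_usec(uint32_t block_size, uint64_t target_rate)
{
    assert(target_rate > 0);
    return (uint32_t)((1000000ULL * 8 * block_size) / target_rate);
}

/* The transfer starts three times slower than the target rate and is
 * expected to speed up from there. */
uint32_t initial_ibd_usec(uint32_t block_size, uint64_t target_rate)
{
    return ibd_for_rate_usec(block_size, target_rate) * 3;
}
```

With the defaults (32768-byte blocks, target rate 1000000000), the per-block time is 262 microseconds and the starting IBD is 786 microseconds.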
When a REQUEST_ERROR_RATE message is received by the sender, the IBD
is recalculated using the reported error rate. If the reported error
rate is greater than the maximum tolerated error rate, the delay is
increased, slowing things down. Otherwise the delay is decreased,
speeding things up:
    if (retransmission->error_rate > param->error_rate) {
        double factor1 = (1.0 * param->slower_num / param->slower_den) - 1.0;
        double factor2 = (1.0 + retransmission->error_rate - param->error_rate)
                         / (100000.0 - param->error_rate);
        xfer->ipd_current *= 1.0 + (factor1 * factor2);
    } else
        xfer->ipd_current = (u_int32_t) (xfer->ipd_current *
                            (u_int64_t) param->faster_num / param->faster_den);
    /* make sure the IBD is still in range */
    xfer->ipd_current = max(min(xfer->ipd_current, 10000), param->ipd_time);
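The update rule above can be exercised in isolation with the sketch below. The struct and function names are hypothetical, and the IBD is held in a double for simplicity. One assumption is inferred from the constant 100000.0 rather than stated in the text: error rates are taken to be in thousandths of a percent, so the default 7.50% is written as 7500.

```c
#include <assert.h>

/* Hypothetical container for the tunables used by the update rule. */
struct rate_params {
    double   max_error;               /* tolerated error rate (assumed: 100000 = 100%) */
    unsigned slower_num, slower_den;  /* slowdown factor, e.g. 25/24 */
    unsigned faster_num, faster_den;  /* speedup factor, e.g. 5/6 */
    double   ibd_floor;               /* IBD at the target rate, in microseconds */
};

/* Return the new IBD (microseconds) given the current IBD and the
 * error rate reported by the receiver. */
double adjust_ibd(double ibd, double measured_error, const struct rate_params *p)
{
    assert(p->slower_den != 0 && p->faster_den != 0);
    if (measured_error > p->max_error) {
        /* Slow down: scale the slowdown step by how far the error
         * rate overshoots the tolerated maximum. */
        double factor1 = (double)p->slower_num / p->slower_den - 1.0;
        double factor2 = (1.0 + measured_error - p->max_error)
                         / (100000.0 - p->max_error);
        ibd *= 1.0 + factor1 * factor2;
    } else {
        /* Speed up by the fixed speedup fraction. */
        ibd = ibd * p->faster_num / p->faster_den;
    }
    /* Keep the IBD between the target-rate delay and 10000 microseconds. */
    if (ibd > 10000.0)
        ibd = 10000.0;
    if (ibd < p->ibd_floor)
        ibd = p->ibd_floor;
    return ibd;
}
```

Starting from 786 microseconds with the default 5/6 speedup, one error-free report lowers the IBD to 655 microseconds. A slowdown is gentler: the full 25/24 step is approached only as the reported error rate nears 100%.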
Block size is a key parameter since block_size datagrams are handed
off to UDP and then to IP. With a 1500-byte Ethernet MTU, the default
block size of 32K means IP will fragment a block into 23 or so
packets and send them all out before any delay is applied. The IBD is
implemented between blocks, not between packets, unless a block is
the size of a single packet.
A smaller block size means that the sending rate is adjusted more
often, keeping the transfer more attuned to the network, and the IBD
comes closer to being an IPD (inter-packet delay). On the other hand,
larger block sizes are more efficient when it comes to file I/O.
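The fragment count quoted above can be checked with a short calculation. This sketch assumes IPv4 with a 20-byte header and no options, counts only the 8-byte UDP header on top of the block, and ignores any per-block header the application adds. The function name is hypothetical.

```c
#include <assert.h>

/* Number of IPv4 fragments needed to carry one UDP datagram with the
 * given payload size over a link with the given MTU. */
unsigned ip_fragments(unsigned udp_payload, unsigned mtu)
{
    assert(mtu > 28);
    /* Data carried per fragment: MTU minus the 20-byte IP header,
     * rounded down to a multiple of 8 as fragmentation requires. */
    unsigned per_fragment = ((mtu - 20) / 8) * 8;
    /* The 8-byte UDP header is part of the data being fragmented. */
    unsigned total = udp_payload + 8;
    return (total + per_fragment - 1) / per_fragment;
}
```

For a 32768-byte block and a 1500-byte MTU this gives 23 fragments, while a 1472-byte block fits in a single packet, the case in which the IBD becomes a true inter-packet delay. Since losing any one fragment loses the whole datagram, large blocks also amplify the effect of packet loss.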
3. data sending algorithm
The sender's main functions are to build and send block-sized
datagrams and process control packets from the receiver. Data
is read directly from the file in block-sized segments into a buffer.
A block number and type are attached so the only thing to keep
straight is where the file read should begin for the next block.
The sending algorithm is:
  1. initialize
  2. get current time
  3. check the non-blocking TCP channel. If there is control data,
     read and process the control packet
     - REQUEST_RETRANSMIT -- Retransmit the given block then go to step 5
     - REQUEST_RESTART -- Restart the transfer at the given block then
       go to step 5
     - REQUEST_ERROR_RATE -- Use the given error rate to adjust the IBD,
       update and print statistics then go to step 5
     - REQUEST_STOP -- go to step 6
  4. build the next new datagram and send it
  5. delay until time to send the next block then go to step 2
  6. do ending stats and close down
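The "delay until time to send the next block" step can be sketched with a timed select() call on no descriptors, the mechanism the rate-control section names. The function names are hypothetical, and measuring the delay from the timestamp taken at the top of the loop is an assumption about how the two steps fit together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/select.h>
#include <sys/time.h>

/* Microseconds elapsed since *start. */
int64_t usec_since(const struct timeval *start)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (int64_t)(now.tv_sec - start->tv_sec) * 1000000
           + (now.tv_usec - start->tv_usec);
}

/* Sleep for whatever remains of the inter-block delay, measured from
 * the time the current loop iteration started. */
void wait_for_next_block(const struct timeval *block_start, int64_t ibd_usec)
{
    assert(ibd_usec >= 0);
    int64_t remaining = ibd_usec - usec_since(block_start);
    if (remaining <= 0)
        return;                 /* already late: send immediately */
    struct timeval tv;
    tv.tv_sec  = remaining / 1000000;
    tv.tv_usec = remaining % 1000000;
    /* With no descriptor sets, select() acts as a sub-second sleep. */
    select(0, NULL, NULL, NULL, &tv);
}
```

Subtracting the time already spent building and sending the block keeps the block-to-block spacing close to the IBD instead of adding the IBD on top of the send time. Timer granularity on a given system may still make very short delays longer than requested.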
4. data receiving algorithm
Until the transfer is complete, the client receives the datagrams
and periodically sends control packets to the sender. A thread is
created to handle disk I/O.
  1. initialize: allocate a retransmit table, received-data bitfield,
     and a ring_buffer; create a thread to do I/O.
  2. start timing
  3. reserve a slot in the ring buffer for the next datagram
  4. receive a datagram into the reserved slot
  5. signal the I/O thread that data is ready
  6. if the block number is greater than the expected block number, put
     the missing block numbers into the retransmit table
  7. if this is the last block
     - if we have gotten all the blocks, go to step 10
     - send a REQUEST_RETRANSMIT packet for any missing blocks
  8. if we have received a multiple of 50 blocks and a preset interval
     has passed since the last time statistics were updated:
     - if the retransmit queue is over MAX_RETRANSMISSION_BUFFER entries,
       send a REQUEST_RESTART packet beginning at the first missing block
       in the queue
     - otherwise send a REQUEST_RETRANSMIT packet for each missing block
     - update rate statistics and send along the smoothed retransmission
       rate in a REQUEST_ERROR_RATE packet
     - display the latest statistics information
     - reset statistics timer
  9. if there is more data, go to step 3
  10. call pthread_join for the disk I/O thread
  11. send a REQUEST_STOP packet to the server
  12. stop timing, display final results and close up
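The gap-detection part of the receiving algorithm (a block number higher than expected puts the skipped blocks into the retransmit table) can be sketched as below. The structure, names, and table sizes are hypothetical, not Tsunami's own data layout, and blocks are assumed to be numbered from 1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLOCKS  1024   /* hypothetical file size limit, in blocks */
#define MAX_MISSING 256    /* hypothetical retransmit table capacity */

struct receiver_state {
    uint8_t  received[(MAX_BLOCKS + 8) / 8];  /* one bit per block */
    uint32_t missing[MAX_MISSING];            /* retransmit table */
    uint32_t missing_count;
    uint32_t next_block;                      /* next block expected */
};

void receiver_init(struct receiver_state *r)
{
    memset(r, 0, sizeof *r);
    r->next_block = 1;
}

int already_received(const struct receiver_state *r, uint32_t block)
{
    return (r->received[block / 8] >> (block % 8)) & 1;
}

/* Record the arrival of a block. Returns the number of blocks newly
 * added to the retransmit table, or -1 if the table filled up (the
 * situation in which a restart request would be sent instead). */
int note_block(struct receiver_state *r, uint32_t block)
{
    assert(block >= 1 && block <= MAX_BLOCKS);
    int queued = 0;
    /* Everything between the expected block and this one was skipped. */
    for (uint32_t b = r->next_block; b < block; b++) {
        if (r->missing_count == MAX_MISSING)
            return -1;
        r->missing[r->missing_count++] = b;
        queued++;
    }
    r->received[block / 8] |= (uint8_t)(1u << (block % 8));
    if (block >= r->next_block)
        r->next_block = block + 1;
    return queued;
}
```

In this sketch a retransmitted block that arrives later only sets its bit in the bitfield; the table entry is left in place, so a caller would consult already_received() before asking for a block again.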
Graphs illustrating a representative transfer over NistNet using tsunami
may be found here.
5. unique features
***Authentication
TSUNAMI implements an authentication process between the server and
client via a shared secret.
***Accounting for data via blocks
As mentioned, data is read, written, transferred and accounted for in
block-sized chunks.
A block of data is handed off to UDP and thus IP may need to fragment
the data at the sender and reassemble it at the receiver.
When block sizes are large, retransmissions could be a BIG deal and
a REQUEST_RESTART could be a REALLY BIG deal! We discovered that
setting the command line datagram parameter at the sender left the
block size at 32K. We needed to alter the receiver code to accept a
datagram/block parameter in order to experiment with different block
sizes.
***The file is written out as blocks are received
In tests, this meant that the last block might already have been
written out, so 'ls -l' would list the correct number of bytes for
the file even though blocks were still missing because the transfer
had not completed. This happened at times when REQUEST_RESTART did
not function as designed and both client and server had to be
manually stopped.
***Number of user-set parameters
The tsunami client allows the user to set and tune up to 12
different parameters as noted above. These include
performance-specific parameters such as buffer size, target rate,
expected error rate, slowdown and speedup factors, and the percentage
of history used in the rate calculation.