We have discovered several bugs in the AIX TCP stack and one in the UDP stack.

BUG 1: (fall 2000) AIX stack bos.net.tcp.client.4.3.3.27

The behavior can be consistently reproduced. When configured with sack=1 and tcp_newreno=0, a data transfer from the AIX host (e.g., ttcp, ftp) to a non-SACK host over a lossy link will HANG. netstat -a shows the connection still ESTABLISHED with 16060 bytes in the send queue:

    tcp4 0 16060 stingray.ccs.orn.36252 cm-208-160-120-1.commp ESTABLISHED

tcpdump traces at both the receiver and the transmitter show that a packet drop has occurred and that the receiver is sending duplicate ACKs. The AIX transmitter does not retransmit after the 3rd dup ACK, nor does it ever time out and retransmit that packet. It just hangs forever! The failure happens every time there is packet loss, and we demonstrated it to different hosts over other lossy links. We also watched with SO_DEBUG, and AIX reports:

    ...
    675 ESTABLISHED:input 20d9a54b@11152cb3(win=7c00) -> ESTABLISHED
    678 ESTABLISHED:input 20d9a54b@11152cb3(win=7c00) -> ESTABLISHED
    980 ESTABLISHED:input 20d9a54b@11152cb3(win=7c00) -> ESTABLISHED
    175 ESTABLISHED:user SLOWTIMO -> ESTABLISHED
    685 ESTABLISHED:user SLOWTIMO -> ESTABLISHED
    195 ESTABLISHED:user SLOWTIMO -> ESTABLISHED
    ....

With sack=0 on the AIX host, the same test proceeds "normally": there are retransmissions after the 3rd dup ACK (and some timeouts if there are multiple drops within a window, i.e., normal "reno" behavior). Also, if the target host is SACK-capable, the transfer follows normal behavior: SACK acks and retransmits, and no timeouts.

In January 2001, IBM provided patches to fix the problem.

BUG 2: (fall 2000) AIX stack bos.net.tcp.client.4.3.3.27

When configured with tcp_newreno=1 and sack=0, a data transfer from the AIX host (e.g., ttcp, ftp) over a lossy link does not do fast retransmit. This is a more subtle problem, resulting in lower throughput.
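
For reference, the sender behavior that BUG 1 and BUG 2 deviate from can be sketched roughly as follows. This is a simplified illustration of standard Reno-style fast retransmit, not AIX code: the class name RenoSender and its fields are hypothetical, and the sketch ignores sequence-number wraparound, window updates, and congestion-window changes.

```python
from dataclasses import dataclass, field

DUP_ACK_THRESHOLD = 3  # conventional fast-retransmit trigger


@dataclass
class RenoSender:
    """Toy sender (hypothetical) tracking only what decides a retransmit."""
    snd_una: int = 0   # oldest unacknowledged sequence number
    dup_acks: int = 0  # consecutive duplicate ACKs seen for snd_una
    retransmitted: list = field(default_factory=list)

    def on_ack(self, ack: int) -> None:
        if ack > self.snd_una:
            # New data acknowledged: advance and reset the duplicate counter.
            self.snd_una = ack
            self.dup_acks = 0
        elif ack == self.snd_una:
            self.dup_acks += 1
            if self.dup_acks == DUP_ACK_THRESHOLD:
                # Fast retransmit: resend the missing segment immediately
                # instead of waiting for the retransmission timer.
                self.retransmitted.append(self.snd_una)

    def on_retransmit_timeout(self) -> None:
        # Fallback path: the retransmission timer fired, so resend anyway.
        self.retransmitted.append(self.snd_una)
        self.dup_acks = 0


sender = RenoSender(snd_una=1000)
for ack in (1000, 1000, 1000):
    sender.on_ack(ack)
print(sender.retransmitted)  # [1000]
```

In these terms, the BUG 1 sender appears to take neither the on_ack retransmit path nor the timeout path, while the BUG 2 sender reaches only the timeout path.
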
tcpdump traces at both the receiver and the transmitter show that when a packet drop occurs and the AIX box receives duplicate ACKs, it does NOT retransmit after the 3rd dup ACK; instead, it eventually times out and then retransmits. The behavior is the same if sack=1 and the receiver is not SACK-capable.

In January 2001, IBM provided patches to fix the problem.

BUG 3: (summer 2001)

In our TCP-over-UDP client/server, an AIX client will, in the middle of the UDP flows, send an "ICMP port unreachable", causing the remote server to fail. The AIX client continues to run, doing "timeout retransmits", and the port is still there. So it is a transient (race?) condition, seen so far only on the AIX GigE interface. Actually, the AIX server (probesrv) will also send unexpected ICMPs. The client shouldn't really stop (because neither side has a "connected" UDP port), but Linux 2.2 gets "connection refused"; this is fixed in the 2.4 kernel.
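
One way a UDP endpoint could tolerate such a spurious ICMP is to treat "connection refused" like a timeout and simply retransmit. The following is a minimal sketch under that assumption, not the code of our client or of probesrv; the function recv_reply and its parameters sock, resend, and max_tries are hypothetical names introduced only for illustration.

```python
import socket


def recv_reply(sock, resend, max_tries=5):
    """Wait for a datagram, retransmitting on timeout or ICMP-induced error.

    sock   -- any object with a recv(bufsize) method (hypothetical)
    resend -- zero-argument callable that retransmits the pending request
    Returns the datagram, or None if every attempt failed.
    """
    for _ in range(max_tries):
        try:
            return sock.recv(65535)
        except (socket.timeout, ConnectionRefusedError):
            # A timeout, or an ICMP port unreachable reported on this
            # socket: assume it is transient and retransmit.
            resend()
    return None
```

Bounding the number of attempts keeps a genuinely dead peer from causing an endless retransmit loop.
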