ORNL Cray XT3 Evaluation
.... this is work in progress .... last revised
April 1, 2005.
ORNL is evaluating the
Cray XT3 (Red Storm),
an Opteron-based cluster interconnected
with HyperTransport and Cray's SeaStar interconnect chip.
Revision history:
4/1/05 HALO shmem mpi updates
3/17/05 hpl on 16 processors, some hpcc tests
3/8/05 HALO mpi plots (latency still 28 us) barrier, broadcast, allreduce, parbench comms1 works now
2/24/05 faster streams numbers with AMD tips for pgf77 pathscale
1/24/05 NPB mpi CG FT
1/22/05 SHMEM
1/21/05 exchange
1/20/05 testing with alpha software, buggy, latency 32 us, bw 1.1 GBs
alpha software may be using IP for message passing!
single cpu numbers: dgemm, euroben, streams
1/4/05 1 cabinet XT3 to ORNL
Test environment
Package/Version Build Date
------------------ ----------
xt-boot-1.0-77 20041204
xt-libc-1.0-58 20041204
xt-mpt-1.0-76 20041204
xt-service-1.0-65 20041204
xt-prgenv-1.0-61 20041208
xt-pe-1.0-64 20041204
xt-os-1.0-72 20041204
xt-tests-0.0-57 20041204
xt-libsci-1.0-56 20041204
xt-catamount-1.15-68 20041204
pgi-5.2.4-0 20041005
acml-2.1-1 20041007
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
An IBM SP4, an IBM Winterhawk II (noted as SP3 in the following tables
and graphs), a Cray X1, a Cray XD1,
an SGI Altix (Itanium 2), an Opteron cluster using a Quadrics interconnect,
and a Compaq Alpha ES40 at ORNL were used for comparison with the
XT3.
The initial XT3 configuration at ORNL consists of a single cabinet with
80 processors.
The ORNL XD1 test cluster consists of 34 2-way SMP
Opteron nodes (2.2 GHz) with 4 GB of memory each.
The X1 at ORNL has 8 nodes. Each node has 4 MSPs, each MSP has 4 SSPs,
and each SSP has two vector units.
The Cray "processor/CPU" in the results below is one MSP.
All 4 MSP's on a node share memory.
The Power4 consists of one node with 32 processors
sharing memory.
Both the Alpha and SP3 consist of four processors sharing memory
on a single node.
All memory is accessible on the Altix.
The following table summarizes the main characteristics of
the machines:
Specs:         Alpha SC  SP3      SP4       X1         XD1      XT3      Altix
MHz            667       375      1300      800        2200     2400     1500
memory/node    2GB       2GB      32GB      16GB       4GB      1GB      512GB
L1             64K       64K      32K       16K        64K      64K      32K
L2             8MB       8MB      1.5MB     2MB        1MB      1MB      256K
L3                                128MB                                  6MB
peak Mflops    2*MHz     4*MHz    4*MHz     12.8       2*MHz    2*MHz    4*MHz
peak mem BW    5.2GBs    1.6GBs   200+ GBs  200+ GBs?  6.4GBs   6.4GBs   6.4GBs
alpha 2 buses @ 2.6 GBs each
X1 memory bandwidth is 34 GB/s/CPU.
For the Alpha, nodes are interconnected with a
Quadrics switch
organized as a fat tree.
The SP nodes are interconnected with cross-bar switches
in an Omega-like network.
The X1 uses a modified 2-D torus and the XT3 uses a 3-D torus.
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of the XT3.
Some of the older benchmarks may need to be modified for these
newer, faster machines -- increasing repetitions to avoid zero elapsed times
and increasing problem sizes to test out-of-cache performance.
Unless otherwise noted, the following compiler switches were used:
XD1: -O3 (pgf90 v5.2-2)
X1: -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries: sci (X1),
cxml (Alpha), acml (AMD Opteron),
and essl (SP).
We used the following benchmarks in our tests:
- ParkBench 2.1 --
provides low-level sequential and communication benchmarks,
parallel linear algebra benchmarks,
NAS parallel benchmarks,
and compact application codes.
Codes are in FORTRAN.
ParkBench results are often reported as a least-squares fit of the data;
we report the actual measured performance.
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalue, FFT, QR).
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
We also use the
MAPS memory benchmark.
The web sites include results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For both the Alpha and the SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
Both have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits, so it rolls over in less than 7 seconds.
For distributed benchmarks (MPI), the IBM and Alpha systems provide a hardware
synchronized MPI_Wtime() with microsecond resolution.
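For the other custom tests we use a simple gettimeofday()-based wall-clock
timer; a minimal sketch in C (our own illustration, not the exact benchmark
harness) follows.

#include <stdio.h>
#include <sys/time.h>

/* wall-clock time in seconds, built on gettimeofday()
   (microsecond resolution where the OS supports it) */
static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

int main(void)
{
    double t0, t1;

    t0 = walltime();
    /* ... code being timed ... */
    t1 = walltime();
    printf("elapsed %g seconds\n", t1 - t0);
    return 0;
}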
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
The aggregate data rate for multiple threads is reported
in the following table.
Recall that the "peak" memory data rate is 200 GBs for the
X1, 5.2 GBs for the Alpha,
and 1.6 GBs for the SP3.
Data for the 16-way SP3 (375 Mhz, Nighthawk II)
is included too.
Data for the Alpha ES45 (1 GHz) is obtained from the
streams data base.
Data for p690/sp4 is with affinity enabled (6/1/02).
The X1 uses (aprun -A).
The Opteron is supposed to have more than 6.4 GB/s of memory bandwidth per CPU,
but we do not yet see that with stream_d.c.
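For reference, the heart of the stream benchmark is a set of simple vector
loops; the sketch below is a minimal triad-only version (not the official
stream_d.c, and the array size here is only illustrative).

#include <stdio.h>
#include <sys/time.h>

#define N 2000000               /* large enough to be out of cache */

static double a[N], b[N], c[N];

static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

int main(void)
{
    long i;
    double t;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t = walltime();
    for (i = 0; i < N; i++)     /* triad: a = b + q*c */
        a[i] = b[i] + 3.0 * c[i];
    t = walltime() - t;

    /* 3 doubles (24 bytes) moved per iteration */
    printf("triad %.1f MBs (check %g)\n", 24.0 * N / t / 1.0e6, a[N / 2]);
    return 0;
}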
Revised 2/24/05
MBs
copy scale add triad
xt3 2639 2527 2652 2788 pgf77 just O3 stream_d.f
xt3 3930 3791 4211 4308 pgf77 -O3 -mp -Mnontemporal -fastsse -Munsafe_par_align
xt3 3459 3747 3968 3968 reported by Sandia 2/23/05
xt3 4924 4928 4868 4830 reported by AMD 2.4 GHz 2/8/05 pathscale
xt3 4935 4934 5053 5069 pathscale -O3 -LNO:prefetch=2 -LNO:prefetch_ahead=9 -LNO:fusion=2 -CG:use_prefetchnta=on
xd1 2427 2369 2577 2598 pgcc -O3 pgi/5.2-4
xd1 3134 3169 3213 3217 PathScale compiler (from STREAMS site)
xd1 4098 4091 4221 4049 pathscale (see above switches)
altix 3214 3169 3800 3809
X1 22111 21634 23658 23752
alpha1 1339 1265 1273 1383
es45-1 1946 1941 1978 1978
SP3 1 523 561 581 583
SP3/16-1 486 494 601 601
SP4-1 1774 1860 2098 2119 p690
p655 3649 3767 2899 2913 ibm p655 11/3/04
power5 5356 5138 4000 4039 ibm power5 1.656 GHz 12/15/04
From AMD's published spec benchmark and McCalpin's suggested
conversion of 171.swim results to triad memory bandwidth, we get 2.7 GB/s
memory bandwidth for one Opteron processor.
The
MAPS benchmark also characterizes memory access performance.
Plotted are load/store bandwidth for sequential (stride 1) and random
access.
Load bandwidth is measured with s = s + x(i)*y(i) and store bandwidth with
x(i) = s.
Revised 12/2/04
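A rough sketch of this kind of measurement (our simplified version, not the
MAPS source) is below: load bandwidth from the dot-product loop, store
bandwidth from writing a scalar, with an index array supplying the
random-access pattern.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N 1000000

static double x[N], y[N];
static int idx[N];

static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

int main(void)
{
    int i;
    double s = 0.0, t;

    for (i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
        idx[i] = rand() % N;            /* random access pattern */
    }

    t = walltime();                     /* load: s = s + x(i)*y(i), stride 1 */
    for (i = 0; i < N; i++) s += x[i] * y[i];
    t = walltime() - t;
    printf("sequential load  %.1f MBs\n", 16.0 * N / t / 1.0e6);

    t = walltime();                     /* load, random access (x and y bytes only) */
    for (i = 0; i < N; i++) s += x[idx[i]] * y[idx[i]];
    t = walltime() - t;
    printf("random load      %.1f MBs\n", 16.0 * N / t / 1.0e6);

    t = walltime();                     /* store: x(i) = s, stride 1 */
    for (i = 0; i < N; i++) x[i] = s;
    t = walltime() - t;
    printf("sequential store %.1f MBs (check %g)\n", 8.0 * N / t / 1.0e6, s);
    return 0;
}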
The hint
benchmark measures computation and memory efficiency as
the problem size increases.
(This is C hint version 1, 1994.)
The following graph shows the performance of a single processor
for the XD1 (164 MQUIPS), X1 (12.2 MQUIPS),
Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS).
The L1 and L2 cache boundaries are visible, as well as the Altix and
SP4's L3.
Revised 1/20/05
Here are results from LMbench, a test of various OS
services.
We do not have data for the XT3 as it runs a micro-kernel on the compute
nodes.
LOW LEVEL BENCHMARKS (single processor)
The following table compares the performance of the XT3, Alpha, SP4, X1,
XD1, and Altix for basic CPU operations.
These numbers are the average Mflop/s from the first few kernels of EuroBen's mod1ac.
The 14th kernel (9th degree poly)
is a rough estimate of peak FORTRAN performance since it
has a high re-use of operands.
(Revised 1/20/05)
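The operand re-use in that kernel is easy to see; a ninth-degree polynomial
evaluated by Horner's rule (a sketch, not the EuroBen source) does 18 flops
per element while touching only one input array and one output array.

/* 9th-degree polynomial by Horner's rule: 9 multiplies and 9 adds
   per element with a single load/store pair, so the kernel is
   compute-bound rather than memory-bound */
void poly9(int n, const double *x, double *y, const double *c)
{
    int i, k;
    double p;

    for (i = 0; i < n; i++) {
        p = c[9];
        for (k = 8; k >= 0; k--)
            p = p * x[i] + c[k];
        y[i] = p;
    }
}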
alpha xt3 sp4 X1 xd1 Altix
broadcast 516 1186 1946 2483 1100 2553
copy 324 795 991 2101 733 1758
addition 285 794 942 1957 725 1271
subtraction 288 794 968 1946 720 1307
multiply 287 794 935 2041 726 1310
division 55 299 90 608 275 213
dotproduct 609 796 2059 3459 730 724
X=X+aY 526 1188 1622 4134 1088 2707
Z=X+aY 477 1192 1938 3833 1092 2632
y=x1x2+x3x4 433 1772 2215 3713 1629 2407
1st ord rec. 110 197 215 48 181 142
2nd ord rec. 136 266 268 46 257 206
2nd diff 633 1791 1780 4960 1636 2963
9th deg. poly 701 1723 2729 10411 1576 5967
basic operations (Mflops) euroben mod1ac
The following table compares the performance of various intrinsics
(EuroBen mod1f).
For the SP, it also shows the effect of -O4 optimization versus -O3.
(Revised 1/20/05)
xt3 sp3 -O4 sp3 -O3 sp4 -O4 X1 xd1 altix
x**y 8.1 1.8 1.6 7.1 49 7.4 13.2
sin 36.6 34.8 8.9 64.1 97.9 33.7 22.9
cos 24.3 21.4 7.1 39.6 71.4 21.8 22.9
sqrt 99.2 52.1 34.1 93.9 711 91.8 107
exp 29.4 30.7 5.7 64.3 355 26.8 137
log 42.5 30.8 5.2 59.8 185 39.2 88.5
tan 29.3 18.9 5.5 35.7 85.4 27.0 21.1
asin 21.6 10.4 10.2 26.6 107 23.4 29.2
sinh 19.8 2.3 2.3 19.5 82.6 18.2 19.1
intrinsics (Mcalls/s) euroben mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix (REAL*8 400x400) multiply compared with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lsci for the X1,
-lessl for the SP).
Note, the SP4 -lessl (3.3) is tuned for the Power4.
The Mflops for 1000x1000 Linpack are also reported
from netlib,
except the SP4 number, which is from
IBM.
(Revised 10/29/04)
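The "ftn" row comes from a plain triple loop and the "lib" row from DGEMM;
the C sketch below shows the two versions side by side (assuming a
CBLAS-style interface to the vendor BLAS -- the actual runs call DGEMM from
FORTRAN, and vendor headers vary).

#include <cblas.h>      /* assumed CBLAS interface; not all vendor libraries ship it */

#define N 400

static double a[N][N], b[N][N], c[N][N];

/* plain triple loop, the "ftn" row in the table */
void matmul_naive(void)
{
    int i, j, k;
    double s;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            s = 0.0;
            for (k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

/* the same operation through the tuned BLAS, the "lib" row */
void matmul_blas(void)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, &a[0][0], N, &b[0][0], N,
                0.0, &c[0][0], N);
}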
alpha sp3 sp4 X1 xd1 altix
ftn 72 45 220 7562 147 228
lib 1182 1321 3174 9482 3773 5222
linpack 1031 1236 2894 3955
The following plot compares the performance of
the scientific library DGEMM.
Sandia reports 4.26 Gflops for DGEMM on their XT3.
(Revised 1/20/05).
The following graph compares the vendor library implementation of
an LU factorization (DGETRF) using
partial pivoting with row interchanges.
Revised 1/20/05
The following graph shows there is little performance difference between
libgoto
and AMD's -lacml library.
Revised 1/20/05
The following plot compares the DAXPY performance of the Opteron and Itanium
(Altix) using vendor math libraries.
Revised 10/29/04
The following table compares the single
processor performance (Mflops) of the
XT3, Altix, SP4, X1, and XD1 for the Euroben mod2g,
a 2-D Haar wavelet transform test.
(Revised 1/20/05)
|--------------------------------------------------------------------------
| Order | xt3 | altix | SP4 | X1 | xd1 |
| n1 | n2 | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s)|(Mflop/s)|
|--------------------------------------------------------------------------
| 16 | 16 | 471.05 | 150.4 | 126.42 | 10.5 | 364 |
| 32 | 16 | 535.68 | 192.1 | 251.93 | 13.8 | 466 |
| 32 | 32 | 611.52 | 262.3 | 301.15 | 20.0 | 468 |
| 64 | 32 | 600.22 | 252.7 | 297.26 | 22.7 | 498 |
| 64 | 64 | 529.22 | 242.5 | 278.45 | 25.9 | 351 |
| 128 | 64 | 461.31 | 295.6 | 251.90 | 33.3 | 285 |
| 128 | 128 | 362.02 | 350.2 | 244.45 | 48.5 | 198 |
| 256 | 128 | 268.67 | 211.2 | 179.43 | 45.8 | 130 |
| 256 | 256 | 172.92 | 133.3 | 103.52 | 46.7 | 138 |
| 512 | 256 | 130.86 | 168.7 | 78.435 | 52.1 | 118 |
|--------------------------------------------------------------------------
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library.
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
using the BLAS from the vendor library.
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
(Revised 10/29/04)
The following figure shows the FORTRAN Mflops for one processor for various
problem sizes
for the EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache effects are still apparent.
(Revised 10/29/04).
The following compares a 1-D FFT using the
FFTW benchmark.
The following graph plots 1-D FFT performance using the vendor
library (-lacml, -lscs, -lsci, or -lessl); initialization time is not included.
Revised 10/29/04
MESSAGE-PASSING BENCHMARKS
The XT3 processors are interconnected in a 3-D torus using HyperTransport
and Cray's SeaStar interconnect chip.
The peak bidirectional bandwidth of an XT3 link is 7.6 GBs, with 4 GBs sustained.
Internode communication can be accomplished with IP, PVM, or MPI.
We report MPI performance over the Alpha Quadrics network and the
IBM SP.
Each SP node (4 CPUs) shares a single network interface.
However, each CPU is a unique MPI end point, so one can measure
both inter-node and intra-node communication.
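The latency and bandwidth numbers come from a simple echo (ping-pong) test;
a minimal MPI sketch of the measurement is below (8-byte messages for
latency; the bandwidth number repeats the same loop with 1 MB messages, and
our actual harness sweeps message sizes).

#include <stdio.h>
#include <mpi.h>

#define REPS 1000

int main(int argc, char **argv)
{
    static char buf[1 << 20];           /* big enough for the 1 MB bandwidth case */
    int rank, i;
    double t;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* echo test between ranks 0 and 1: one-way latency = round trip / 2 */
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t = MPI_Wtime() - t;
    if (rank == 0)
        printf("latency %.1f us (1-way)\n", 0.5e6 * t / REPS);

    /* for bandwidth, rerun the loop with 1 MB messages and report
       2 * messagesize * REPS / elapsed for the echo */
    MPI_Finalize();
    return 0;
}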
The following table summarizes the measured communication characteristics
between nodes
of the X1, Alpha, SP3, and the SP4.
The SP4 is currently based on the Colony switch attached via PCI.
The XD1 uses Infiniband hardware with an interface to the Opteron HyperTransport.
Latency is for an 8-byte message.
(Revised 11/26/03)
alpha altix sp4 X1 XD1
latency (1 way, us) 5.4 1.1 17 7.3 1.5 (X1 3.8 SHMEM, 3.9 coarray)
bandwidth (echo, MBs) 199 1955 174 12125 1335
MPI within a node 622 1968 2186 966
latency (min, 1 way, us) and bandwidth (MBs)
-- latency and bandwidth (min 1-way us, MBs)
XT3 29.4 1136 ? alpha software?
XD1 1.5 1335
XD1 in SMP 1.5 962
X1 node 7.3 11776
X1 MSP 7.3 12125
altix cpu 1.1 1968
alitx node 1.1 1955
alpha node 5.5 198
alpha cpu 5.8 623
alpha IP-sw 123 77
alpha IP-gigE/1500 76 44
alpha IP-100E 70 11
sp3 node 16.3 139
sp3 cpu 8.1 512
sp4 node 7 975 (Federation)
sp4 node 6 1702 (Federation dual rail)
sp4 node 17 174 (PCI/Colony)
sp4 cpu 3 2186
sp3 IP-sw 82 46
sp3 IP-gigE/1500 91 47
sp3 IP-gigE/9000 136 84
sp3 IP-100E 93 12
Early bandwidth results for MPI and SHMEM on the XT3 (alpha software).
The following graph shows bandwidth for communication between two
processors using MPI
from both EuroBen's mod1h and ParkBench comms1.
(ParkBench running on XT3 now, 3/8/05.)
(Revised 1/24/05).
The
HALO benchmark is a synthetic benchmark that simulates the nearest-neighbor
exchange of a 1-2 row/column "halo" from a 2-D array. This is a common
operation when using domain decomposition to parallelize (say) a finite
difference ocean model. There are no actual 2-D arrays used; instead,
the copying of data from an array to a local buffer is simulated, and this
buffer is transferred between nodes.
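A minimal sketch of one such exchange step with plain MPI_Sendrecv is below
(a 1-D ring of ranks with left/right neighbors is assumed; the real benchmark
compares several MPI, SHMEM, and co-array variants).

#include <stdio.h>
#include <mpi.h>

#define N 128                   /* words per halo buffer */

int main(int argc, char **argv)
{
    double sendl[N], sendr[N], recvl[N], recvr[N];
    int rank, np, left, right, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    left  = (rank + np - 1) % np;       /* neighbors on a 1-D ring */
    right = (rank + 1) % np;

    for (i = 0; i < N; i++) sendl[i] = sendr[i] = (double)rank;

    /* exchange halo buffers with both neighbors */
    MPI_Sendrecv(sendr, N, MPI_DOUBLE, right, 0,
                 recvl, N, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);
    MPI_Sendrecv(sendl, N, MPI_DOUBLE, left,  1,
                 recvr, N, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, &status);

    if (rank == 0)
        printf("halo word from left neighbor: %g\n", recvl[0]);
    MPI_Finalize();
    return 0;
}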
For comparison, we have included the HALO results for the X1
and ORNL's SP4
in the following table from Wallcraft ('98).
(Revised 1/22/05)
LATENCY (us)
MACHINE CPUs METHOD N=2 N=128
Cray XT3 16 MPI 150 150 ?alpha software
Cray XD1 16 MPI 13 37
Cray X1 16 co-array 36 31
IBM SP4 16 MPI 27 32
SGI Altix 16 SHMEM 14 40
Cray X1 16 OpenMP 13 13 (SSP)
Cray X1 16 SHMEM 35 47
SGI Altix 16 OpenMP 15 48
Cray T3E-900 16 SHMEM 20 68
SGI Altix 16 MPI 19 72
SUN E10000 16 OpenMP 24 102
Cray X1 16 MPI 91 116
SGI O2K 16 SHMEM 36 113
SGI O2K 16 OpenMP 33 119
IBM SP4 16 OpenMP 58 126
HP SPP2000 16 MPI 88 209
IBM SP 16 MPI 137 222
SGI O2K 16 MPI 145 247
The HALO benchmark also compares various algorithms within
a given paradigm.
The following compares the performance using various MPI
methods on 16 processors for different problem sizes.
Revised 3/8/05
The following graph compares MPI
for the HALO exchange on 4 and 16 processors.
For smaller message sizes, the XD1 is the best performer.
It is interesting that the X1 times are much higher than its 8-byte
message latency.
Revised 4/1/05
Until XT3 MPI latency is reduced,
we will not bother with the following MPI tests. ...
The following table shows the performance of aggregate communication
operations (barrier, broadcast, sum-reduction) using one processor
per node (N) and all processors on each node (n).
Recall that the sp4 has 32 processors per node (sp3 and alpha, 4 per node).
Communication is between MSPs on the X1, except for the UPC data.
Times are in microseconds.
(Revised 3/8/05)
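The measurements are simple repetition loops around the collective call; a
minimal sketch for the barrier and sum-reduction cases (the broadcast timing
is done the same way) follows.

#include <stdio.h>
#include <mpi.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, i;
    double t, val = 1.0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* average time per MPI_Barrier over REPS repetitions */
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime() - t;
    if (rank == 0)
        printf("barrier   %.1f us\n", 1.0e6 * t / REPS);

    /* same pattern for a doubleword sum-reduction */
    t = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Allreduce(&val, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t = MPI_Wtime() - t;
    if (rank == 0)
        printf("allreduce %.1f us\n", 1.0e6 * t / REPS);

    MPI_Finalize();
    return 0;
}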
mpibarrier (average us) X1
cpus alpha-N alpha-n sp4-n xd1 mpi shmem coarray upc XT3
2 7 11 3 2 3 3.0 3.2 6.1 39
4 7 16 5 4 3 3.2 3.4 7.1 79
8 8 18 7 6 5 4.8 4.9 8.5 119
16 9 21 9 10 6 5.6 5.8 7.3 159
32 11 28 10 16 5 6.3 6.6 11.0 199
64 37 68 5 7.1 7.2 12.1 239
128 6 10.0 9.9
300 9 19.9 24.3
504 10 19.0 17.7
mpibcast (8 bytes) X1
cpus alpha-N alpha-n sp4-n xd1 mpi shmem coarray upc XT3
2 9.6 12.5 3.2 1.2 5.9 1.4 .3 0 14.4
4 10.4 20.3 6.2 0.7 7.2 4.1 .8 0.5 29.1
8 11.4 28.5 8.4 1.2 10.5 10.0 1.2 1.0 43.5
16 12.5 32.9 9.8 17.9 16.3 20.4 1.9 1.2 58.2
32 13.8 41.4 11.3 20.7 27.5 41.6 4.0 1.5 72.4
64 48.7 89.0 48.1 83 7.9 2.7 86.7
mpireduce (SUM, doubleword)
cpus alpha-N alpha-n sp3-N sp3-n sp4-n X1 XD1 XT3
2 9 11 8 9 6 8 0.7 15.5
4 190 207 29 133 9 11 30.5 crash?
8 623 350 271 484 13 15 68.8
16 1117 604 683 1132 18 19 108
32 3176 1991 1613 2193 29 23 215
64 5921 2841 3449 31 389
mpiallreduce (SUM, doubleword)
cpus altix X1 XD1 XT3
2 5.7 11.4 2.4 35
4 14.5 16.3 4.3 70
8 22 24.9 6.7 104
16 30.3 31.6 10.8 151
32 39.3 45.5 16.6 191
48 47.4 53.2 49.1
64 58.9 66.1 68.4 233
96 80.0
A simple bisection bandwidth test has N/2 processors sending 1 MB messages
to the other N/2.
(Revised 1/20/05).
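A sketch of the bisection test: the lower half of the ranks each send 1 MB
messages to a partner in the upper half, and the aggregate rate is the total
bytes moved divided by the slowest rank's time (a simplified version of our
harness).

#include <stdio.h>
#include <mpi.h>

#define NBYTES (1 << 20)        /* 1 MB messages */
#define REPS 10

int main(int argc, char **argv)
{
    static char buf[NBYTES];
    int rank, np, half, partner, i;
    double t, tmax;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    half    = np / 2;
    partner = (rank < half) ? rank + half : rank - half;

    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank < half)
            MPI_Send(buf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD, &status);
    }
    t = MPI_Wtime() - t;

    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("aggregate %.0f MBs\n",
               (double)half * REPS * NBYTES / tmax / 1.0e6);
    MPI_Finalize();
    return 0;
}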
Aggregate datarate (MBs)
cpus sp4 xt3 X1 Altix XD1
2 138 1099 12412 1074 963 X1 half populated cabinets 11/26/03
4 276 2162 16245 1150 1341
8 552 4392 15872 1304 2663
16 1040 8648 32626 2608 2319
32 3510 16736 29516 2608 4606
48 25944 35505 5064 3458
64 33961 55553 5120 6954
96 44222 7632
128 59292 10170
200 139536
252 168107
256 49664 X1 full cabinets 6/2/04
300 68595
400 120060
500 158350
504 167832
The following compares the aggregate MPI bandwidth for processor pairs doing an exchange, where node i exchanges with node i+n/2.
Revised 1/22/05
Preliminary testing of TCP/IP performance over the local LAN
showed that the XT3 GigE interfaces could run TCP at 898 Mbs.
Wide-area performance will be limited by the default TCP window size of 256 KB,
but the system manager can alter this.
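An application can also request larger socket buffers itself (up to the
system maximum) through the standard sockets API; a hedged sketch is below,
and the 4 MB value in the comment is only an example.

#include <sys/socket.h>

/* request larger TCP send/receive buffers on an open socket;
   the kernel may clamp the request to its configured maximum */
int set_tcp_window(int sock, int bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    return 0;
}

/* example: set_tcp_window(sock, 4 * 1024 * 1024); before connect() or listen() */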
Preliminary SHMEM tests on the XT3 yield an 8-byte latency of 38 us
and 1.1 GBs bandwidth (alpha-software performance).
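The SHMEM latency test follows the same pattern as the MPI echo test; a
minimal sketch using the SHMEM put interface is below (start_pes/shmem_my_pe
and a symmetric static buffer are assumed; this is an illustration, not the
exact test code).

#include <stdio.h>
#include <sys/time.h>
#include <shmem.h>              /* mpp/shmem.h on some Cray systems */

#define REPS 1000

static double src[1], dst[1];   /* symmetric (static) data */

static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

int main(void)
{
    int me, i;
    double t;

    start_pes(0);
    me = shmem_my_pe();

    shmem_barrier_all();
    t = walltime();
    if (me == 0) {
        for (i = 0; i < REPS; i++) {
            shmem_double_put(dst, src, 1, 1);   /* 8-byte put to PE 1 */
            shmem_quiet();                      /* wait for completion */
        }
    }
    t = walltime() - t;
    shmem_barrier_all();

    if (me == 0)
        printf("put latency %.1f us\n", 1.0e6 * t / REPS);
    return 0;
}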
HALO performance on 16 processors is illustrated in the next plot.
Revised 4/1/05
PARALLEL KERNEL BENCHMARKS
The following graph shows the aggregate Mflops for a conjugate gradient
(CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP.
Revised 1/24/05
The following plots the MPI performance of NPB 2.3 FT with a blocksize of
64 words.
The following plots the aggregate Mflop performance for ParkBench QR factorization (MPI) of 1000x1000 double precision matrix using the vendor scientific libraries (essl/cxml/sci).
This benchmark uses BLACS (SCALAPACK).
The small problem size results in small vectors and poor X1 performance.
The XD1 used mpicc and mpif77 for building the QR kernel.
Revised 12/2/04
The following plot shows the performance of high-performance
Linpack (HPL) on 16
processors for the Cray X1 and IBM p690 with MPI and the vendor BLAS.
HPL solves a (random) dense linear system in double precision (64 bits)
using:
- two-dimensional block-cyclic data distribution
- right-looking variant of the LU factorization with row partial pivoting, featuring multiple look-ahead depths
- recursive panel factorization with pivot search and column broadcast combined
- various virtual panel broadcast topologies
- a bandwidth-reducing swap-broadcast algorithm
- backward substitution with look-ahead of depth 1
Cray has reported 90% of peak on the X1 using SHMEM instead of MPI.
Revised 3/17/05
HPCC results (see hpcc) for 64 processors.
XT3 results with N=10000, revised 3/17/05.
             HPL      PTRANS   GUPS    triad   Bndwth   latency
             tflops   GBs      Gup/s   GBs     GBs      us
Cray X1      0.522    12.4     0.01    30      0.99     20.2
Cray XD1     0.224    10.6     0.02    2.7     0.22     1.63
Cray XT3     0.138    12.1     0.04    2.6     1.14     29.6
Sandia (2/23/05) reports HPCC numbers on 552 nodes of 1.4 Tflops for HPL (55% of peak)
and PTRANS at 49.6 GBs.
Also see our low-level results for the
Cray X1,
Cray XD1,
SGI Altix,
IBM p690,
Opteron cluster, and
Cray XT3.
Other links of interest:
- PSC news release
- yod
- Opteron architecture
- NWCHEM DFT performance
- AMD Opteron benchmarks
- AMD's ACML library (or here)
- Opteron library libgoto
- Opteron bios and kernel developer's guide
- papi
- hpet high precision timers
- rdtsc timers
- dclock
- the processor for Sandia's Red Storm
Research Sponsors
Mathematical, Information, and Computational Sciences Division,
within the
Office of Advanced Scientific Computing Research of the Office of Science,
Department of Energy.
The application-specific evaluations are also supported by the sponsors of
the individual applications research areas.
Last Modified
thd@ornl.gov
back to Tom Dunigan's page
or the ORNL Evaluation of Early Systems page