ORNL Cray XD1 Evaluation
This is a work in progress; see the revision history below for the most recent updates.
ORNL is evaluating the
Cray XD1
(formerly OctigaBay), an Opteron-based cluster interconnected
with Infiniband.
Revision history:
2/24/05 faster streams numbers with AMD tips for pgf77
12/3/04 libgoto vs -lacml for dgemm/lu, NPB CG,FT with pg compile
12/2/04 recompiles and re-test with pgi/5.2-4, parkbench QR, MAPS
11/24/04 gpshmem
11/23/04 exchange on 64, halo update
11/16/04 64 processor tests
11/8/04 mpi clock synch tests -- not synchronized
11/4/04 6 chassis tests
11/2/04 MPI tests
11/1/04 ParkBench, euroben-dm, hpl
10/29/04 initial XD1 tests (3 chassis, 34 CPUs)
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
An IBM SP4, an IBM Winterhawk II (noted as SP3 in the following tables
and graphs), a Cray X1, an SGI Altix (Itanium 2), an Opteron cluster using
Quadrics, and a Compaq Alpha ES40 at ORNL were used for comparison with the
XD1.
The present ORNL XD1 test cluster consists of 18 two-way SMP
Opteron nodes (2.2 GHz) with 4 GB of memory each.
One of the SMPs is partitioned as the compile node, leaving 34 processors
for MPI testing.
The X1 at ORNL has 8 nodes. Each node has 4 MSPs, each MSP has 4 SSPs,
and each SSP has two vector units.
The Cray "processor/CPU" in the results below is one MSP.
All 4 MSPs on a node share memory.
The Power4 consists of one node with 32 processors
sharing memory.
Both the Alpha and SP3 consist of four processors sharing memory
on a single node.
All memory is accessible on the Altix.
The following table summarizes the main characteristics of
the machines:
Specs:          Alpha SC   SP3        SP4        X1          XD1        Altix
MHz             667        375        1300       800         2200       1500
memory/node     2 GB       2 GB       32 GB      16 GB       4 GB       512 GB
L1 cache        64 KB      64 KB      32 KB      16 KB       64 KB      32 KB
L2 cache        8 MB       8 MB       1.5 MB     2 MB        1 MB       256 KB
L3 cache        --         --         128 MB     --          --         6 MB
peak Mflops     2*MHz      4*MHz      4*MHz      12.8 Gflops 2*MHz      4*MHz
peak mem BW     5.2 GB/s   1.6 GB/s   200+ GB/s  200+ GB/s   6.4 GB/s   6.4 GB/s

alpha: 2 memory buses @ 2.6 GB/s each
X1 memory bandwidth is 34 GB/s/CPU.
For the Alpha, nodes are interconnected with a
Quadrics switch
organized as a fat tree.
The SP nodes are interconnected with cross-bar switches
in an Omega-like network.
The X1 uses a modified 2-D torus.
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of the XD1.
Some of the older benchmarks may need to be modified for these
newer, faster machines -- increasing repetitions to avoid zero elapsed times
and increasing problem sizes to test out-of-cache performance.
Unless otherwise noted, the following compiler switches
were used:
XD1: -O3 (pgf90 v5.2-2)
X1: -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries: sci (X1),
cxml (Alpha), acml (Opteron),
and essl (SP).
We used the following benchmarks in our tests:
- ParkBench 2.1 --
provides low-level sequential and communication benchmarks,
parallel linear algebra benchmarks,
NAS parallel benchmarks,
and compact application codes.
Here is a summary of the benchmark modules.
Codes are in FORTRAN.
Results are often reported as a least-squares fit of the data;
we report actual performance numbers.
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalue, FFT, QR).
Here is a summary of the benchmark modules.
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
Also we use the
MAPS memory benchmark.
The web sites include results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For both the Alpha and the SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
Both have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits wide, so it rolls over in less than 7 seconds.
For distributed benchmarks (MPI), the IBM and Alpha systems provide a hardware
synchronized MPI_Wtime() with microsecond resolution.
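For reference, the wall-clock timing in the custom benchmarks is of the
following general form; a minimal C sketch using gettimeofday() (the helper
name and repetition count here are ours):

#include <stdio.h>
#include <sys/time.h>

/* wall-clock seconds from gettimeofday() -- microsecond resolution
   where the OS supports it (e.g., MICROTIME on the Alpha kernel) */
static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    double t0, t1, sum = 0.0;
    int i, reps = 1000000;

    t0 = walltime();
    for (i = 0; i < reps; i++)     /* enough repetitions to avoid */
        sum += (double)i * 0.5;    /* a zero elapsed time         */
    t1 = walltime();

    printf("elapsed %g s (sum %g)\n", t1 - t0, sum);
    return 0;
}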
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
The aggregate data rate for multiple threads is reported
in the following table.
Recall that the "peak" memory data rate for the
X1 is 200 GB/s, for the Alpha 5.2 GB/s,
and for the SP3 1.6 GB/s.
Data for the 16-way SP3 (375 MHz, Nighthawk II)
is included too.
Data for the Alpha ES45 (1 GHz) is obtained from the
streams data base.
Data for p690/sp4 is with affinity enabled (6/1/02).
The X1 uses (aprun -A).
The Opteron is supposed to have more than 6.4 GB/s of memory bandwidth per
CPU, but we do not yet see that with stream_d.c.
MBs
copy scale add triad
xd1 2427 2369 2577 2598 pgcc -O3 pgi/5.2-4
xd1 2577 2491 2662 2641 pgf77 -O3 stream_d.f
xd1 2808 2793 3192 3186 pgf77 -O3 -mp -Mnontemporal -fastsse -Munsafe_par_align -o stream_d
xd1-2 7494 7473 8644 8553 2 threads/cpu's
xd1 3134 3169 3213 3217 PathScale compiler (from STREAMS site)
xd1 4098 4091 4221 4049 PathScale -O3 -LNO:prefetch=2 -LNO:prefetch_ahead=9 -LNO:fusion=2 -CG:use_prefetchnta=on
xt3 2639 2527 2652 2788 pgf77 just O3 stream_d.f (2.4 GHz)
xt3 3930 3791 4211 4308 pgf77 -O3 -mp -Mnontemporal -fastsse -Munsafe_par_align
xt3 3459 3747 3968 3968 reported by Sandia 2/23/05
xt3 4924 4928 4868 4830 reported by AMD 2.4 GHz 2/8/05 pathscale
xt3 4935 4934 5053 5069 pathscale -O3 -LNO:prefetch=2 -LNO:prefetch_ahead=9 -LNO:fusion=2 -CG:use_prefetchnta=on
altix 3214 3169 3800 3809
X1 22111 21634 23658 23752
alpha1 1339 1265 1273 1383
es45-1 1946 1941 1978 1978
SP3 1 523 561 581 583
SP3/16-1 486 494 601 601
SP4-1 1774 1860 2098 2119 p690
p655 3649 3767 2899 2913 ibm p655 11/3/04
From AMD's published SPEC benchmark results and McCalpin's suggested
conversion of 171.swim results to triad memory bandwidth, we get 2.7 GB/s
memory bandwidth for one Opteron processor.
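For reference, the four kernels measured by stream have the following form
(a minimal C sketch of copy/scale/add/triad, not the instrumented stream_d
source; the array size here is illustrative):

#include <stdio.h>
#define N 2000000            /* large enough to be well out of cache */

static double a[N], b[N], c[N];

int main(void)
{
    long j;
    double scalar = 3.0;

    for (j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    /* copy and scale move 2 words per iteration; add and triad move 3 */
    for (j = 0; j < N; j++) c[j] = a[j];                 /* copy  */
    for (j = 0; j < N; j++) b[j] = scalar * c[j];        /* scale */
    for (j = 0; j < N; j++) c[j] = a[j] + b[j];          /* add   */
    for (j = 0; j < N; j++) a[j] = b[j] + scalar * c[j]; /* triad */

    printf("%g %g %g\n", a[0], b[0], c[0]);  /* keep results live */
    return 0;
}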
The
MAPS benchmark also characterizes memory access performance.
Plotted are load/store bandwidth for sequential (stride 1) and random
access.
Load is calculated from s = s + x(i)*y(i) and store from
x(i) = s.
Revised 12/2/04
The tabletoy benchmark (C) makes random writes of 64-bit integers in
a shared-memory table;
parallelization is permitted, with possibly non-coherent updates.
The X1 number is for vectorizing the inner loop (multistreaming
was an order of magnitude slower, 88 MB/s).
Data rate in the following table is for a 268 MB table.
We include multi-threaded Altix, Opteron, SP3 (NERSC), and SP4 data as well.
Revised 11/22/04
MB/s (using wall-clock time)
 CPUs    sp4   altix   X1-MSP   opteron   sp3
   1      26      42     1190        44     8
   2      47      45        -        67    26
   4      98      62        -         -    53
   8     174      86        -         -    90
  16     266      69        -         -   139
  32     322      77        -         -     -
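A minimal serial C sketch of the tabletoy access pattern -- random
read-modify-write of 64-bit words in a large table (the random-number
generator and update count here are illustrative; the real benchmark
parallelizes the update loop):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define TABLE_WORDS (1UL << 25)       /* 2^25 * 8 bytes = 268 MB table */
#define NUPDATES    (4 * TABLE_WORDS)

int main(void)
{
    uint64_t *table = malloc(TABLE_WORDS * sizeof(uint64_t));
    uint64_t ran = 1, i;

    if (table == NULL) return 1;
    for (i = 0; i < TABLE_WORDS; i++) table[i] = i;

    for (i = 0; i < NUPDATES; i++) {
        /* simple 64-bit LCG drives the random index */
        ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
        table[ran % TABLE_WORDS] ^= ran;   /* random read-modify-write */
    }

    printf("table[0] = %llu\n", (unsigned long long)table[0]);
    free(table);
    return 0;
}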
The hint
benchmark measures computation and memory efficiency as
the problem size increases.
(This is C hint version 1, 1994.)
The following graph shows the performance of a single processor
for the XD1 (164 MQUIPS), X1 (12.2 MQUIPS),
Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS).
The L1 and L2 cache boundaries are visible, as well as the Altix and
SP4's L3.
Revised 10/29/04
Here are results from LMbench, a test of various OS
services.
LOW LEVEL BENCHMARKS (single processor)
The following table compares the performance of the Alpha, SP3, SP4, X1,
XD1, and Altix for basic CPU operations.
These numbers are the average Mflop/s from the first few kernels of EuroBen's mod1ac.
The 14th kernel (9th degree poly)
is a rough estimate of peak FORTRAN performance since it
has a high re-use of operands.
(Revised 10/29/04)
alpha sp3 sp4 X1 xd1 Altix
broadcast 516 368 1946 2483 1100 2553
copy 324 295 991 2101 733 1758
addition 285 186 942 1957 725 1271
subtraction 288 166 968 1946 720 1307
multiply 287 166 935 2041 726 1310
division 55 64 90 608 275 213
dotproduct 609 655 2059 3459 730 724
X=X+aY 526 497 1622 4134 1088 2707
Z=X+aY 477 331 1938 3833 1092 2632
y=x1x2+x3x4 433 371 2215 3713 1629 2407
1st ord rec. 110 107 215 48 181 142
2nd ord rec. 136 61 268 46 257 206
2nd diff 633 743 1780 4960 1636 2963
9th deg. poly 701 709 2729 10411 1576 5967
basic operations (Mflops) euroben mod1ac
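The 9th-degree polynomial kernel is essentially Horner's rule applied
element by element, which keeps the coefficients in registers; a minimal
C sketch (coefficients and problem size here are illustrative):

#include <stdio.h>

/* 9th-degree polynomial by Horner's rule: 9 multiplies and 9 adds per
   element, with the coefficients re-used from registers on every pass */
int main(void)
{
    static double x[10000], y[10000];
    double c[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};  /* illustrative */
    int i, n = 10000;

    for (i = 0; i < n; i++) x[i] = 0.001 * i;

    for (i = 0; i < n; i++) {
        double xi = x[i];
        y[i] = ((((((((c[9]*xi + c[8])*xi + c[7])*xi + c[6])*xi + c[5])*xi
                 + c[4])*xi + c[3])*xi + c[2])*xi + c[1])*xi + c[0];
    }

    printf("y[n-1] = %g\n", y[n-1]);
    return 0;
}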
The following table compares the performance of various intrinsics
(EuroBen mod1f).
For the SP, it also shows the effect of -O4 optimization versus -O3.
(Revised 10/29/04)
alpha sp3 -O4 sp3 -O3 sp4 -O4 X1 xd1 altix
x**y 8.3 1.8 1.6 7.1 49 7.4 13.2
sin 13 34.8 8.9 64.1 97.9 33.7 22.9
cos 12.8 21.4 7.1 39.6 71.4 21.8 22.9
sqrt 45.7 52.1 34.1 93.9 711 91.8 107
exp 15.8 30.7 5.7 64.3 355 26.8 137
log 15.1 30.8 5.2 59.8 185 39.2 88.5
tan 9.9 18.9 5.5 35.7 85.4 27.0 21.1
asin 13.3 10.4 10.2 26.6 107 23.4 29.2
sinh 10.7 2.3 2.3 19.5 82.6 18.2 19.1
intrinsics (Mcalls/s) euroben mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix multiply (REAL*8, 400x400) with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lsci for the X1,
-lessl for the SP).
Note, the SP4 -lessl (3.3) is tuned for the Power4.
Also reported are the Mflops for the 1000x1000 Linpack benchmark,
taken from netlib,
except that the SP4 number is from
IBM.
(Revised 10/29/04)
alpha sp3 sp4 X1 xd1 altix
ftn 72 45 220 7562 147 228
lib 1182 1321 3174 9482 3773 5222
linpack 1031 1236 2894 3955
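For reference, the "ftn" and "lib" rows differ only in whether the triple
loop is compiled directly or replaced by a library call. A minimal C sketch
of the two cases, assuming the conventional Fortran BLAS dgemm_ binding
(link against -lacml, -lessl, -lcxml, or -lsci as appropriate):

#include <stdio.h>

/* Fortran BLAS DGEMM: C := alpha*op(A)*op(B) + beta*C (column major) */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

#define N 400

int main(void)
{
    static double a[N*N], b[N*N], c[N*N], d[N*N];
    double alpha = 1.0, beta = 0.0;
    int i, j, k, n = N;

    for (i = 0; i < N*N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = d[i] = 0.0; }

    /* simple triple loop (the "ftn" case, in spirit), column-major order */
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            for (i = 0; i < N; i++)
                c[i + j*N] += a[i + k*N] * b[k + j*N];

    /* vendor library version (the "lib" case) */
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, d, &n);

    printf("c[0]=%g d[0]=%g\n", c[0], d[0]);
    return 0;
}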
The following plot compares the performance of
the scientific library DGEMM.
(Revised 10/29/04).
The following graph compares the vendor library implementation of
an LU factorization (DGETRF) using
partial pivoting with row interchanges.
Revised 11/1/04
The following graph shows there is little performance difference between
libgoto
and AMD's -lacml library.
Revised 12/3/04
The following plot compares the DAXPY performance of the Opteron and Itanium
(Altix) using vendor math libraries.
Revised 10/29/04
The following table compares the single
processor performance (Mflops) of the
Alpha, Altix, SP4, X1, and XD1 for the EuroBen mod2g,
a 2-D Haar wavelet transform test.
(Revised 10/29/04)
|--------------------------------------------------------------------------
| Order | alpha | altix | SP4 | X1 | xd1 |
| n1 | n2 | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s)|(Mflop/s)|
|--------------------------------------------------------------------------
| 16 | 16 | 142.56 | 150.4 | 126.42 | 10.5 | 364 |
| 32 | 16 | 166.61 | 192.1 | 251.93 | 13.8 | 466 |
| 32 | 32 | 208.06 | 262.3 | 301.15 | 20.0 | 468 |
| 64 | 32 | 146.16 | 252.7 | 297.26 | 22.7 | 498 |
| 64 | 64 | 111.46 | 242.5 | 278.45 | 25.9 | 351 |
| 128 | 64 | 114.93 | 295.6 | 251.90 | 33.3 | 285 |
| 128 | 128 | 104.46 | 350.2 | 244.45 | 48.5 | 198 |
| 256 | 128 | 86.869 | 211.2 | 179.43 | 45.8 | 130 |
| 256 | 256 | 71.033 | 133.3 | 103.52 | 46.7 | 138 |
| 512 | 256 | 65.295 | 168.7 | 78.435 | 52.1 | 118 |
|--------------------------------------------------------------------------
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library.
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
using the BLAS from the vendor library.
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
(Revised 10/29/04)
The following figure shows the FORTRAN Mflops for one processor for various
problem sizes
for the EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache effects are still apparent.
(Revised 10/29/04).
The following compares a 1-D FFT using the
FFTW benchmark.
The following graph plots 1-D FFT performance using the vendor
library (-lacml, -lscs, -lsci, or -lessl); initialization time is not included.
Revised 10/29/04
MESSAGE-PASSING BENCHMARKS
Internode communication can be accomplished with IP, PVM, or MPI.
We report MPI performance over the Alpha Quadrics network and the
IBM SP.
Each SP node (4 CPUs) shares a single network interface.
However, each CPU is a unique MPI endpoint, so one can measure
both inter-node and intra-node communication.
The following table summarizes the measured communication characteristics
between nodes
of the X1, Alpha, SP3, and the SP4.
The SP4 interconnect is currently the Colony switch attached via PCI.
The XD1 uses Infiniband hardware (2 GB/s links) with an interface to the Opteron HyperTransport.
Latency is for an 8-byte message.
(Revised 11/26/03)
alpha sp3 sp4 X1 XD1
latency (1 way, us) 5.4 16.3 17 7.3 1.5 (X1 3.8 SHMEM, 3.9 coarray)
bandwidth (echo, MBs) 199 139 174 12125 1335
MPI within a node 622 512 2186 966
latency (min, 1 way, us) and bandwidth (MBs)
                latency (min 1-way, us)   bandwidth (MBs)
XD1 1.5 1335
XD1 in SMP 1.5 962
X1 node 7.3 11776
X1 MSP 7.3 12125
altix cpu 1.1 1968
altix node 1.1 1955
alpha node 5.5 198
alpha cpu 5.8 623
alpha IP-sw 123 77
alpha IP-gigE/1500 76 44
alpha IP-100E 70 11
sp3 node 16.3 139
sp3 cpu 8.1 512
sp4 node 7 975 (Federation)
sp4 node 6 1702 (Federation dual rail)
sp4 node 17 174 (PCI/Colony)
sp4 cpu 3 2186
sp3 IP-sw 82 46
sp3 IP-gigE/1500 91 47
sp3 IP-gigE/9000 136 84
sp3 IP-100E 93 12
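The latency and echo-bandwidth numbers are measured with simple echo
(ping-pong) tests of the following general form; a minimal MPI sketch
(message size and repetition count are illustrative; run with at least
two ranks):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define REPS   1000
#define NBYTES 8              /* 8 bytes for latency; 1024*1024 for bandwidth */

int main(int argc, char **argv)
{
    int rank, i;
    static char buf[NBYTES];
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, NBYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {          /* rank 0 sends, rank 1 echoes it back */
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / REPS;   /* average round-trip time */
        printf("%d bytes: one-way %g us, echo bandwidth %g MB/s\n",
               NBYTES, 0.5 * rtt * 1.0e6, 2.0 * NBYTES / rtt / 1.0e6);
    }
    MPI_Finalize();
    return 0;
}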
The following graph shows bandwidth for communication between two
processors on the same node using MPI
from both EuroBen's mod1h and ParkBench comms1.
Within a node, shared memory can be used by MPI.
(Revised 11/1/04).
Interestingly, on the XD1, MPI is slower within an SMP than across the Infiniband.
The following graph shows the minimum MPI latency (one-way, i.e., half of the RTT) for an 8-byte message from CPU 0 to the other CPUs for the Altix, Cray X1, and XD1.
Revised 3/31/05 (new altix data).
The
HALO benchmark is a synthetic benchmark that simulates the nearest neighbour
exchange of a 1-2 row/column "halo" from a 2-D array. This is a common
operation when using domain decomposition to parallelize (say) a finite
difference ocean model. There are no actual 2-D arrays used; instead
the copying of data from an array to a local buffer is simulated, and this
buffer is transferred between nodes.
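A minimal MPI sketch of the kind of exchange HALO simulates -- here
simplified to a 1-D ring in which each rank swaps a fixed-size buffer with
both of its neighbours (the buffer size is illustrative; cf. the N=2 and
N=128 columns in the table below):

#include <stdio.h>
#include <mpi.h>

#define N 128                 /* halo words per neighbour */

int main(int argc, char **argv)
{
    int rank, np, up, down, i;
    double sendbuf[2][N], recvbuf[2][N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    up   = (rank + 1) % np;          /* periodic neighbours */
    down = (rank - 1 + np) % np;

    for (i = 0; i < N; i++) sendbuf[0][i] = sendbuf[1][i] = (double)rank;

    /* exchange "halo" buffers with both neighbours */
    MPI_Sendrecv(sendbuf[0], N, MPI_DOUBLE, up,   0,
                 recvbuf[0], N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &status);
    MPI_Sendrecv(sendbuf[1], N, MPI_DOUBLE, down, 1,
                 recvbuf[1], N, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &status);

    if (rank == 0)
        printf("got %g from down, %g from up\n", recvbuf[0][0], recvbuf[1][0]);
    MPI_Finalize();
    return 0;
}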
For comparison, we have included the Halo results for the X1
and ORNL's SP4
in the following table from Wallcraft ('98).
(Revised 10/29/04)
LATENCY (us)
MACHINE CPUs METHOD N=2 N=128
Cray XD1 16 MPI 13 37
Cray X1 16 co-array 36 31
IBM SP4 16 MPI 27 32
SGI Altix 16 SHMEM 14 40
Cray X1 16 OpenMP 13 13 (SSP)
Cray X1 16 SHMEM 35 47
SGI Altix 16 OpenMP 15 48
Cray T3E-900 16 SHMEM 20 68
SGI Altix 16 MPI 19 72
SUN E10000 16 OpenMP 24 102
Cray X1 16 MPI 91 116
SGI O2K 16 SHMEM 36 113
SGI O2K 16 OpenMP 33 119
IBM SP4 16 OpenMP 58 126
HP SPP2000 16 MPI 88 209
IBM SP 16 MPI 137 222
SGI O2K 16 MPI 145 247
The Halo benchmark also compares various algorithms within
a given paradigm.
The following compares the performance using various MPI
methods on 16 MSPs for different problem sizes.
Two of the methods failed with MPI errors on the XD1.
Revised 11/24/04
The following graph compares MPI
for the HALO exchange on 4 and 16 processors.
For smaller message sizes, the XD1 is the best performer.
It is interesting that the X1 times are much higher than its 8-byte
message latency.
Revised 10/29/04
The following table shows the performance of aggregate communication
operations (barrier, broadcast, sum-reduction) using one processor
per node (N) and all processors on each node (n).
Recall that the SP4 has 32 processors per node (SP3 and Alpha, 4 per node).
Communication is between MSPs on the X1, except for the UPC data.
Times are in microseconds.
(Revised 11/16/04)
mpibarrier (average us) X1
cpus alpha-N alpha-n sp3-N sp3-n sp4-n xd1 mpi shmem coarray upc
2 7 11 22 10 3 2 3 3.0 3.2 6.1
4 7 16 45 20 5 4 3 3.2 3.4 7.1
8 8 18 69 157 7 6 5 4.8 4.9 8.5
16 9 21 93 230 9 10 6 5.6 5.8 7.3
32 11 28 118 329 10 16 5 6.3 6.6 11.0
64 37 145 419 68 5 7.1 7.2 12.1
128 6 10.0 9.9
300 9 19.9 24.3
504 10 19.0 17.7
mpibcast (8 bytes) X1
cpus alpha-N alpha-n sp3-N sp3-n sp4-n xd1 mpi shmem coarray upc
2 9.6 12.5 5.4 6.7 3.2 1.2 5.9 1.4 .3 0
4 10.4 20.3 9.4 9.4 6.2 0.7 7.2 4.1 .8 0.5
8 11.4 28.5 13.4 17.5 8.4 1.2 10.5 10.0 1.2 1.0
16 12.5 32.9 17.0 20.9 9.8 17.9 16.3 20.4 1.9 1.2
32 13.8 41.4 19.3 24.1 11.3 20.7 27.5 41.6 4.0 1.5
64 48.7 23.6 30.8 89.0 48.1 83 7.9 2.7
mpireduce (SUM, doubleword)
cpus alpha-N alpha-n sp3-N sp3-n sp4-n X1 XD1
2 9 11 8 9 6 8 0.7
4 190 207 29 133 9 11 30.5
8 623 350 271 484 13 15 68.8
16 1117 604 683 1132 18 19 108
32 3176 1991 1613 2193 29 23 215
64 5921 2841 3449 31 389
mpiallreduce (SUM, doubleword)
cpus altix X1 XD1
2 5.7 11.4 2.4
4 14.5 16.3 4.3
8 22 24.9 6.7
16 30.3 31.6 10.8
32 39.3 45.5 16.6
48 47.4 53.2 49.1
64 58.9 66.1 68.4
96 80.0
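A minimal MPI sketch of how such aggregate operations can be timed
(the repetition count is illustrative; the actual test harness differs
in detail):

#include <stdio.h>
#include <mpi.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, i;
    double t0, t1, x = 1.0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0) printf("barrier   %g us\n", (t1 - t0) / REPS * 1e6);

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0) printf("bcast     %g us\n", (t1 - t0) / REPS * 1e6);

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0) printf("allreduce %g us\n", (t1 - t0) / REPS * 1e6);

    MPI_Finalize();
    return 0;
}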
A simple bisection bandwidth test has N/2 processors sending 1 MB messages
to the other N/2.
(Revised 11/16/04).
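A minimal MPI sketch of the bisection pattern, where each rank in the lower
half sends 1 MB messages to its partner in the upper half and the aggregate
rate is the total bytes moved divided by the elapsed time (the repetition
count is illustrative; assumes an even number of ranks):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define NBYTES (1024*1024)      /* 1 MB messages */
#define REPS   100

int main(int argc, char **argv)
{
    int rank, np, partner, i;
    static char buf[NBYTES];
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    memset(buf, 0, NBYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank < np / 2) {                 /* lower half sends ...    */
            partner = rank + np / 2;
            MPI_Send(buf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD);
        } else {                             /* ... upper half receives */
            partner = rank - np / 2;
            MPI_Recv(buf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD, &status);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("aggregate %g MB/s\n",
               (double)(np / 2) * REPS * NBYTES / (t1 - t0) / 1.0e6);
    MPI_Finalize();
    return 0;
}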
Aggregate datarate (MBs)
cpus sp4 alpha X1 Altix XD1
2 138 195 12412 1074 963 X1 half populated cabinets 11/26/03
4 276 388 16245 1150 1341
8 552 752 15872 1304 2663
16 1040 1400 32626 2608 2319
32 3510 29516 2608 4606
48 35505 5064 3458
64 55553 5120 6954
96 44222 7632
128 59292 10170
200 139536
252 168107
256 49664 X1 full cabinets 6/2/04
300 68595
400 120060
500 158350
504 167832
The following compares the aggregate MPI bandwidth for processor pairs doing an exchange, where node i exchanges with node i+n/2.
Revised 11/23/04
Preliminary testing of TCP/IP performance over the local LAN
showed that the XD1 GigE interfaces could run TCP at 817 Mb/s.
Wide-area performance will be limited by the default window size of 256 KB,
but the system manager can alter this.
Since there are only two Opteron processors per SMP, we will not bother
with the various threading benchmarks.
SHMEM support for the XD1 is provided via
GPShmem.
A symmetric heap will not be supported until release 1.3.
GPShmem uses MPI for collective operations, but not for puts and
gets.
Asynchronous operations are implemented on top of
ARMCI, which implements
them directly using the native I/O.
One-way latency for gpshmem for an 8-byte message is about 4.8 us, slower
than MPI at this time.
The following graph compares gpshmem bandwidth with MPI between two XD1
nodes.
Revised 11/24/04
UPC and Co-Array Fortran are not yet supported.
PARALLEL KERNEL BENCHMARKS
The following graph shows the aggregate Mflops for a conjugate gradient
(CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP.
Revised 12/3/04
The following plots the MPI performance of NPB 2.3 FT with a blocksize of
64 words.
The following plots the aggregate Mflop performance for ParkBench QR factorization (MPI) of a 1000x1000 double-precision matrix using the vendor scientific libraries (essl/cxml/sci).
This benchmark uses BLACS (SCALAPACK).
The small problem size results in small vectors and poor X1 performance.
The XD1 used mpicc and mpif77 for building the QR kernel.
Revised 12/2/04
The following plot shows the performance of high-performance
Linpack (HPL) on 16
processors for the Cray X1 and IBM p690 with MPI and the vendor BLAS.
HPL solves a (random) dense linear system in double precision (64 bits)
using:
- two-dimensional block-cyclic data distribution
- a right-looking variant of the LU factorization with row partial pivoting
  featuring multiple look-ahead depths
- recursive panel factorization with pivot search and column broadcast combined
- various virtual panel broadcast topologies
- a bandwidth-reducing swap-broadcast algorithm
- backward substitution with look-ahead of depth 1
Cray has reported 90% of peak on the X1 using SHMEM instead of MPI.
Revised 11/1/04
Also see our low-level results for the Cray X1, SGI Altix, IBM p690, and
Opteron cluster.
Related links:
- Cray XD1 and more XD1 configuration info
- GPShmem and ARMCI
- Opteron architecture
- NWCHEM DFT performance
- AMD Opteron benchmarks
- AMD's ACML library and the Opteron libgoto library
- Opteron BIOS and kernel developer's guide
- HPCC XD1 results
- PAPI, HPET high-precision timers, rdtsc timers, and dclock
- the Opteron processor for Sandia's Red Storm
- OSC's RDMA performance on the XD1
Research Sponsors
Mathematical, Information, and Computational Sciences Division,
within the
Office of Advanced Scientific Computing Research of the Office of Science,
Department of Energy.
The application-specific evaluations are also supported by the sponsors of
the individual applications research areas.
thd@ornl.gov