Dunigan's Opteron Testing

ORNL Opteron Evaluation (Dunigan)

.... this is work in progress.... last revised

The results presented here are from standard benchmarks and some custom benchmarks and, as such, represent only one part of the evaluation. An IBM SP4, IBM Winterhawk II (noted as SP3 in the following tables and graphs), Cray X1, SGI Altix (Itanium 2), and Compaq Alpha ES40 at ORNL were used for comparison with the Opteron. The results below are in the following categories:

architecture -- configuration summaries
benchmarks -- benchmark descriptions
memory performance -- stream, MAPS, hint, and LMbench results
low-level results -- base CPU and intrinsics results
shared-memory results -- single node/thread performance
message passing results -- latency/bandwidth, collectives
kernels -- application kernel results

ARCHITECTURE

The Opteron system has 2 CPUs and 2 GB of memory. The X1 at ORNL has 8 nodes. Each node has 4 MSPs, each MSP has 4 SSPs, and each SSP has two vector units. The Cray "processor/CPU" in the results below is one MSP. All 4 MSPs on a node share memory. The Power4 consists of one node with 16 processors (2 MCMs) sharing memory. Both the Alpha and SP3 consist of four processors sharing memory on a single node. The following table summarizes the main characteristics of the machines.

Specs:         Alpha SC   SP3        SP4         X1            Opteron    Altix
MHz            667        375        1300        800           1600       1300
memory/node    2GB        2GB        32GB        16GB          2GB        512GB
L1             64K        64K        32K         16K           64K        32K
L2             8MB        8MB        1.5MB       2MB           1MB        256K
L3                                   128MB                                3MB
peak Mflops    2*MHz      4*MHz      4*MHz       12.8 Gflops   2*MHz      4*MHz
peak mem BW    5.2 GB/s   1.6 GB/s   200+ GB/s   ?200+ GB/s    5.3 GB/s   6.4 GB/s

Alpha: 2 buses @ 2.6 GB/s each. X1 memory bandwidth is 34 GB/s/CPU.

For the Alpha, nodes are interconnected with a Quadrics switch organized as a fat tree. The SP nodes are interconnected with cross-bar switches in an Omega-like network. The X1 uses a modified 2-D torus.
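The "peak Mflops" row in the table above is a clock rate times a flops-per-cycle multiplier. The following Python sketch works that arithmetic out for the machines whose entry is given as a multiplier. The dictionary and function names are illustrative only and are not part of any benchmark suite.

```python
# Clock rate (MHz) and flops-per-cycle multiplier, copied from the specs table.
# The X1 is omitted because its table entry is an absolute figure, not a multiplier.
SPECS = {
    "Alpha":   (667, 2),
    "SP3":     (375, 4),
    "SP4":     (1300, 4),
    "Opteron": (1600, 2),
    "Altix":   (1300, 4),
}


def peak_mflops(mhz, flops_per_cycle):
    """Theoretical peak in Mflops: cycles per microsecond times flops per cycle."""
    return mhz * flops_per_cycle


if __name__ == "__main__":
    for name, (mhz, mult) in SPECS.items():
        print(f"{name:8s} {peak_mflops(mhz, mult):6d} Mflops peak")
```

Under this reading the Opteron peak is 3200 Mflops per processor, which is the number the measured Mflops in the later tables should be compared against.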

BENCHMARKS

We have used widely available benchmarks in combination with our own custom benchmarks to characterize the performance of the Opteron. Some of the older benchmarks may need to be modified for these newer, faster machines: increasing repetitions to avoid zero elapsed times, and increasing problem sizes to test out-of-cache performance. Unless otherwise noted, the following compiler switches were used.

opteron: -O3 (pgf90)
X1:      -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha:   -O4 -fast -arch ev6
SP:      -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000

Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP. We also compared performance with the vendor runtime libraries, sci (X1), cxml (Alpha), and essl (SP). We used the following benchmarks in our tests:

ParkBench 2.1 -- provides low-level sequential and communication benchmarks, parallel linear algebra benchmarks, NAS parallel benchmarks, and compact application codes. Here is a summary of the benchmark modules. Codes are in FORTRAN. Results are often reported as least-squares fit of data. We report actual performance numbers.
EuroBen 3.9 -- provides serial benchmarks for low-level performance and application kernels (linear algebra, eigenvalue, FFT, QR). Here is a summary of the benchmark modules. euroben-dm provides some communication and parallel (MPI) benchmarks. The web site includes results from other systems.
lmbench -- provides insight into OS (UNIX) performance and memory latencies. The web site includes results from other systems.
stream -- measures memory bandwidth for both serial and parallel configurations. We also use the MAPS memory benchmark. The web sites include results from other systems.
Custom low-level benchmarks that we have used over the years in evaluating memory and communication performance.

For both the Alpha and the SP, gettimeofday() provides microsecond wall-clock time (though one has to be sure the MICROTIME option is set in the Alpha OS kernel). Both have high-resolution cycle counters as well, but the Alpha cycle counter is only 32 bits wide, so it rolls over in less than 7 seconds. For distributed benchmarks (MPI), the IBM and Alpha systems provide a hardware-synchronized MPI_Wtime() with microsecond resolution. On the Alpha, MPI_Wtime is frequency synchronized, but initial offsets are only approximate. (On the Alpha, it appears MPI_Init tries to provide an initial zero offset to the Elan counters on each node when an MPI job starts. On the SP3, we discovered several nodes that were not synchronized; a patch was eventually provided.) Time is not synchronized on the X1.
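The rollover figure for the Alpha cycle counter follows from the counter width and the clock rate. A minimal Python sketch of that arithmetic, assuming the 667 MHz Alpha clock from the architecture table; the function names are illustrative, and the wrap-safe difference is a common technique rather than anything taken from the benchmark codes:

```python
def rollover_seconds(counter_bits, clock_hz):
    """Seconds until a free-running cycle counter of the given width wraps."""
    return 2 ** counter_bits / clock_hz


def elapsed_cycles(start, end, counter_bits=32):
    """Cycle difference that stays correct across a single counter wrap.

    Only valid when the timed interval is shorter than one rollover period.
    """
    return (end - start) % (2 ** counter_bits)


if __name__ == "__main__":
    # 2**32 cycles at 667 MHz is a little under 6.5 seconds.
    print(f"rollover after {rollover_seconds(32, 667e6):.2f} s")
    # A timing that starts just before the wrap and ends just after it.
    print(elapsed_cycles(0xFFFFFFF0, 0x00000010))
```

Any benchmark loop timed with such a counter therefore has to finish within one rollover period, or fall back to gettimeofday().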

MEMORY PERFORMANCE

The stream benchmark is a program that measures main memory throughput for several simple operations. The aggregate data rate for multiple threads is reported in the following table. Recall that the "peak" memory data rate for the X1 is 200 GB/s, for the Alpha is 5.2 GB/s, and for the SP3 is 1.6 GB/s. Data for the 16-way SP3 (375 MHz, Nighthawk II) is included too. Data for the Alpha ES45 (1 GHz) is obtained from the STREAM database. Data for the p690/SP4 is with affinity enabled (6/1/02). The X1 uses (aprun -A). The Opteron is supposed to have greater than 5.3 GB/s/CPU memory bandwidth, but we do not see that yet.

MB/s        copy    scale   add     triad
opteron     1594    1757    1767    1915
2 cpus      2667    2667    3000    3000
altix       3214    3169    3800    3809
X1          22111   21634   23658   23752
alpha1      1339    1265    1273    1383
es45-1      1946    1941    1978    1978
SP3 1       523     561     581     583
SP3/16-1    486     494     601     601
SP4-1       1774    1860    2098    2119

From AMD's published SPEC benchmark and McCalpin's suggested conversion of 171.swim results to triad memory bandwidth, we get 2.7 GB/s memory bandwidth for one Opteron processor.
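For readers unfamiliar with how the stream numbers are derived, the sketch below shows the triad kernel and the usual STREAM accounting: copy and scale move two 8-byte words per iteration, add and triad move three, and MB/s is counted with 1 MB = 10^6 bytes. This is an illustration in Python, not the benchmark itself (which is C or FORTRAN), so the rate it prints says nothing about hardware bandwidth. All names are illustrative.

```python
import time

# Words read plus written per loop iteration for each STREAM kernel.
WORDS_PER_ITER = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_mbs(kernel, n, seconds, word_bytes=8):
    """Data rate in MB/s (10**6 bytes) for n iterations of a kernel."""
    return WORDS_PER_ITER[kernel] * word_bytes * n / seconds / 1e6


def triad(a, b, c, q):
    """STREAM triad: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


if __name__ == "__main__":
    n = 1_000_000
    a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
    t0 = time.perf_counter()
    triad(a, b, c, 3.0)
    dt = time.perf_counter() - t0
    # Interpreter overhead dominates here; only the accounting is the point.
    print(f"triad: {stream_mbs('triad', n, dt):.1f} MB/s")
```

With this accounting, the 1915 MB/s triad figure for one Opteron corresponds to roughly 80 million loop iterations per second.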

The MAPS benchmark also characterizes memory access performance. Plotted are load/store bandwidth for sequential (stride 1) and random access.

The hint benchmark measures computation and memory efficiency as the problem size increases. (This is C hint version 1, 1994.) The following graph shows the performance of a single processor for the Opteron (147 MQUIPS), X1 (12.2 MQUIPS), Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS). The L1 and L2 cache boundaries are visible, as well as the Altix and SP4's L3.

Here are results from LMbench.

LOW LEVEL BENCHMARKS (single processor)

The following table compares the performance of the Opteron, X1, Altix, Alpha, and SP for basic CPU operations. These numbers are from the first few kernels of EuroBen's mod1ac. The 14th kernel (9th degree poly) is a rough estimate of peak FORTRAN performance since it has a high re-use of operands. (Revised 4/8/03)

                alpha   sp3    sp4    X1      opteron   Altix
broadcast       516     368    1946   2483    631       2309
copy            324     295    991    2101    343       1526
addition        285     186    942    1957    262       839
subtraction     288     166    968    1946    254       852
multiply        287     166    935    2041    263       855
division        55      64     90     608     185       136
dotproduct      609     655    2059   3459    520       545
X=X+aY          526     497    1622   4134    793       1727
Z=X+aY          477     331    1938   3833    794       1719
y=x1x2+x3x4     433     371    2215   3713    751       1809
1st ord rec.    110     107    215    48      265       124
2nd ord rec.    136     61     268    46      352       179
2nd diff        633     743    1780   4960    956       2575
9th deg. poly   701     709    2729   10411   1110      5180

basic operations (Mflops) euroben mod1ac
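To see why the 9th degree polynomial kernel approaches peak, consider evaluating it with Horner's rule: each point needs one load and one store but 9 multiplies and 9 adds, so the arithmetic units are kept busy without extra memory traffic. The Python sketch below assumes Horner's rule and a 2-flops-per-degree count; the EuroBen source may organize the evaluation differently, and the function names are illustrative.

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x; coeffs are ordered highest degree first."""
    result = coeffs[0]
    for c in coeffs[1:]:
        result = result * x + c  # one multiply and one add per degree
    return result


def flops_per_point(degree):
    """Floating-point operations per evaluation under Horner's rule."""
    return 2 * degree


def poly_mflops(degree, n_points, seconds):
    """Mflops for evaluating the polynomial at n_points in the given time."""
    return flops_per_point(degree) * n_points / seconds / 1e6


if __name__ == "__main__":
    coeffs = [1.0] * 10  # ten coefficients give a 9th degree polynomial
    print(horner(coeffs, 0.5), flops_per_point(9))
```

By contrast, a kernel such as copy or addition moves at least two words per element for at most one flop, which is why its Mflops track memory bandwidth rather than the clock.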

The following table compares the performance of various intrinsics (EuroBen mod1f). For the SP, it also shows the effect of -O4 optimization versus -O3. (Revised 4/8/03)

         alpha   sp3 -O4   sp3 -O3   sp4 -O4   X1     opteron   altix
x**y     8.3     1.8       1.6       7.1       49     5.4       11.4
sin      13      34.8      8.9       64.1      97.9   10.9      19.9
cos      12.8    21.4      7.1       39.6      71.4   6.1       19.9
sqrt     45.7    52.1      34.1      93.9      711    66.6      137
exp      15.8    30.7      5.7       64.3      355    9.2       123
log      15.1    30.8      5.2       59.8      185    10.1      72.4
tan      9.9     18.9      5.5       35.7      85.4   11.9      18.3
asin     13.3    10.4      10.2      26.6      107    16.9      25.3
sinh     10.7    2.3       2.3       19.5      82.6   9.3       16.6

intrinsics (Mcalls/s) euroben mod1f (N=10000)

The following table compares the performance (Mflops) of a simple FORTRAN matrix (REAL*8 400x400) multiply with the performance of DGEMM from the vendor math library (-lcxml for the Alpha, -lsci for the X1, -lessl for the SP). Note that the SP4 -lessl (3.3) is tuned for the Power4. The Mflops for 1000x1000 Linpack are also reported; these are from netlib, except the SP4 number, which is from IBM. (Revised 4/8/03)

          alpha   sp3    sp4    X1     opteron   altix
ftn       72      45     220    7562   110       205
lib       1182    1321   3174   9482   2778      4591
linpack   1031    1236   2894   3955

The following plot compares the performance of the scientific library DGEMM. We also compare libgoto and AMD's -lacml library. (Revised 8/27/03)
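The Mflops figures for the matrix multiply can be related to run time with the conventional count of 2*n^3 floating-point operations for an n x n multiply. The sketch below shows a naive triple loop and that accounting in Python; it assumes the 2*n^3 convention was the one used for the table, and the function names are illustrative.

```python
def matmul(a, b):
    """Naive dense matrix multiply on lists of rows."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]  # one multiply and one add
    return c


def matmul_flops(n):
    """Conventional flop count for an n x n by n x n multiply."""
    return 2 * n ** 3


def mflops(n, seconds):
    """Mflops for one n x n multiply completed in the given time."""
    return matmul_flops(n) / seconds / 1e6


if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(matmul_flops(400))  # flops in one 400x400 multiply
```

Under this count a 400x400 multiply is 128 million flops, so the 110 Mflops FORTRAN figure for the Opteron corresponds to about 1.2 seconds per multiply, and the 2778 Mflops library figure to under 50 milliseconds.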

The following table compares the single processor performance (Mflops) of the systems for the EuroBen mod2g, a 2-D Haar wavelet transform test. (Revised 4/8/03)

|--------------------------------------------------------------------------
|     Order     |   alpha   |   altix   |    SP4    |    X1    | opteron |
|  n1   |  n2   | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s)|(Mflop/s)|
|--------------------------------------------------------------------------
|   16  |   16  |  142.56   |  130.9    |  126.42   |   10.5   |   221   |
|   32  |   16  |  166.61   |  165.2    |  251.93   |   13.8   |   256   |
|   32  |   32  |  208.06   |  218.9    |  301.15   |   20.0   |   293   |
|   64  |   32  |  146.16   |  208.77   |  297.26   |   22.7   |   293   |
|   64  |   64  |  111.46   |  199.08   |  278.45   |   25.9   |   271   |
|  128  |   64  |  114.93   |  240.10   |  251.90   |   33.3   |   241   |
|  128  |  128  |  104.46   |  282.61   |  244.45   |   48.5   |   197   |
|  256  |  128  |  86.869   |  186.84   |  179.43   |   45.8   |   150   |
|  256  |  256  |  71.033   |  120.53   |  103.52   |   46.7   |   101   |
|  512  |  256  |  65.295   |  142.83   |  78.435   |   52.1   |    83   |
|--------------------------------------------------------------------------

The following plots the performance (Mflops) of Euroben mod2b, a dense linear system test, for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl).

The following plots the performance (Mflops) of Euroben mod2d, a dense eigenvalue test, for both optimized FORTRAN and using the BLAS from the vendor library. For the Alpha, -O4 optimization failed, so this data uses -O3.

The following plots the performance (iterations/second) of Euroben mod2e, a sparse eigenvalue test. (Revised 4/9/03)

The following figure shows the FORTRAN Mflops for one processor for various problem sizes for the EuroBen mod2f, a 1-D FFT. Data access is irregular, but cache effects are still apparent. (Revised 4/10/03).

MESSAGE-PASSING BENCHMARKS

LINKS

NWCHEM DFT performance
AMD opteron benchmarks
AMD's ACML library or here or Opteron library libgoto
processor for Sandia's Red Storm