ORNL Opteron Evaluation (Dunigan)
.... this is work in progress.... last revised
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
An IBM SP4, IBM Winterhawk II (noted as SP3 in the following tables
and graphs), Cray X1, SGI Altix (Itanium 2),
and Compaq Alpha ES40 at ORNL were used for comparison with the
Opteron.
The results below are in the following categories:
The Opteron has 2 CPUs and 2 GB of memory.
The X1 at ORNL has 8 nodes. Each node has 4 MSPs, each MSP has 4 SSPs,
and each SSP has two vector units.
The Cray "processor/CPU" in the results below is one MSP.
All 4 MSPs on a node share memory.
The Power4 consists of one node with 16 processors (2 MCMs)
sharing memory.
Both the Alpha and SP3 consist of four processors sharing memory
on a single node.
The following table summarizes the main characteristics of
the machines.

Specs:        Alpha SC   SP3        SP4         X1            Opteron    Altix
MHz           667        375        1300        800           1600       1300
memory/node   2 GB       2 GB       32 GB       16 GB         2 GB       512 GB
L1            64K        64K        32K         16K           64K        32K
L2            8 MB       8 MB       1.5 MB      2 MB          1 MB       256K
L3                                  128 MB                               3 MB
peak Mflops   2*MHz      4*MHz      4*MHz       12.8 Gflops   2*MHz      4*MHz
peak mem BW   5.2 GB/s   1.6 GB/s   200+ GB/s   ?200+ GB/s    5.3 GB/s   6.4 GB/s

Alpha: 2 buses @ 2.6 GB/s each.
X1 memory bandwidth is 34 GB/s/CPU.
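As a rough check on the "peak Mflops" row, the multipliers can be applied to the clock rates. The following Python sketch does only that arithmetic. The dictionary layout and the function name are ours, and the X1 is omitted because its table entry is an absolute figure rather than a multiplier.

```python
# Clock rates (MHz) and flops-per-cycle multipliers copied from the table above.
# The X1 is left out because its entry is an absolute figure, not a multiplier.
specs = {
    "Alpha SC": (667, 2),
    "SP3": (375, 4),
    "SP4": (1300, 4),
    "Opteron": (1600, 2),
    "Altix": (1300, 4),
}


def peak_mflops(mhz, flops_per_cycle):
    """Theoretical peak in Mflops: clock rate times flops issued per cycle."""
    return mhz * flops_per_cycle


peaks = {name: peak_mflops(mhz, mult) for name, (mhz, mult) in specs.items()}

for name, mflops in peaks.items():
    print(f"{name:9s} {mflops:5d} Mflops")
```

These are theoretical ceilings only; the measured rates in the tables that follow are a fraction of them.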
For the Alpha, nodes are interconnected with a
Quadrics switch
organized as a fat tree.
The SP nodes are interconnected with cross-bar switches
in an Omega-like network.
The X1 uses a modified 2-D torus.
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of these machines.
Some of the older benchmarks may need to be modified for these
newer faster machines -- increasing repetitions to avoid 0 elapsed times,
increasing problem sizes to test out of cache performance.
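One common way to avoid zero elapsed times is to keep doubling the repetition count until the timed interval is comfortably longer than the clock resolution. The sketch below illustrates that idea in Python. It is not taken from any of the benchmark suites, and the names timed_reps, tiny_kernel, and the 0.05 second threshold are hypothetical.

```python
import time


def timed_reps(kernel, min_seconds=0.05, max_reps=1 << 30):
    """Double the repetition count until the measured interval is long
    enough for the clock to resolve; return (reps, seconds_per_call)."""
    reps = 1
    while True:
        start = time.perf_counter()
        for _ in range(reps):
            kernel()
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds or reps >= max_reps:
            return reps, elapsed / reps
        reps *= 2


def tiny_kernel():
    # Stand-in workload that is far too short to time in a single call.
    return sum(range(100))


reps, per_call = timed_reps(tiny_kernel)
print(f"{reps} repetitions, {per_call * 1e6:.3f} microseconds per call")
```

The per-call time is the total interval divided by the final repetition count, so the loop overhead is included in the figure.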
Unless otherwise noted, the following compiler switches
were used.
Opteron: -O3 (pgf90)
X1: -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries: sci (X1),
cxml (Alpha),
and essl (SP).
We used the following benchmarks in our tests:
- ParkBench 2.1 --
provides low-level sequential and communication benchmarks,
parallel linear algebra benchmarks,
NAS parallel benchmarks,
and compact application codes.
Here is a summary of the benchmark modules.
Codes are in FORTRAN.
Results are often reported as a least-squares fit of the data;
we report actual performance numbers.
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalue, FFT, QR).
Here is a summary of the benchmark modules.
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
We also use the
MAPS memory benchmark.
The web sites include results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For both the Alpha and the SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
Both have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits wide, so it rolls over in less than 7 seconds.
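The rollover time follows directly from the counter width and the 667 MHz clock rate in the specification table. The Python sketch below works through the arithmetic and also shows a common workaround, reducing the difference of two readings modulo 2**32, which is our illustration and not something the benchmarks are known to do. The function names are hypothetical.

```python
def rollover_seconds(counter_bits, clock_hz):
    """Seconds until a free-running cycle counter of the given width wraps."""
    return (2 ** counter_bits) / clock_hz


def elapsed_cycles(start, end, counter_bits=32):
    """Cycle difference that stays correct across at most one wraparound."""
    return (end - start) % (2 ** counter_bits)


alpha_hz = 667e6  # Alpha clock rate from the specification table
wrap = rollover_seconds(32, alpha_hz)
print(f"32-bit counter at 667 MHz wraps after {wrap:.2f} s")

# A reading taken just before the wrap and one just after still yield
# the right interval when the difference is reduced modulo 2**32.
print(elapsed_cycles(0xFFFFFF00, 0x00000100))
```

The modular difference is only valid for intervals shorter than one full wrap, so it does not help for timings longer than about 6.4 seconds.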
For distributed benchmarks (MPI), the IBM and Alpha systems provide a
hardware-synchronized MPI_Wtime() with microsecond resolution.
On the Alpha, MPI_Wtime is frequency synchronized, but initial
offsets are only approximate. (On the Alpha, it appears MPI_Init tries
to provide an initial zero offset to the Elan counters
on each node when an MPI job starts.
On the SP3, we discovered several nodes that were not synchronized; a patch
was eventually provided.)
Time is not synchronized on the X1.
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
The aggregate data rate for multiple threads is reported
in the following table.
Recall that the "peak" memory data rate for the
X1 is 200 GB/s, for the Alpha is 5.2 GB/s,
and for the SP3 is 1.6 GB/s.
Data for the 16-way SP3 (375 MHz, Nighthawk II)
is included too.
Data for the Alpha ES45 (1 GHz) is obtained from the
stream database.
Data for p690/sp4 is with affinity enabled (6/1/02).
The X1 uses (aprun -A).
The Opteron is supposed to have greater than 5.3 GB/s/CPU memory bandwidth,
but we do not see that yet.
MB/s
              copy   scale     add   triad
opteron       1594    1757    1767    1915
2 cpus        2667    2667    3000    3000
altix         3214    3169    3800    3809
X1           22111   21634   23658   23752
alpha1        1339    1265    1273    1383
es45-1        1946    1941    1978    1978
SP3 1          523     561     581     583
SP3/16-1       486     494     601     601
SP4-1         1774    1860    2098    2119
From AMD's published SPEC benchmark and McCalpin's suggested
conversion of 171.swim results to triad memory bandwidth, we get 2.7 GB/s
memory bandwidth for one Opteron processor.
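For readers unfamiliar with how the stream columns are derived, the sketch below shows the triad kernel and the byte counting that turns an elapsed time into MB/s. It is a Python illustration of the bookkeeping only: the array length is a hypothetical small value, and an interpreted loop is bound by the interpreter rather than by memory, so the printed rate says nothing about the hardware.

```python
import time
from array import array

N = 200_000          # elements per array (hypothetical; real runs use far more)
SCALAR = 3.0
a = array("d", [1.0]) * N
b = array("d", [2.0]) * N
c = array("d", [0.0]) * N

BYTES_PER_ELEM = 8
# Bytes moved per kernel under the STREAM counting convention:
# one read plus one write for copy/scale, two reads plus one write for add/triad.
bytes_moved = {
    "copy": 2 * BYTES_PER_ELEM * N,
    "scale": 2 * BYTES_PER_ELEM * N,
    "add": 3 * BYTES_PER_ELEM * N,
    "triad": 3 * BYTES_PER_ELEM * N,
}


def triad(a, b, c, s):
    """a[i] = b[i] + s * c[i], the STREAM triad kernel."""
    for i in range(len(a)):
        a[i] = b[i] + s * c[i]


start = time.perf_counter()
triad(a, b, c, SCALAR)
elapsed = time.perf_counter() - start

mb_per_s = bytes_moved["triad"] / elapsed / 1e6
print(f"triad: {mb_per_s:.1f} MB/s (interpreter-bound, not a hardware figure)")
```

Because add and triad count three arrays per iteration while copy and scale count two, the same elapsed time yields a higher reported rate for the last two columns.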
The MAPS benchmark also characterizes memory access performance.
Plotted are load/store bandwidth for sequential (stride 1) and random
access.
The hint benchmark measures computation and memory efficiency as
the problem size increases.
(This is C hint version 1, 1994.)
The following graph shows the performance of a single processor
for the Opteron (147 MQUIPS), X1 (12.2 MQUIPS),
Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS).
The L1 and L2 cache boundaries are visible, as well as the Altix and
SP4's L3.
Here are results from lmbench.
LOW LEVEL BENCHMARKS (single processor)
The following table compares the performance of the machines
for basic CPU operations.
These numbers are from the first few kernels of EuroBen's mod1ac.
The 14th kernel (9th degree poly)
is a rough estimate of peak FORTRAN performance since it
has a high re-use of operands.
(Revised 4/8/03)
                 alpha     sp3     sp4      X1 opteron   Altix
broadcast          516     368    1946    2483     631    2309
copy               324     295     991    2101     343    1526
addition           285     186     942    1957     262     839
subtraction        288     166     968    1946     254     852
multiply           287     166     935    2041     263     855
division            55      64      90     608     185     136
dotproduct         609     655    2059    3459     520     545
X=X+aY             526     497    1622    4134     793    1727
Z=X+aY             477     331    1938    3833     794    1719
y=x1x2+x3x4        433     371    2215    3713     751    1809
1st ord rec.       110     107     215      48     265     124
2nd ord rec.       136      61     268      46     352     179
2nd diff           633     743    1780    4960     956    2575
9th deg. poly      701     709    2729   10411    1110    5180
Basic operations (Mflops), EuroBen mod1ac
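The high operand re-use of the 9th degree polynomial kernel can be seen from a Horner-style evaluation, sketched below in Python. We assume a Horner formulation for illustration; the actual EuroBen source may be organized differently, and the coefficients and sample points here are hypothetical.

```python
def horner9(coeffs, x):
    """Evaluate a 9th-degree polynomial at x with Horner's rule.
    coeffs holds c0..c9 (lowest order first)."""
    assert len(coeffs) == 10
    y = coeffs[9]
    for c in reversed(coeffs[:9]):
        y = y * x + c          # one multiply and one add per step
    return y


FLOPS_PER_POINT = 2 * 9        # 9 multiplies + 9 adds
MEM_OPS_PER_POINT = 2          # load x, store y; coefficients stay in registers

coeffs = [1.0] * 10            # hypothetical coefficients: 1 + x + ... + x**9
xs = [0.0, 1.0, 2.0]
ys = [horner9(coeffs, x) for x in xs]
print(ys)
print("flops per memory operation:", FLOPS_PER_POINT / MEM_OPS_PER_POINT)
```

With 18 floating-point operations for each load and store pair, the kernel is limited by arithmetic rather than by memory, which is why it approximates peak compiled performance.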
The following table compares the performance of various intrinsics
(EuroBen mod1f).
For the SP, it also shows the effect of -O4 optimization versus -O3.
(Revised 4/8/03)
          alpha  sp3 -O4  sp3 -O3  sp4 -O4       X1  opteron    altix
x**y        8.3      1.8      1.6      7.1       49      5.4     11.4
sin          13     34.8      8.9     64.1     97.9     10.9     19.9
cos        12.8     21.4      7.1     39.6     71.4      6.1     19.9
sqrt       45.7     52.1     34.1     93.9      711     66.6      137
exp        15.8     30.7      5.7     64.3      355      9.2      123
log        15.1     30.8      5.2     59.8      185     10.1     72.4
tan         9.9     18.9      5.5     35.7     85.4     11.9     18.3
asin       13.3     10.4     10.2     26.6      107     16.9     25.3
sinh       10.7      2.3      2.3     19.5     82.6      9.3     16.6
Intrinsics (Mcalls/s), EuroBen mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix (REAL*8 400x400) multiply compared with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lsci for the X1,
-lessl for the SP).
Note, the SP4 -lessl (3.3) is tuned for the Power4.
The Mflops for 1000x1000 Linpack are also reported
from netlib,
except the SP4 number, which is from
IBM.
(Revised 4/8/03)
alpha sp3 sp4 X1 opteron altix
ftn 72 45 220 7562 110 205
lib 1182 1321 3174 9482 2778 4591
linpack 1031 1236 2894 3955
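The Mflops figures for the matrix multiply follow from the usual operation count of 2*n**3 for an n x n product. The Python sketch below shows a naive triple loop and that count. It is an illustration only: the loop order and the names are ours, the matrix size is kept small so the interpreted loop finishes quickly, and the printed rate reflects the interpreter rather than any of the machines.

```python
import time


def matmul(a, b, n):
    """Naive triple-loop product of two n x n matrices stored as lists of rows."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c


def matmul_flops(n):
    """One multiply and one add per inner iteration: 2 * n**3 operations."""
    return 2 * n ** 3


n = 40  # kept small so the pure-Python loop finishes quickly; the table uses 400
a = [[1.0] * n for _ in range(n)]
b = [[2.0] * n for _ in range(n)]

start = time.perf_counter()
c = matmul(a, b, n)
elapsed = time.perf_counter() - start

print(f"n={n}: {matmul_flops(n) / elapsed / 1e6:.2f} Mflops (interpreter-bound)")
print("flops for the 400x400 case:", matmul_flops(400))
```

For the 400x400 case the count is 128 million operations, so a rate of 110 Mflops corresponds to a little over one second per multiply.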
The following plot compares the performance of
the scientific library DGEMM.
We also compare libgoto and
AMD's -lacml library.
(Revised 8/27/03).
The following table compares the single-processor
performance (Mflops) of the
machines for EuroBen mod2g,
a 2-D Haar wavelet transform test.
(Revised 4/8/03)
|-------------------------------------------------------------------------|
|    Order    |     alpha |     altix |       SP4 |        X1 |   opteron |
|   n1 |   n2 | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s) |
|-------------------------------------------------------------------------|
|   16 |   16 |    142.56 |     130.9 |    126.42 |      10.5 |       221 |
|   32 |   16 |    166.61 |     165.2 |    251.93 |      13.8 |       256 |
|   32 |   32 |    208.06 |     218.9 |    301.15 |      20.0 |       293 |
|   64 |   32 |    146.16 |    208.77 |    297.26 |      22.7 |       293 |
|   64 |   64 |    111.46 |    199.08 |    278.45 |      25.9 |       271 |
|  128 |   64 |    114.93 |    240.10 |    251.90 |      33.3 |       241 |
|  128 |  128 |    104.46 |    282.61 |    244.45 |      48.5 |       197 |
|  256 |  128 |    86.869 |    186.84 |    179.43 |      45.8 |       150 |
|  256 |  256 |    71.033 |    120.53 |    103.52 |      46.7 |       101 |
|  512 |  256 |    65.295 |    142.83 |    78.435 |      52.1 |        83 |
|-------------------------------------------------------------------------|
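For context on what a 2-D Haar wavelet transform computes, the Python sketch below applies one averaging and differencing level to the rows and then to the columns of a small matrix. The normalization (division by 2) and the single-level structure are assumptions made for illustration; EuroBen mod2g may use a different scaling and a full multi-level decomposition.

```python
def haar_1d(v):
    """One Haar level: pairwise averages followed by pairwise half-differences."""
    half = len(v) // 2
    avg = [(v[2 * i] + v[2 * i + 1]) / 2 for i in range(half)]
    diff = [(v[2 * i] - v[2 * i + 1]) / 2 for i in range(half)]
    return avg + diff


def haar_2d(m):
    """One 2-D Haar level: transform every row, then every column."""
    rows = [haar_1d(r) for r in m]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]


m = [
    [1.0, 3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0, 7.0],
    [2.0, 2.0, 2.0, 2.0],
    [4.0, 4.0, 4.0, 4.0],
]
out = haar_2d(m)
for row in out:
    print(row)
```

The column pass walks the matrix with a stride of one full row, which is one reason performance in the table drops once the arrays no longer fit in cache.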
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl).
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
for both optimized FORTRAN and using the BLAS from the vendor library.
For the Alpha, -O4 optimization failed, so this data uses -O3.
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
(Revised 4/9/03)
The following figure shows the FORTRAN Mflops for one processor for various
problem sizes
for the EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache effects are still apparent.
(Revised 4/10/03).
MESSAGE-PASSING BENCHMARKS
- NWCHEM DFT performance
- AMD Opteron benchmarks
- AMD's ACML library, or the Opteron library libgoto
- processor for Sandia's Red Storm