ORNL Opteron Evaluation
This is a work in progress. Revised for the 1.8 GHz Opteron 4/9/04.
Results for the older 1.6 GHz Opteron are also available.
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
An IBM SP4, an IBM Winterhawk II (noted as SP3 in the following tables
and graphs), a Cray X1, an SGI Altix (Itanium 2),
and a Compaq Alpha ES40 at ORNL were used for comparison with the
Opteron.
The results below are grouped into several categories.
The Opteron system has four 1.8 GHz CPUs and 16 GB of memory.
The X1 at ORNL has 8 nodes. Each node has 4 MSPs, each MSP has 4 SSPs,
and each SSP has two vector units.
The Cray "processor/CPU" in the results below is one MSP.
All four MSPs on a node share memory.
The Power4 consists of one node with 32 processors
sharing memory.
Both the Alpha and SP3 consist of four processors sharing memory
on a single node.
All memory on the Altix is globally accessible.
The following table summarizes the main characteristics of
the machines.

 Specs:        Alpha SC   SP3       SP4        X1          Opteron   Altix
 MHz           667        375       1300       800         1800      1500
 memory/node   2 GB       2 GB      32 GB      16 GB       16 GB     512 GB
 L1 cache      64 KB      64 KB     32 KB      16 KB       64 KB     32 KB
 L2 cache      8 MB       8 MB      1.5 MB     2 MB        1 MB      256 KB
 L3 cache      --         --        128 MB     --          --        6 MB
 peak Mflops   2*MHz      4*MHz     4*MHz      12800       2*MHz     4*MHz
 peak mem BW   5.2 GB/s   1.6 GB/s  200+ GB/s  200+ GB/s?  5.3 GB/s  6.4 GB/s

The Alpha has two memory buses at 2.6 GB/s each.
X1 memory bandwidth is 34 GB/s per CPU (MSP).
For the Alpha, nodes are interconnected with a
Quadrics switch
organized as a fat tree.
The SP nodes are interconnected with cross-bar switches
in an Omega-like network.
The X1 uses a modified 2-D torus.
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of the Opteron.
Some of the older benchmarks may need to be modified for these
newer, faster machines: increasing repetitions to avoid zero elapsed times
and increasing problem sizes to test out-of-cache performance.
Unless otherwise noted, the following compiler switches were used:
Opteron: -O3 (pgf90 v5.1)
X1: -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries: sci (X1),
cxml (Alpha),
and essl (SP).
We used the following benchmarks in our tests:
- ParkBench 2.1 --
provides low-level sequential and communication benchmarks,
parallel linear algebra benchmarks,
NAS parallel benchmarks,
and compact application codes.
Here is a summary of the benchmark modules.
Codes are in FORTRAN.
Results are often reported as a least-squares fit of the data;
we report actual performance numbers.
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalues, FFT, QR).
Here is a summary of the benchmark modules.
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
We also use the
MAPS memory benchmark.
The web sites include results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For both the Alpha and the SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
Both have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits, so it rolls over in less than 7 seconds.
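A minimal sketch of wall-clock timing with gettimeofday() (the repetition
count and the dummy work loop are placeholders):

    #include <stdio.h>
    #include <sys/time.h>

    /* wall-clock seconds with microsecond resolution */
    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        int i, reps = 1000000;     /* enough repetitions to avoid a 0 elapsed time */
        volatile double x = 0.0;
        double t0, t1;

        t0 = walltime();
        for (i = 0; i < reps; i++) /* placeholder work loop */
            x += 1.0e-9;
        t1 = walltime();
        printf("%g us per iteration\n", (t1 - t0) * 1.0e6 / reps);
        return 0;
    }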
For distributed benchmarks (MPI), the IBM and Alpha systems provide a hardware
synchronized MPI_Wtime() with microsecond resolution.
On the Alpha, MPI_Wtime is frequency synchronized, but initial
offsets are only approximate. (On the Alpha, it appears MPI_Init tries
to provide an initial zero offset to the Elan counters
on each node when an MPI job starts.
On the SP3, we discovered several nodes that were not synchronized; a patch
was eventually provided.)
Time is not synchronized on the X1.
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
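The four stream kernels (copy, scale, add, triad) are simple loops over
arrays much larger than the caches. A minimal sketch of the triad kernel
follows; it is not the official stream source, and the array size and
timing harness are placeholders:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define N 2000000                        /* elements; large enough to spill the caches */

    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double scalar = 3.0, t;
        int i;

        for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        t = walltime();
        for (i = 0; i < N; i++)              /* triad: a = b + scalar*c         */
            a[i] = b[i] + scalar * c[i];     /* 24 bytes of traffic per element */
        t = walltime() - t;

        printf("triad %.1f MB/s\n", 3.0 * N * sizeof(double) / t / 1.0e6);
        free(a); free(b); free(c);
        return 0;
    }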
The aggregate data rate for multiple threads is reported
in the following table.
Recall that the "peak" memory data rate is 200 GB/s for the
X1, 5.2 GB/s for the Alpha,
and 1.6 GB/s for the SP3.
Data for the 16-way SP3 (375 MHz, Nighthawk II)
is included too.
Data for the Alpha ES45 (1 GHz) is obtained from the
stream database.
Data for p690/sp4 is with affinity enabled (6/1/02).
The X1 uses (aprun -A).
The Opteron is supposed to have more than 5.3 GB/s of memory bandwidth per CPU;
we do not see that yet.
MB/s
                    copy    scale   add     triad
 opteron (1 cpu)    1975    1747    1945    2018
 opteron (2 cpus)   2539    2623    2892    3117
 opteron (4 cpus)   4848    5000    5714    6316
 altix              3214    3169    3800    3809
 X1                 22111   21634   23658   23752
 alpha-1            1339    1265    1273    1383
 es45-1             1946    1941    1978    1978
 SP3-1              523     561     581     583
 SP3/16-1           486     494     601     601
 SP4-1              1774    1860    2098    2119
From AMD's published SPEC benchmark results and McCalpin's suggested
conversion of 171.swim results to triad memory bandwidth, we get 2.7 GB/s
of memory bandwidth for one Opteron processor.
The
MAPS benchmark also characterizes memory access performance.
Plotted are load/store bandwidth for sequential (stride 1) and random
access.
The tabletoy benchmark (C) makes random writes of 64-bit integers in
a shared memory;
parallelization is permitted, with possibly non-coherent updates.
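A minimal serial sketch of the tabletoy update loop (the table size, update
count, and random-number generator are placeholders; the actual benchmark
parallelizes this loop, permitting the non-coherent updates noted above):

    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_WORDS (1 << 25)            /* 2^25 64-bit words = 268 MB table */
    #define NUPDATES    (4L * TABLE_WORDS)

    int main(void)
    {
        unsigned long long *table = malloc(TABLE_WORDS * sizeof *table);
        unsigned long long ran = 1;
        long i;

        for (i = 0; i < TABLE_WORDS; i++) table[i] = i;

        /* each update XORs a pseudo-random value into a
           pseudo-random 64-bit word of the shared table */
        for (i = 0; i < NUPDATES; i++) {
            ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
            table[ran & (TABLE_WORDS - 1)] ^= ran;
        }

        printf("table[0] = %llu\n", table[0]);  /* keep the loop from being optimized away */
        free(table);
        return 0;
    }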
The X1 number is for vectorizing the inner loop (multistreaming
was an order of magnitude slower, 88 MB/s).
The data rates in the following table are for a 268 MB table.
We include multi-threaded altix, opteron, sp3 (NERSC), and sp4 data as well.
Revised 2/9/04

MB/s (using wall-clock time)
 threads   sp4    altix   X1-msp   opteron   sp3
 1         26     42      1190     36        8
 2         47     45      --       65        26
 4         98     62      --       102       53
 8         174    86      --       --        90
 16        266    69      --       --        139
 32        322    77      --       --        --
The hint
benchmark measures computation and memory efficiency as
the problem size increases.
(This is C hint version 1, 1994.)
The following graph shows the performance of a single processor
for the Opteron (176 MQUIPS), X1 (12.2 MQUIPS),
Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS).
The L1 and L2 cache boundaries are visible, as well as the Altix and
SP4's L3.
Here are results from LMbench.
LOW LEVEL BENCHMARKS (single processor)
The following table compares the performance of the Alpha, SP, X1, Opteron,
and Altix for basic CPU operations.
These numbers are from the first few kernels of EuroBen's mod1ac.
The 14th kernel (9th degree poly)
is a rough estimate of peak FORTRAN performance since it
has a high re-use of operands.
(Revised 2/9/04)
alpha sp3 sp4 X1 opteron Altix
broadcast 516 368 1946 2483 882 2553
copy 324 295 991 2101 591 1758
addition 285 186 942 1957 589 1271
subtraction 288 166 968 1946 589 1307
multiply 287 166 935 2041 590 1310
division 55 64 90 608 223 213
dotproduct 609 655 2059 3459 593 724
X=X+aY 526 497 1622 4134 884 2707
Z=X+aY 477 331 1938 3833 884 2632
y=x1x2+x3x4 433 371 2215 3713 1319 2407
1st ord rec. 110 107 215 48 143 142
2nd ord rec. 136 61 268 46 208 206
2nd diff 633 743 1780 4960 1313 2963
9th deg. poly 701 709 2729 10411 1197 5967
basic operations (Mflops) euroben mod1ac
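The 9th-degree polynomial kernel approaches peak because each element
requires many multiply-adds on operands that stay in registers, with only
one load and one store per element. A sketch of such a kernel in C (the
EuroBen kernel itself is FORTRAN; the coefficients and vector length are
whatever the caller supplies):

    /* Horner evaluation of a 9th-degree polynomial over a vector:
       ~18 flops per element for one load and one store */
    void poly9(int n, const double *x, double *y, const double *c)
    {
        int i;
        for (i = 0; i < n; i++) {
            double xi = x[i];
            double s = c[9];
            s = s * xi + c[8];
            s = s * xi + c[7];
            s = s * xi + c[6];
            s = s * xi + c[5];
            s = s * xi + c[4];
            s = s * xi + c[3];
            s = s * xi + c[2];
            s = s * xi + c[1];
            s = s * xi + c[0];
            y[i] = s;
        }
    }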
The following table compares the performance of various intrinsics
(EuroBen mod1f).
For the SP, it also shows the effect of -O4 optimization versus -O3.
(Revised 2/9/04)
alpha sp3 -O4 sp3 -O3 sp4 -O4 X1 opteron altix
x**y 8.3 1.8 1.6 7.1 49 6.0 13.2
sin 13 34.8 8.9 64.1 97.9 26.7 22.9
cos 12.8 21.4 7.1 39.6 71.4 17.7 22.9
sqrt 45.7 52.1 34.1 93.9 711 74.6 107
exp 15.8 30.7 5.7 64.3 355 21.8 137
log 15.1 30.8 5.2 59.8 185 31.8 88.5
tan 9.9 18.9 5.5 35.7 85.4 21.8 21.1
asin 13.3 10.4 10.2 26.6 107 13.4 29.2
sinh 10.7 2.3 2.3 19.5 82.6 14.8 19.1
intrinsics (Mcalls/s) euroben mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix (REAL*8 400x400) multiply compared with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lsci for the X1,
-lessl for the SP).
Note, the SP4 -lessl (3.3) is tuned for the Power4.
The Mflops for the 1000x1000 Linpack benchmark are also reported
from netlib,
except the SP4 number, which is from
IBM.
(Revised 2/9/04)
alpha sp3 sp4 X1 opteron altix
ftn 72 45 220 7562 287 228
lib 1182 1321 3174 9482 2610 5222
linpack 1031 1236 2894 3955
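In the table above, the "ftn" row is a plain triple loop and the "lib" row
calls the vendor DGEMM. A hedged sketch of both in C, using the standard
FORTRAN BLAS calling convention (link against -lcxml, -lessl, -lsci,
-lacml, or libgoto as appropriate; the trailing-underscore name and the
column-major storage assumed by the library are noted in the comments):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 400

    /* FORTRAN BLAS entry point; the trailing underscore may differ by platform,
       and the routine assumes column-major storage */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    /* the "ftn"-style version: a naive triple loop */
    static void matmul_naive(int n, const double *a, const double *b, double *c)
    {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                double s = 0.0;
                for (k = 0; k < n; k++)
                    s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
    }

    int main(void)
    {
        int n = N, i;
        double one = 1.0, zero = 0.0;
        double *a = malloc(N * N * sizeof(double));
        double *b = malloc(N * N * sizeof(double));
        double *c = malloc(N * N * sizeof(double));

        for (i = 0; i < N * N; i++) { a[i] = 1.0; b[i] = 2.0; }

        matmul_naive(n, a, b, c);              /* loop version    */
        dgemm_("N", "N", &n, &n, &n, &one,     /* library version */
               a, &n, b, &n, &zero, c, &n);

        printf("c[0] = %g\n", c[0]);
        free(a); free(b); free(c);
        return 0;
    }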
The following plot compares the performance of
the scientific library DGEMM.
We also compare libgoto
and AMD's -lacml library.
(Revised 2/9/04).
The following plot compares the DAXPY performance of the Opteron and Itanium
(Altix).
The following table compares the single
processor performance (Mflops) of the
machines for EuroBen mod2g,
a 2-D Haar wavelet transform test.
(Revised 2/9/04)
|--------------------------------------------------------------------------
| Order | alpha | altix | SP4 | X1 | opteron |
| n1 | n2 | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s)|(Mflop/s)|
|--------------------------------------------------------------------------
| 16 | 16 | 142.56 | 150.4 | 126.42 | 10.5 | 341 |
| 32 | 16 | 166.61 | 192.1 | 251.93 | 13.8 | 383 |
| 32 | 32 | 208.06 | 262.3 | 301.15 | 20.0 | 437 |
| 64 | 32 | 146.16 | 252.7 | 297.26 | 22.7 | 437 |
| 64 | 64 | 111.46 | 242.5 | 278.45 | 25.9 | 387 |
| 128 | 64 | 114.93 | 295.6 | 251.90 | 33.3 | 342 |
| 128 | 128 | 104.46 | 350.2 | 244.45 | 48.5 | 264 |
| 256 | 128 | 86.869 | 211.2 | 179.43 | 45.8 | 198 |
| 256 | 256 | 71.033 | 133.3 | 103.52 | 46.7 | 129 |
| 512 | 256 | 65.295 | 168.7 | 78.435 | 52.1 | 99 |
|--------------------------------------------------------------------------
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl).
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
for both optimized FORTRAN and using the BLAS from the vendor library.
For the Alpha, -O4 optimization failed, so this data uses -O3.
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
(Revised 2/9/04)
The following figure shows the FORTRAN Mflops for one processor for various
problem sizes
for the EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache effects are still apparent.
(Revised 2/9/04).
The following compares a 1-D FFT using the
FFTW benchmark.
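A minimal sketch of the transform being timed, written against the FFTW 3
interface (the benchmark may use a different FFTW release; plan creation,
i.e. initialization, stays outside the timed region):

    #include <stdio.h>
    #include <fftw3.h>

    int main(void)
    {
        int n = 1024, i;                      /* transform length (placeholder) */
        fftw_complex *in  = fftw_malloc(n * sizeof(fftw_complex));
        fftw_complex *out = fftw_malloc(n * sizeof(fftw_complex));

        /* plan creation (initialization) is done once and is not timed */
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

        for (i = 0; i < n; i++) { in[i][0] = i; in[i][1] = 0.0; }

        fftw_execute(p);                      /* the timed 1-D complex FFT */

        printf("out[0] = (%g, %g)\n", out[0][0], out[0][1]);
        fftw_destroy_plan(p);
        fftw_free(in); fftw_free(out);
        return 0;
    }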
The following graph plots 1-D FFT performance using the vendor
library (-lacml, -lscs, -lsci, or -lessl); initialization time is not included.
Revised 2/9/04
Both the Alpha and the IBMs consist of a cluster of shared-memory nodes, each node with four processors sharing a common memory (16 SSPs per node for the X1 and 32 processors for the SP4).
The X1 is cache-coherent within a node, but the memory space
is global across all nodes.
The Opteron shares memory among its four processors.
We tested the performance of a shared-memory node with various C programs with explicit thread calls and with FORTRAN OpenMP codes.
The X1 pthreads model permits up to 16 SSP threads (-h ssp) or
4 MSP threads where each thread can also be multithreaded on each MSP.
The following table shows the performance of thread/join in C as the
master thread creates two, three, and four threads.
The test repeatedly creates and joins threads.
Revised 2/24/04.
threads alpha sp3 sp4 altix x1 opteron
2 47.7 96 44 399 29695 41
3 165 152 68 842 55439 82
4 251 222 97 1241 79180 126
thread create/join time in microseconds (C)
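A sketch of the create/join test: the master repeatedly creates and joins a
fixed number of worker threads and reports the average cost per create/join
cycle (thread and repetition counts are placeholders):

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define NTHREADS 4
    #define REPS     1000

    static void *worker(void *arg) { return arg; }   /* threads do no work */

    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i, r;
        double t = walltime();

        for (r = 0; r < REPS; r++) {
            for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
            for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        }
        printf("create/join of %d threads: %.1f us\n",
               NTHREADS, (walltime() - t) * 1.0e6 / REPS);
        return 0;
    }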
Often, it is more efficient to create the threads once, and then
provide them work as needed.
I suspect this is what FORTRAN OpenMP does for "parallel do".
The following table is the performance of parallel do.
threads alpha sp3 sp4 altix x1
2 2.1 12.7 6.3 4.8 12.1
3 3.4 15.3 8.4 6.3 13.2
4 5.2 19.5 9.5 6.5 17.4
OpenMP parallel DO (us)
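The OpenMP tests here are FORTRAN, but an equivalent C measurement of
"parallel for" overhead looks roughly like the following sketch (repetition
count and loop body are placeholders):

    #include <omp.h>
    #include <stdio.h>

    #define REPS 10000
    #define N    64

    int main(void)
    {
        volatile double a[N];
        int i, r;
        double t = omp_get_wtime();

        for (r = 0; r < REPS; r++) {
            /* a nearly empty parallel loop isolates the fork/schedule overhead */
            #pragma omp parallel for
            for (i = 0; i < N; i++)
                a[i] = i;
        }
        printf("parallel for overhead: %.2f us with %d threads\n",
               (omp_get_wtime() - t) * 1.0e6 / REPS, omp_get_max_threads());
        return 0;
    }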
The following table shows the time required to lock and unlock
using pthread_mutex_lock with various numbers of threads.
For the IBMs we use setenv SPINLOOPTIME 5000.
threads alpha sp3 sp4 altix x1 opteron
1 0.26 0.6 0.3 0.07 4.9 0.06
2 1.5 1.4 1.3 2.6 295 11.5
3 17.8 2.1 1.6 41.5 1317 17.5
4 29.6 2.9 3.8 73.2 1703 24.1
time for lock/unlock (us)
The graph to the right shows the time to lock/unlock a System V semaphore
when competing with other processors.
The following table compares the performance of a simple C barrier
program using a single lock and spinning on a shared variable
along with pthread_yield.
threads alpha sp3 sp4 altix x1 opteron
1 0.25 0.6 0.3 0.08 4.9 0.05
2 1.36 4.4 1.9 0.5 97.7 3.3
3 9.9 20.5 3.1 18.1 95.2 10.6
4 65 34.6 3.7 53.2 99.3 2.4
C barrier times (us)
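A sketch of the barrier being timed: one mutex-protected counter plus
spinning on a shared flag, yielding the processor while waiting
(sched_yield() is used here since pthread_yield() is not available
everywhere; thread and repetition counts are placeholders):

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define NTHREADS 4
    #define REPS     10000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int count = 0;     /* threads that have arrived       */
    static volatile int phase = 0;     /* flips when the last one arrives */

    static void barrier(void)
    {
        int my_phase = phase;
        pthread_mutex_lock(&lock);
        if (++count == NTHREADS) {     /* last arrival releases the rest */
            count = 0;
            phase = !phase;
        }
        pthread_mutex_unlock(&lock);
        while (phase == my_phase)      /* spin on the shared variable */
            sched_yield();
    }

    static void *worker(void *arg)
    {
        int r;
        for (r = 0; r < REPS; r++)
            barrier();
        return arg;
    }

    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i;
        double t = walltime();

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("%d-thread barrier: %.2f us\n",
               NTHREADS, (walltime() - t) * 1.0e6 / REPS);
        return 0;
    }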
The following table illustrates linear speedup for an embarrassingly
parallel integration.
A C code with explicit thread management is compared with FORTRAN
Open MP.
Both just used -O optimization.
The CRAY C pthread implementation does not scale well for SSPs or MSPs.
fortran OpenMP C threads
alpha sp3 sp4 altix x1 | alpha sp3 sp4 altix x1 msp opteron
1 252 102 251 891 676 | 166 52 216 558 669 2480 311
2 502 204 501 1775 1354 | 331 104 432 1114 1154 2957 621
3 748 306 752 2312 2026 | 496 157 648 1668 1513 2884 921
4 990 408 1002 3519 2695 | 657 206 864 2221 1732 2805 1214
8 1999 6815 5336 | 1725 4410 1789
16 3565 12039 10470 | 3429 8580 1254
rectangle rule (Mflops) -O optimization
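A sketch of the embarrassingly parallel rectangle-rule integration with
explicit C threads (the integrand, interval count, and work partitioning
are placeholders; the FORTRAN version uses OpenMP):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS  4
    #define INTERVALS 100000000L             /* placeholder problem size */

    static double partial[NTHREADS];         /* one slot per thread, no shared writes */

    static double f(double x) { return 4.0 / (1.0 + x * x); }  /* placeholder integrand */

    /* each thread sums its own strided subset of the rectangles */
    static void *integrate(void *arg)
    {
        long me = (long)arg;
        double h = 1.0 / INTERVALS, sum = 0.0;
        long i;
        for (i = me; i < INTERVALS; i += NTHREADS)
            sum += f((i + 0.5) * h);
        partial[me] = sum * h;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        double total = 0.0;
        long i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, integrate, (void *)i);
        for (i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += partial[i];
        }
        printf("integral = %.12f\n", total);
        return 0;
    }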
The following table illustrates an explicit thread implementation
of Cholesky factorization of a 1000x1000 double precision
matrix in C (-O optimization).
threads alpha sp3 sp4 altix x1 msp opteron
1 150 125 350 196 525 476 285
2 269 238 631 341 848 733 481
3 369 353 1007 512 1096 942 552
4 435 390 1306 621 797 1087 722
cholp 1k matrix factor (mflops) -O optimization
Related links:
- Opteron architecture
- NWCHEM DFT performance
- AMD Opteron benchmarks
- AMD's ACML library
- Opteron library libgoto
- Opteron BIOS and kernel developer's guide
- PAPI
- HPET high-precision timers, rdtsc timers, and dclock
- Opteron as the processor for Sandia's Red Storm