ORNL Compaq Alpha / IBM SP evaluation
Most of this data was collected in the summer of 2000.
Oak Ridge National Laboratory (ORNL) is currently performing an in-depth
evaluation of the Compaq AlphaServer SC parallel architecture
as part of its
Evaluation of Early Systems project.
The primary tasks of the evaluation are to
- determine the most effective approaches to using the AlphaServer SC;
- evaluate benchmark and application performance, and compare with similar
systems from other vendors;
- evaluate system and system administration software reliability and
performance;
- predict scalability, both in terms of problem size and in number of
processors.
The emphasis of the evaluation is on application-relevant studies
for applications of importance to DOE.
However, standard benchmarks are still important for comparisons
with other systems.
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
A large IBM SP3 at ORNL was used for comparison with the Alpha
in the results presented below.
The results below are in the following categories:
ARCHITECTURE
Both the Alpha and the SP are built from nodes of four processors
sharing a common memory.
The following table summarizes the main characteristics of
the SP and Alpha.
Specs:            Alpha SC               SP3
clock             667 MHz                375 MHz
memory/node       2 GB                   2 GB
L1 cache          64 KB                  64 KB
L2 cache          8 MB                   8 MB
peak Mflops       2*MHz                  4*MHz
peak memory BW    5.2 GB/s               1.6 GB/s
                  (2 buses @ 2.6 GB/s each)
For the Alpha, nodes are interconnected with a
Quadrics switch
organized as a fat tree.
The SP nodes are interconnected with cross-bar switches
in an Omega-like network.
BENCHMARKS
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of the Alpha SC cluster.
Some of the older benchmarks may need to be modified for these
newer, faster machines: increasing repetition counts to avoid zero elapsed
times and increasing problem sizes to test out-of-cache performance.
Unless otherwise noted, the following compiler switches
were used on the Alpha and SP.
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries, cxml (Alpha)
and essl (SP).
We used the following benchmarks in our tests:
- ParkBench 2.1 --
provides low-level sequential and communication benchmarks,
parallel linear algebra benchmarks,
NAS parallel benchmarks,
and compact application codes.
The ParkBench documentation summarizes the individual benchmark modules.
Codes are in FORTRAN.
ParkBench results are often reported as a least-squares fit of the data;
we report the actual measured numbers.
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalue, FFT, QR).
The EuroBen documentation summarizes the individual benchmark modules.
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
The web site includes results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For both the Alpha and the SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
Both have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits, so it rolls over in less than 7 seconds.
For distributed benchmarks (MPI), both systems provide a hardware
synchronized MPI_Wtime() with microsecond resolution.
On the Alpha, MPI_Wtime is frequency synchronized, but the initial
offsets are only approximate. (On the Alpha, MPI_Init appears
to provide an initial zero offset to the Elan counters
on each node when an MPI job starts.
On the SP, we discovered several nodes that were not synchronized;
a patch was eventually provided.)
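As a concrete illustration, the following C sketch (an assumption about the
form of our timing harness, not the exact code used) reads gettimeofday()
and reports its observed tick size, along with the resolution MPI_Wtick()
claims for MPI_Wtime():

  /* Timer sketch (an assumption about the harness, not the exact code).
   * gettimeofday() supplies microsecond wall-clock time on both systems;
   * MPI_Wtime()/MPI_Wtick() are used for the distributed benchmarks. */
  #include <stdio.h>
  #include <sys/time.h>
  #include <mpi.h>

  static double wall_seconds(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + 1.0e-6 * tv.tv_usec;
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* Estimate gettimeofday() granularity: spin until the clock ticks. */
      double t0 = wall_seconds(), t1;
      do { t1 = wall_seconds(); } while (t1 == t0);

      printf("gettimeofday tick ~ %.1f us\n", (t1 - t0) * 1.0e6);
      printf("MPI_Wtick reports %.3g s resolution\n", MPI_Wtick());

      MPI_Finalize();
      return 0;
  }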
LOW LEVEL BENCHMARKS
The following table compares the performance of the Alpha and SP
for basic CPU operations.
These numbers are from the first 14 kernels of EuroBen's mod1ac.
The 14th kernel is a rough estimate of peak FORTRAN performance since it
has a high re-use of operands.
alpha sp
broadcast 516 368
copy 324 295
addition 285 186
subtraction 288 166
multiply 287 166
division 55 64
dotproduct 609 655
X=X+aY 526 497
Z=X+aY 477 331
y=x1x2+x3x4 433 371
1st ord rec. 110 107
2nd ord rec. 136 61
2nd diff 633 743
9th deg. poly 701 709
basic operations (Mflops) euroben mod1ac
The following table compares the performance of various intrinsics
(EuroBen mod1f).
For the SP, it also shows the effect of -O4 optimization versus -O3.
alpha sp -O4 sp -O3
x**y 8.3 1.8 1.6
sin 13 34.8 8.9
cos 12.8 21.4 7.1
sqrt 45.7 52.1 34.1
exp 15.8 30.7 5.7
log 15.1 30.8 5.2
tan 9.9 18.9 5.5
asin 13.3 10.4 10.2
sinh 10.7 2.3 2.3
intrinsics (Mcalls/s) euroben mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix multiply (REAL*8, 400x400) with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lessl for the SP).
The Mflops for the 1000x1000 Linpack benchmark, as reported at netlib,
are also included.
alpha sp
ftn 71.7 45.2
lib 1181.5 1320.5
linpack 1031 1236
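The kind of loop behind the "ftn" row is sketched below in C (the measured
code was FORTRAN, so this translation and the loop order are assumptions).
Replacing the triple loop with the vendor DGEMM gives the "lib" numbers.

  /* Sketch of an unoptimized 400x400 REAL*8 matrix multiply in C
   * (assumption: the measured code was FORTRAN; loop order may differ).
   * The "lib" row replaces this triple loop with the vendor DGEMM
   * (link with -lcxml on the Alpha, -lessl on the SP). */
  #include <stdio.h>
  #include <sys/time.h>

  #define N 400

  static double a[N][N], b[N][N], c[N][N];

  static double now(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + 1.0e-6 * tv.tv_usec;
  }

  int main(void)
  {
      int i, j, k;

      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
              a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0;
          }

      double t0 = now();
      for (i = 0; i < N; i++)            /* naive i-j-k triple loop */
          for (j = 0; j < N; j++)
              for (k = 0; k < N; k++)
                  c[i][j] += a[i][k] * b[k][j];
      double t1 = now();

      printf("%.1f Mflops (c[0][0]=%g)\n",
             2.0 * N * N * N / (t1 - t0) / 1.0e6, c[0][0]);
      return 0;
  }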
In the following graph, the performance of the
ATLAS DGEMM is compared with the vendor libraries.
The following table compares
optimized FORTRAN performance for EuroBen mod2a,
a matrix-vector product implemented in dot-product (ddot) and axpy forms.
 --------------------------------------------------------------
  Problem size  |  MxV-ddot (Mflop/s)  |  MxV-axpy (Mflop/s)  |
    m       n   |   alpha       sp     |   alpha       sp     |
 --------------------------------------------------------------
   100     100  |   411.7      423.9   |   101.9      401.9   |
   200     200  |   442.3      416.8   |   227.4      421.1   |
   500     500  |    66.1       18.7   |   205.4      411.9   |
  1000    1000  |    31.8       17.1   |   205.6      274.5   |
  2000    2000  |    27.5       16.1   |    66.9      207.9   |
 --------------------------------------------------------------
The following table compares the single
processor performance (Mflops) of the
Alpha and SP for the Euroben mod2g,
a 2-D Haar wavelet transform test.
|-----------------------------------------
| Order | alpha | SP |
| n1 | n2 | (Mflop/s) | (Mflop/s) |
|-----------------------------------------
| 16 | 16 | 142.56 | 79.629 |
| 32 | 16 | 166.61 | 96.690 |
| 32 | 32 | 208.06 | 115.43 |
| 64 | 32 | 146.16 | 108.74 |
| 64 | 64 | 111.46 | 111.46 |
| 128 | 64 | 114.93 | 101.49 |
| 128 | 128 | 104.46 | 97.785 |
| 256 | 128 | 86.869 | 64.246 |
| 256 | 256 | 71.033 | 44.159 |
| 512 | 256 | 65.295 | 41.964 |
|-----------------------------------------
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl).
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
for both optimized FORTRAN and using the BLAS from the vendor library.
For the Alpha, -O4 optimization failed, so this data uses -O3.
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
Memory performance
Both the SP and the Alpha have 64 KB L1 caches and 8 MB L2 caches.
The following figure shows the data rates for a simple FORTRAN
loop to load ( y = y+x(i)), store (y(i)=1), and
copy (y(i)=x(i)), for different vector sizes.
Data is also included for four threads.
At the tail end of the graph above, the program starts fetching
data from main memory.
For load, a single Alpha thread reads data at 1.7 GB/s,
the SP at 787 MB/s.
With four threads, the per-CPU load rate drops to 811 MB/s on
the Alpha and 322 MB/s on the SP.
The aggregate rate for 4 CPUs in this test is then
3.2 GB/s for the Alpha compared to 1.3 GB/s for the SP.
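The following C sketch shows the style of the load/store/copy loops (the
measured code was FORTRAN; this C translation and the vector length are
assumptions):

  /* Sketch of the load/store/copy loops (the measured code was FORTRAN;
   * this C translation and the vector length are assumptions). */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  static double now(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + 1.0e-6 * tv.tv_usec;
  }

  int main(void)
  {
      const int n = 4 * 1024 * 1024;              /* well past the 8 MB L2 */
      double *x = malloc(n * sizeof(double));
      double *y = malloc(n * sizeof(double));
      double s = 0.0, t;
      int i;

      for (i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

      t = now();                                   /* load:  s = s + x(i) */
      for (i = 0; i < n; i++) s += x[i];
      printf("load  %6.0f MB/s (s=%g)\n", n * 8.0 / (now() - t) / 1.0e6, s);

      t = now();                                   /* store: y(i) = 1     */
      for (i = 0; i < n; i++) y[i] = 1.0;
      printf("store %6.0f MB/s\n", n * 8.0 / (now() - t) / 1.0e6);

      t = now();                                   /* copy:  y(i) = x(i)  */
      for (i = 0; i < n; i++) y[i] = x[i];
      printf("copy  %6.0f MB/s (counting 8 bytes/element)\n",
             n * 8.0 / (now() - t) / 1.0e6);

      free(x); free(y);
      return 0;
  }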
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
The following table shows the memory data rates for a single
processor.
 Stream, 1 CPU        alpha         sp3
 Function          Rate (MB/s)   Rate (MB/s)
 Copy:              1090.6601     598.9804
 Scale:              997.5083     576.2223
 Add:               1058.0155     770.8110
 Triad:             1133.4106     780.0816
 stream (C) memory throughput
The aggregate data rate for multiple threads is reported
in the following table (input arguments: threads*2000000,0,10).
Recall that the "peak" data rate is 5.2 GB/s for the Alpha
and 1.6 GB/s for the SP.
            copy   scale    add   triad   ddot    x+y
 alpha 1    1339    1265   1273    1383   1376   1115
 alpha 2    1768    1711   1839    1886   1852   1729
 alpha 3    2279    2280   2257    2308   2526   1931
 alpha 4    2375    2323   2370    2427   3098   2125
 SP 1        523     561    581     583   1080    729
 SP 2        686     797    813     909   1262    923
 SP 3        833     805    897     914   1282    942
 SP 4        824     799    889     914   1272    927
 stream (f90/omp) multiple threads (aggregate MB/sec)
IBM provided a modified parallel stream.f that allocates the
memory a little differently and uses some pre_load IBM directives.
This version gets improved performance, as indicated in the following
table.
            copy   scale    add   triad
 SP 1        827     799    862     891
 SP 2        869     824    886     926
 SP 3        891     822    878     918
 SP 4        864     809    880     918
The following figure shows the Mflops for one processor for various
problem sizes
for the EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache boundaries are still apparent.
The hint
benchmark measures computation and memory efficiency as
the problem size increases.
The following graph shows the performance of a single processor
for the Alpha (66.9 MQUIPS) and SP (27.3 MQUIPS).
The L1 and L2 cache boundaries are visible.
The
lmbench benchmark
measures various UNIX and system characteristics.
We have some preliminary numbers from runs on a service node and a
compute node of both the Alpha and the SP.
The cache/memory latencies reported by lmbench are
alpha sp3
L1 4 5
L2 27 32
memory 210 300
latency in nanoseconds
File open/close times are much slower on the Alpha, though file
create/delete operations are faster on the Alpha.
EuroBen's mod3a tests matrix computation with file I/O (out
of core).
The following two tables compare the Alpha and SP.
No attempt was made to optimize I/O performance.
Mod3a: Out-of-core Matrix-vector multiplication
Alpha
--------------------------------------------------------------------------
Row | Column | Exec. time | Mflop rate | Read rate | Write rate |
(n) | (m) | (sec) | (Mflop/s) | (MB/s) | (MB/s) |
--------------------------------------------------------------------------
25000 | 20000 | 0.56200E-01| 17.793 | 153.63 | 33.945 |
50000 | 20000 | 0.13700 | 14.598 | 117.32 | 35.905 |
100000 | 100000 | 0.67409 | 29.668 | 141.19 | 35.884 |
250000 | 100000 | 2.6982 | 18.531 | 117.61 | 35.770 |
--------------------------------------------------------------------------
SP
--------------------------------------------------------------------------
Row | Column | Exec. time | Mflop rate | Read rate | Write rate |
(n) | (m) | (sec) | (Mflop/s) | (MB/s) | (MB/s) |
--------------------------------------------------------------------------
25000 | 20000 | .81841 | 1.2219 | 244.76 | .27172 |
50000 | 20000 | 1.6479 | 1.2136 | 244.61 | .26217 |
100000 | 100000 | 1.4766 | 13.544 | 241.12 | .84673 |
250000 | 100000 | 3.6024 | 13.879 | 239.51 | 1.1294 |
--------------------------------------------------------------------------
Others have made more rigorous tests of the regular and parallel file
systems.
SHARED-MEMORY BENCHMARKS
Both the Alpha and SP consist of a cluster of shared-memory nodes,
each node with four processors sharing a common memory.
We tested the performance of a shared-memory node using
C programs with explicit thread calls and FORTRAN OpenMP codes.
The following table shows the performance of thread create/join in C as the
master thread creates two, three, and four threads.
The test repeatedly creates and joins the threads; a sketch follows the table.
threads alpha SP
2 47.7 96
3 165 152
4 251 222
thread create/join time in microseconds (C)
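A minimal sketch of the create/join test follows (an approximation of the
actual C code, not the code itself; compile with -lpthread):

  /* Repeatedly create and join NTHREADS threads and report the average
   * time per create/join cycle in microseconds. */
  #include <stdio.h>
  #include <pthread.h>
  #include <sys/time.h>

  #define NTHREADS 4
  #define REPS     1000

  static void *worker(void *arg) { return arg; }   /* trivial thread body */

  static double now(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + 1.0e-6 * tv.tv_usec;
  }

  int main(void)
  {
      pthread_t tid[NTHREADS];
      int i, r;

      double t0 = now();
      for (r = 0; r < REPS; r++) {
          for (i = 0; i < NTHREADS; i++)
              pthread_create(&tid[i], NULL, worker, NULL);
          for (i = 0; i < NTHREADS; i++)
              pthread_join(tid[i], NULL);
      }
      printf("create/join of %d threads: %.1f us\n",
             NTHREADS, (now() - t0) / REPS * 1.0e6);
      return 0;
  }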
Often it is more efficient to create the threads once and then
hand them work as needed.
I suspect this is what FORTRAN OpenMP is doing for "parallel do".
The following table shows the performance of an OpenMP parallel do.
threads alpha SP
2 2.1 12.7
3 3.4 15.3
4 5.2 19.5
OPEN MP parallel DO (us)
Notice that the performance is much better than the explicit thread calls.
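A C/OpenMP analogue of the parallel-do overhead test might look like the
sketch below (the measured code was FORTRAN OpenMP; the loop body and
repetition count here are assumptions):

  /* Time many small parallel regions; only the per-region fork/join
   * cost is being measured. */
  #include <stdio.h>
  #include <omp.h>

  #define REPS 10000

  int main(void)
  {
      double a[128] = {0.0};
      int r, i;

      double t0 = omp_get_wtime();
      for (r = 0; r < REPS; r++) {
          #pragma omp parallel for
          for (i = 0; i < 128; i++)
              a[i] += 1.0;
      }
      double t1 = omp_get_wtime();

      printf("%d threads: %.1f us per parallel region (a[0]=%g)\n",
             omp_get_max_threads(), (t1 - t0) / REPS * 1.0e6, a[0]);
      return 0;
  }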
The following table shows the time required to lock and unlock a mutex
(pthread_mutex_lock/pthread_mutex_unlock) with various numbers of threads.
threads alpha sp
1 0.26 0.57
2 1.5 1.7
3 17.8 7.6
4 29.6 15.6
time for lock/unlock (us)
The following table compares the performance of a simple C barrier
program that uses a single lock and spins on a shared variable,
calling pthread_yield while waiting; a sketch of the barrier follows the table.
A version based on condition variables was an order of magnitude slower.
threads alpha sp
1 0.25 0.6
2 1.36 4.4
3 9.9 20.5
4 65 353
C barrier times (us)
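A sketch of such a spin barrier follows (details of the actual test code
differ; sched_yield() stands in for pthread_yield()):

  /* A single mutex protects the arrival count; waiting threads spin on a
   * generation counter, yielding the CPU between checks. */
  #include <stdio.h>
  #include <pthread.h>
  #include <sched.h>

  #define NTHREADS 4

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static volatile int count = 0;        /* threads that have arrived     */
  static volatile int generation = 0;   /* bumped when the barrier opens */

  static void spin_barrier(void)
  {
      int my_gen;

      pthread_mutex_lock(&lock);
      my_gen = generation;
      if (++count == NTHREADS) {        /* last arrival releases everyone */
          count = 0;
          generation++;
          pthread_mutex_unlock(&lock);
          return;
      }
      pthread_mutex_unlock(&lock);

      while (generation == my_gen)      /* spin until the barrier opens */
          sched_yield();
  }

  static void *worker(void *arg)
  {
      int i;
      for (i = 0; i < 10000; i++)
          spin_barrier();
      return arg;
  }

  int main(void)
  {
      pthread_t tid[NTHREADS];
      int i;

      for (i = 0; i < NTHREADS; i++)
          pthread_create(&tid[i], NULL, worker, NULL);
      for (i = 0; i < NTHREADS; i++)
          pthread_join(tid[i], NULL);
      printf("completed 10000 barriers\n");
      return 0;
  }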
The following table illustrates linear speedup for an embarrassingly
parallel integration (rectangle rule); a sketch of the method follows the table.
A C code with explicit thread management is compared with FORTRAN
OpenMP.
Both used only -O optimization.
             FORTRAN (OpenMP)        C (explicit threads)
 threads      alpha        SP          alpha        SP
    1           252       102            166        52
    2           502       204            331       104
    3           748       306            496       157
    4           990       408            657       206
 rectangle rule (Mflops) -O optimization
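The sketch below shows the method in C/OpenMP form (the measured codes were
FORTRAN OpenMP and explicit C threads; the integrand 4/(1+x*x) and the
rectangle count are stand-in assumptions used only to illustrate the method):

  /* Rectangle-rule integration with an OpenMP reduction. */
  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      const int n = 100000000;           /* number of rectangles */
      const double h = 1.0 / n;
      double sum = 0.0;
      int i;

      double t0 = omp_get_wtime();
      #pragma omp parallel for reduction(+:sum)
      for (i = 0; i < n; i++) {
          double x = (i + 0.5) * h;      /* midpoint of rectangle i */
          sum += 4.0 / (1.0 + x * x);
      }
      double t1 = omp_get_wtime();

      printf("integral ~ %.12f   %.0f Mflops\n",
             sum * h, 6.0 * n / (t1 - t0) / 1.0e6);  /* ~6 flops/rectangle */
      return 0;
  }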
The following table illustrates an explicit thread implementation
of Cholesky factorization of a 1000x1000 double precision
matrix in C (-O optimization).
threads alpha sp
1 150 125
2 269 238
3 369 353
4 435 390
cholp 1k matrix factor (mflops) -O optimization
The following table compares FORTRAN OpenMP on the Alpha and the SP
for a simple Jacobi iteration; a C sketch of the sweep follows the table.
Note that the SP slows down at 4 threads.
 problem size        10K             250K              1M
 threads         alpha     sp     alpha     sp     alpha     sp
    1             4308   3656       175    114        27     17
    2             8262   5707       342    284        42     27
    3            11603   7048       503    421        50     41
    4            14109   4690       655    324        61     41
 iterations per second
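A C/OpenMP sketch of one Jacobi sweep follows (the measured code was FORTRAN
OpenMP; the 1000x1000 grid and 5-point stencil here are assumptions used
only to show the parallel structure):

  /* Jacobi iteration: each sweep averages the four neighbors into a new
   * grid, then the grids are swapped. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  #define N 1000                  /* interior is (N-2) x (N-2) */

  static double (*u)[N], (*unew)[N];

  int main(void)
  {
      int i, j, iter;

      u    = malloc(sizeof(double[N][N]));
      unew = malloc(sizeof(double[N][N]));
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++)
              u[i][j] = unew[i][j] = (i == 0) ? 1.0 : 0.0;  /* hot top edge */

      double t0 = omp_get_wtime();
      for (iter = 0; iter < 100; iter++) {
          #pragma omp parallel for private(j)
          for (i = 1; i < N - 1; i++)
              for (j = 1; j < N - 1; j++)
                  unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                       u[i][j-1] + u[i][j+1]);
          double (*tmp)[N] = u; u = unew; unew = tmp;       /* swap grids */
      }
      double t1 = omp_get_wtime();

      printf("%.1f iterations/second (u[1][1]=%g)\n",
             100.0 / (t1 - t0), u[1][1]);
      free(u); free(unew);
      return 0;
  }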
MESSAGE-PASSING BENCHMARKS
Internode communication can be accomplished with IP, PVM, or MPI.
We report MPI performance over the Alpha Quadrics network and the
IBM SP.
Each node's four CPUs share a single network interface.
However, each CPU is a unique MPI endpoint, so one can measure
both inter-node and intra-node communication.
The following table summarizes the measured communication characteristics
of the Alpha and the SP.
                                    alpha    sp3
 latency (1-way, us)                  5.4   16.3
 bandwidth (echo, MB/s)               199    139
 bandwidth (exchange, MB/s)           167    180
 MPI within a node (MB/s)             622    512
 latency (min, 1-way, us) and bandwidth (MB/s)
                        latency            bandwidth
                        (min 1-way, us)    (MB/s)
 alpha  node               5.5               198
 alpha  cpu                5.8               623
 alpha  IP-sw            123                  77
 alpha  IP-gigE/1500      76                  44
 alpha  IP-100E           70                  11
 sp     node              16.3               139
 sp     cpu                8.1               512
 sp     IP-sw             82                  46
 sp     IP-gigE/1500      91                  47
 sp     IP-gigE/9000     136                  84
 sp     IP-100E           93                  12
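The latency and echo-bandwidth numbers come from a ping-pong test of the
kind sketched below (a simplified reconstruction, not the ParkBench/EuroBen
code itself):

  /* MPI ping-pong: 8-byte round trips give latency, 1 MB round trips give
   * bandwidth.  Run with two ranks, on the same node or different nodes. */
  #include <stdio.h>
  #include <mpi.h>

  #define REPS   1000
  #define MAXLEN (1 << 20)

  int main(int argc, char **argv)
  {
      static char buf[MAXLEN];
      int lens[2] = { 8, MAXLEN };
      int rank, i, l;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (l = 0; l < 2; l++) {
          int len = lens[l];
          MPI_Barrier(MPI_COMM_WORLD);
          double t0 = MPI_Wtime();
          for (i = 0; i < REPS; i++) {
              if (rank == 0) {
                  MPI_Send(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                  MPI_Recv(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
              } else if (rank == 1) {
                  MPI_Recv(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                  MPI_Send(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
              }
          }
          double dt = (MPI_Wtime() - t0) / REPS;   /* round-trip time */
          if (rank == 0)
              printf("len %7d: one-way %7.1f us, %7.1f MB/s\n",
                     len, dt / 2.0 * 1.0e6, len / (dt / 2.0) / 1.0e6);
      }
      MPI_Finalize();
      return 0;
  }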
For comparison, the communication performance of the Alpha and SP3
is plotted in the following graph along with data from
Dongarra and Dunigan,
"Message-Passing Performance of Various Computers" (1997).
We also measured the shmem_put latency between two Alpha nodes
to be 3.2 microseconds for 4 bytes.
The following graph shows the bandwidth for communication between
two nodes using MPI.
Data is from both EuroBen's mod1h and ParkBench comms1.
The following graph shows bandwidth for communication between two
processors on the same node using MPI.
The SP performs better for smaller messages.
In the following, we plot the Alpha intra-node bandwidth together
with its inter-node bandwidth.
Notice that for small message sizes on the Alpha,
it is faster to pass messages
between nodes than between two CPUs on the same node.
We also measured the jitter in round-trip times on the Alpha
and SP between neighboring nodes and distant nodes.
The Alpha shows less variation in round-trip times than the SP.
A test between node 1 and node 150 on the SP also shows slightly
longer round-trip time and increased jitter.
Jitter between node 1 and node 64 on our Alpha was not noticeably
different from that between two adjacent nodes.
We measured the bidirectional bandwidth between nodes (and intra-node)
using MPI_Sendrecv.
(MPI_Irecv produced the same performance.)
The interconnect fabrics of both the Alpha and the SP support full
bidirectional bandwidth, but as illustrated in the following graph,
software and the NICs limit the measured performance.
For the Alpha, the exchange bandwidth (167 MB/s) is less than the
unidirectional bandwidth (200 MB/s).
As noted before, the Alpha's internode performance surpasses its
intranode performance for small messages.
For large messages, the SP outperforms the Alpha for internode exchanges.
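A minimal sketch of the exchange measurement follows (an assumption about
the code's exact form, not the code itself):

  /* Both ranks send and receive simultaneously with MPI_Sendrecv; the
   * exchange bandwidth counts the bytes each rank sends. */
  #include <stdio.h>
  #include <mpi.h>

  #define LEN  (1 << 20)
  #define REPS 100

  int main(int argc, char **argv)
  {
      static char sbuf[LEN], rbuf[LEN];
      int rank, other, i;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      other = 1 - rank;                 /* run with exactly two ranks */

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (i = 0; i < REPS; i++)
          MPI_Sendrecv(sbuf, LEN, MPI_BYTE, other, 0,
                       rbuf, LEN, MPI_BYTE, other, 0,
                       MPI_COMM_WORLD, &status);
      double dt = (MPI_Wtime() - t0) / REPS;

      if (rank == 0)
          printf("exchange bandwidth %.1f MB/s per direction\n",
                 LEN / dt / 1.0e6);
      MPI_Finalize();
      return 0;
  }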
We measured the bisection bandwidth between nodes by streaming
one-megabyte messages from the lower half of the nodes to the upper half.
The following table shows that the SP exhibits some contention as
more nodes participate.
(Caution: these results have not yet met our quality-assurance standards;
there is a lot of variation in the SP numbers, possibly due to other jobs running.)
              per node pair            aggregate
 nodes       alpha       sp         alpha       sp
   2         195.8     138.8          196      139
   4         194.2     138.7          388      277
   8         188.2     138            753      552
  16         188.3     132           1560     1056
  32         179.8     130           2877     2080
  48         169.3     128           4063     3072
  56         157.1     112           4399     3136
  64         171.1     124           5475     3968
 average bisection bandwidth (MB/s), preliminary
Since all four processors on a node share one network interface, and
since one processor can saturate the interface, multiple
processors sending concurrently off the node each get only a portion
of the available bandwidth.
Two processors sending to the other two processors on the same node
get an aggregate throughput of 727 MB/s on the Alpha (914 MB/s on the SP).
The following tables show the performance of collective communication
operations (barrier, broadcast, sum reduction) using one processor
per node (N) and all four processors on each node (n);
a timing sketch follows the tables.
Times are in microseconds.
mpibarrier (average us)
cpus alpha-N alpha-n sp-N sp-n
2 7 11 22 10
4 7 16 45 20
8 8 18 69 157
16 9 21 93 230
32 11 28 118 329
64 37 145 419
mpibcast (8 bytes)
cpus alpha-N alpha-n sp-N sp-n
2 9.6 12.5 5.4 6.7
4 10.4 20.3 9.4 9.4
8 11.4 28.5 13.4 17.5
16 12.5 32.9 17.0 20.9
32 13.8 41.4 19.3 24.1
64 48.7 23.6 30.8
mpireduce (SUM, doubleword)
cpus alpha-N alpha-n sp-N sp-n
2 9 11 8 9
4 190 207 29 133
8 623 350 271 484
16 1117 604 683 1132
32 3176 1991 1613 2193
64 5921 2841 3449
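The collective timings are averages over repeated calls, along the lines of
the following sketch (simplified; the repetition count is an assumption):

  /* Time MPI_Barrier, an 8-byte MPI_Bcast, and a doubleword sum reduction.
   * Runs are made with one task per node (N) and four tasks per node (n). */
  #include <stdio.h>
  #include <mpi.h>

  #define REPS 1000

  int main(int argc, char **argv)
  {
      int rank, i;
      double val = 1.0, sum = 0.0, t0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      t0 = MPI_Wtime();
      for (i = 0; i < REPS; i++)
          MPI_Barrier(MPI_COMM_WORLD);
      if (rank == 0)
          printf("barrier   %.1f us\n", (MPI_Wtime() - t0) / REPS * 1.0e6);

      t0 = MPI_Wtime();
      for (i = 0; i < REPS; i++)
          MPI_Bcast(&val, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("broadcast %.1f us\n", (MPI_Wtime() - t0) / REPS * 1.0e6);

      t0 = MPI_Wtime();
      for (i = 0; i < REPS; i++)
          MPI_Reduce(&val, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("reduce    %.1f us\n", (MPI_Wtime() - t0) / REPS * 1.0e6);

      MPI_Finalize();
      return 0;
  }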
PARALLEL KERNEL BENCHMARKS
Both ParkBench and EuroBen (euroben-dm) provide MPI-based parallel
kernels.
However, the euroben-dm communication model has each process
do all of its sends before issuing its receives.
On the SP, this model resulted in deadlock for the larger problem
sizes.
The EAGER_LIMIT can be adjusted to make some progress on the SP, but
the deadlocks could not be completely eliminated.
MPI buffering on the Alpha was adequate;
the maximum MPI buffering on an SP node was 64 MB, versus 191 MB on the Alpha.
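The sketch below reconstructs the problematic pattern (a minimal
illustration, not the euroben-dm code): both ranks issue a blocking send
first, which completes only while the MPI library can buffer the message;
posting the receive first (MPI_Irecv) or using MPI_Sendrecv avoids the
deadlock.

  #include <stdio.h>
  #include <mpi.h>

  #define COUNT (1 << 20)   /* 1M doubles = 8 MB, beyond eager buffering */

  static double sbuf[COUNT], rbuf[COUNT];

  int main(int argc, char **argv)
  {
      int rank, other;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      other = 1 - rank;                 /* run with exactly two ranks */

      /* Unsafe: both ranks send first, then receive.  This works only
       * while the library can buffer the whole message. */
      MPI_Send(sbuf, COUNT, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
      MPI_Recv(rbuf, COUNT, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);

      /* Safe alternatives: post the receive first with MPI_Irecv, or let
       * MPI_Sendrecv pair the operations. */

      if (rank == 0) printf("completed (buffering was sufficient)\n");
      MPI_Finalize();
      return 0;
  }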
The following table shows the MPI parallel performance of the LU benchmark
(64x64x64) for the Alpha and SP.
The first column pair uses one MPI task per node; the second pair
uses all four processors on each node.
These tests used standard FORTRAN (no vendor libraries).
            Nodes (1 task per node)    CPUs (4 tasks per node)
 tasks         alpha          sp          alpha          sp
    2         786.08       617.98        762.92       588.16
    4        1708.1       1387.03       1604.05      1188.02
    8        3384.03      2561.97       3265.83      2473.80
   16        6190.89      5593.18       5556.02      4771.66
 aggregate Mflops
Results for the FT benchmark follow.
            Nodes (1 task per node)    CPUs (4 tasks per node)
 tasks         alpha          sp          alpha          sp
    4            633           465           580           307
    8           1198           925           849           553
   16           2221          1890          1019          1056
 aggregate Mflops
Results for the NAS SP benchmark follow.
            Nodes (1 task per node)    CPUs (4 tasks per node)
 tasks         alpha          sp          alpha          sp
    4            877           632           734           416
    9           2310          1623          1837          1225
   16           4344          2920          3143          2252
 aggregate Mflops
The following plots the aggregate Mflop performance for the ParkBench
QR factorization of a 1000x1000 double-precision matrix.
One can compare the performance of optimized FORTRAN against the
vendor libraries (cxml/essl), and the difference in performance
when using all four processors on a node versus just one processor (N)
per node.
Links/References
- ORNL's Pat Worley's
alpha evaluation
and
CRM performance on SP and ALPHA (lots of sqrt's)
PSTWM performance
- ORNL CCS pages for
Alpha SC
and SP3
- UT student Jay Patel's
results, July 2000
- Compaq
alpha ES40 cluster info
and
EV6 chip paper
and Alpha 21264 hardware ref. and compiler writer's guide
- Compaq's alpha server performance info
- Alpha's
Quadrics switch
or older Meiko fat-tree network
- IBM
large scale system info white papers
- IBM papers on POWER3
and
here
-
RS6000 switch performance
and
SP2 architecture paper
and other
sp2 articles
-
power3 tutorial and IBM
SP scientific redbook
- AIX/SP
thread tuning
and
poe/aix environment variables
-
peak performance for Power 3
-
power 3 high nodes
- IBM's essl scientific library
and pessl parallel essl
and mass
intrinsics
and other optimization libraries mass, mpi, lapi, essl
- ParkBench
or euroben
or NAS parallel benchmarks
or hint
- stream benchmark
and splash
and lmbench
and mpio benchmarks
- PDS: The Performance Database Server
linpack and such
- hpl high perf linpack
for distributed memory
and ATLAS
- benchmarks papers
- atlas
- openmp
and
NASPB on OpenMP
and
PBN source
- openmp
microbenchmarks
- SPEC
- UT's papi
performance counter API
- Heller's
rabbit
-
Monitoring Application Performance Using Hardware Counters
-
cpu timers japanese
- ANL's MPICH performance
thd@ornl.gov