ORNL Sun Excalibur UltraSPARC-III tests
This data was collected in the fall of 2000.
Oak Ridge National Laboratory (ORNL) is currently performing an in-depth
evaluation of the Compaq AlphaServer SC parallel architecture
as part of its
Evaluation of Early Systems project.
The primary tasks of the evaluation are to
- determine the most effective approaches to using the AlphaServer SC;
- evaluate benchmark and application performance, and compare with similar
systems from other vendors;
- evaluate system and system administration software reliability and
performance;
- predict scalability, both in terms of problem size and in number of
processors.
The emphasis of the evaluation is on application-relevant studies
for applications of importance to DOE.
However, standard benchmarks are still important for comparisons
with other systems.
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
A Compaq Alpha cluster and an IBM SP3 cluster at ORNL
were used for comparison with the Sun UltraSPARC-III
in the results presented below.
The results below are in the following categories: architecture,
benchmarks, low-level benchmarks (including memory performance and I/O),
shared-memory benchmarks, and message-passing benchmarks.
ARCHITECTURE
The Sparc III unit we tested had two processors sharing memory.
Both the Alpha and SP consist of four processors sharing memory
on a single node.
The following table summarizes the main characteristics of
the Alpha, SP, and Sparc3.
Specs:          Alpha SC    SP3        Sparc3
MHz               667        375        750
memory/node       2 GB       2 GB       1 GB
L1                64 KB      64 KB      64 KB
L2                8 MB       8 MB       8 MB
peak Mflops      2*MHz      4*MHz      2*MHz
peak mem BW     5.2 GB/s    1.6 GB/s   2.4 GB/s
The Sparc3 is running Solaris 8 (SunOS 5.8). We re-compiled and re-ran our
tests with Forte Developer 6 (update 1).
Sparc-III versions
SunOS roadrunner 5.8 Generic_108528-03 sun4u sparc SUNW,Sun-Blade-1000
cc: Sun WorkShop 6 update 1 C 5.2 2000/08/14
f90: Sun WorkShop 6 update 1 Fortran 95 6.1 2000/08/14
BENCHMARKS
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of the Alpha SC cluster.
Some of the older benchmarks may need to be modified for these
newer, faster machines: increasing repetitions to avoid zero elapsed times,
and increasing problem sizes to test out-of-cache performance.
Unless otherwise noted, the following compiler switches
were used:
Sparc3: -fast -dalign -xarch=native -xO5
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries,
sunperf (Sun), cxml (Alpha)
and essl (SP).
We used the following benchmarks in our tests:
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalue, FFT, QR).
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
The web site includes results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For the Sun, Alpha, and SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
All have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits, so it rolls over in less than 7 seconds.
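For illustration, the timing harness in our custom benchmarks amounts to
the following (a minimal sketch; the repetition count is what keeps the
elapsed time well above the clock resolution on these fast machines):

    #include <stdio.h>
    #include <sys/time.h>

    /* wall-clock seconds from gettimeofday() (microsecond resolution) */
    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    int main(void)
    {
        int i, reps = 1000000;      /* enough repetitions to avoid 0 elapsed times */
        volatile double x = 1.0;    /* volatile so the loop is not optimized away */
        double t0, t1;

        t0 = walltime();
        for (i = 0; i < reps; i++)
            x = x * 1.000001;       /* operation being timed */
        t1 = walltime();
        printf("%g us per operation\n", 1.0e6 * (t1 - t0) / reps);
        return 0;
    }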
LOW LEVEL BENCHMARKS
The following table compares the performance of the Alpha and SP
for basic CPU operations.
These numbers (peak average Mflops)
are from the first 14 kernels of EuroBen's mod1ac.
The 14th kernel (the 9th-degree polynomial) is a rough estimate of peak
FORTRAN performance since it has a high re-use of operands.
alpha sp sparc3
broadcast 516 368 310
copy 324 295 272
addition 285 186 212
subtraction 288 166 204
multiply 287 166 200
division 55 64 44
dotproduct 609 655 672
X=X+aY 526 497 448
Z=X+aY 477 331 350
y=x1x2+x3x4 433 371 353
1st ord rec. 110 107 45
2nd ord rec. 136 61 73
2nd diff 633 743 714
9th deg. poly 701 709 1393
basic operations (Mflops) euroben mod1ac
The following table compares the performance of various intrinsics
(EuroBen mod1f).
          alpha   sp(-O4)   sparc3
x**y 8.3 1.8 3.1
sin 13 34.8 23.4
cos 12.8 21.4 16.6
sqrt 45.7 52.1 27.1
exp 15.8 30.7 29.8
log 15.1 30.8 28.9
tan 9.9 18.9 5.9
asin 13.3 10.4 4.4
sinh 10.7 2.3 2.2
intrinsics (Mcalls/s) euroben mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix multiply (REAL*8, 400x400) with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lessl for the SP, -lsunperf for the Sparc).
The Mflops for the 1000x1000 Linpack benchmark, as reported
at netlib, are also included.
alpha sp sparc3
ftn 71.7 45.2 168
lib 1181.5 1320.5 640
linpack 1031 1236 -
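For reference, the library kernel is just a call to the standard BLAS
DGEMM; a minimal sketch from C follows (linked with -lsunperf, -lcxml, or
-lessl; the trailing-underscore name and column-major layout assume the
usual Fortran calling convention, which may vary by compiler):

    #define N 400

    /* Fortran BLAS: C = alpha*A*B + beta*C */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    static double a[N*N], b[N*N], c[N*N];

    void matmul(void)
    {
        int n = N;
        double one = 1.0, zero = 0.0;
        /* 2*N^3 flops; divide by the elapsed time for the Mflops rate */
        dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
    }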
In the following graph, the performance of the
ATLAS DGEMM is compared with the vendor libraries.
Notice that the ATLAS library outperforms the Sun sunperf library.
The ATLAS sparc3 build does not use -fast.
The following table compares
optimized FORTRAN performance for Euroben mod2a,
matrix-vector dot product and product.
------------------------------------------------------------------------
                    MxV-ddot (Mflops)          MxV-axpy (Mflops)
   m   |   n   |  alpha |   sp   | sparc3 |  alpha |   sp   | sparc3 |
------------------------------------------------------------------------
  100  |  100  |  411.7 |  423.9 |  332.9 |  101.9 |  401.9 |  359.4 |
  200  |  200  |  442.3 |  416.8 |  322.2 |  227.4 |  421.1 |  318.0 |
  500  |  500  |   66.1 |   18.7 |  306.4 |  205.4 |  411.9 |  299.0 |
 1000  | 1000  |   31.8 |   17.1 |  251.1 |  205.6 |  274.5 |  262.0 |
 2000  | 2000  |   27.5 |   16.1 |  136.0 |   66.9 |  207.9 |  139.5 |
------------------------------------------------------------------------
The following table compares the single
processor performance (Mflops) of the
Alpha, SP, and Sparc3 for EuroBen mod2g,
a 2-D Haar wavelet transform test.
|----------------------------------------------------
| Order | alpha | SP | sparc3 |
| n1 | n2 | (Mflop/s) | (Mflop/s) | (Mflop/s)|
|----------------------------------------------------
| 16 | 16 | 142.56 | 79.63 | 86.36 |
| 32 | 16 | 166.61 | 96.69 | 85.82 |
| 32 | 32 | 208.06 | 115.43 | 98.12 |
| 64 | 32 | 146.16 | 108.74 | 96.53 |
| 64 | 64 | 111.46 | 111.46 | 84.15 |
| 128 | 64 | 114.93 | 101.49 | 89.63 |
| 128 | 128 | 104.46 | 97.785 | 92.73 |
| 256 | 128 | 86.869 | 64.246 | 72.89 |
| 256 | 256 | 71.033 | 44.159 | 54.74 |
| 512 | 256 | 65.295 | 41.964 | 55.30 |
|----------------------------------------------------
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library
(cxml/essl/sunperf).
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
for both optimized FORTRAN and using the BLAS from the vendor library.
For the Alpha, -O4 optimization failed, so this data uses -O3.
For the Sun, -fast failed with the Forte update 1 compiler, so this
plot is without -fast (the older compiler version was faster with -fast).
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
Memory performance
The Sparc, SP, and Alpha have 64 KB L1 caches and 8 MB L2 caches.
The following figure shows the data rates for a simple
FORTRAN load loop (y = y + x(i)) for different vector sizes.
For the Sparc we compare with and without the -fast compiler
option.
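The loop itself is trivial; a C rendering of the load test is sketched
below (the returned sum keeps the loads live, and n is swept from
in-cache sizes to well beyond the 8 MB L2):

    /* load test: y = y + x(i); data rate is 8*n bytes over the elapsed time */
    double load_sum(const double *x, int n)
    {
        int i;
        double y = 0.0;
        for (i = 0; i < n; i++)
            y += x[i];
        return y;
    }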
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
The following table shows the memory data rates for a single
processor.
Stream 1 CPU       alpha        sp3       sparc3
Function                  Rate (MB/s)
Copy:           1090.6601   598.9804   418.7604
Scale:           997.5083   576.2223   526.0903
Add:            1058.0155   770.8110   406.0707
Triad:          1133.4106   780.0816   447.6192
stream (C) memory throughput
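The stream kernels are simple vector operations over arrays much larger
than the caches; the triad, for example, looks like this (a sketch; the
real benchmark times each kernel over several trials and reports the best):

    #define STREAM_N 2000000    /* 16 MB per array, well past the 8 MB L2 */
    static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

    void triad(double scalar)
    {
        int j;
        /* 24 bytes of memory traffic per iteration: two loads, one store */
        for (j = 0; j < STREAM_N; j++)
            a[j] = b[j] + scalar * c[j];
    }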
The aggregate data rate for multiple threads (f90/OpenMP) is reported
in the following table (input arguments: threads*2000000,0,10).
The last two columns are from an explicitly threaded C code.
copy scale add triad ddot x+y
sparc1 490 419 399 364 248 381
sparc2 780 670 659 614 466 639
alpha1 1339 1265 1273 1383 1376 1115
alpha2 1768 1711 1839 1886 1852 1729
SP 1 523 561 581 583 1080 729
SP 2 686 797 813 909 1262 923
stream (f90/omp) multiple threads (aggregate MB/sec)
The following figure shows the Mflops for one processor at various
problem sizes for EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache boundaries are still apparent.
The hint
benchmark measures computation and memory efficiency as
the problem size increases.
The following graph shows the performance of a single processor
for the Alpha (66.9 MQUIPS), Sun (42.7 MQUIPS), and SP (27.3 MQUIPS).
The L1 and L2 cache boundaries are visible.
The runtime for a FORTRAN molecular dynamics code,
mdbnch, on the
Sparc3 was 5.3 seconds (4.2 seconds on the SP, 3.6 seconds on the Alpha).
The
lmbench benchmark
measures various UNIX and system characteristics.
Here are some preliminary numbers
for runs on the sparc3, alpha, and SP.
The cache/memory latencies reported by lmbench are
alpha sp3 sparc3
L1 4 5 2
L2 27 32 17
memory 210 300 180
latency in nanoseconds
Open/close times in lmbench
are much slower for the Alpha, though file create/delete
are faster on the Alpha.
EuroBen's mod3a tests matrix computation with file I/O (out
of core).
The following tables compare mod3a performance for the Sparc3, Alpha, and SP.
No attempt was made to optimize I/O performance.
Mod3a: Out-of-core Matrix-vector multiplication
Alpha
--------------------------------------------------------------------------
Row | Column | Exec. time | Mflop rate | Read rate | Write rate |
(n) | (m) | (sec) | (Mflop/s) | (MB/s) | (MB/s) |
--------------------------------------------------------------------------
25000 | 20000 | 0.56200E-01| 17.793 | 153.63 | 33.945 |
50000 | 20000 | 0.13700 | 14.598 | 117.32 | 35.905 |
100000 | 100000 | 0.67409 | 29.668 | 141.19 | 35.884 |
250000 | 100000 | 2.6982 | 18.531 | 117.61 | 35.770 |
--------------------------------------------------------------------------
Sparc3
--------------------------------------------------------------------------
25000 | 20000 | 0.12826 | 7.7968 | 59.021 | 22.935 |
50000 | 20000 | 0.30209 | 6.6205 | 59.577 | 6.0399 |
100000 | 100000 | 1.4451 | 13.840 | 61.763 | 12.818 |
250000 | 100000 | 5.6117 | 8.9098 | 53.766 | 14.579 |
--------------------------------------------------------------------------
SP
--------------------------------------------------------------------------
25000 | 20000 | .81841 | 1.2219 | 244.76 | .27172 |
50000 | 20000 | 1.6479 | 1.2136 | 244.61 | .26217 |
100000 | 100000 | 1.4766 | 13.544 | 241.12 | .84673 |
250000 | 100000 | 3.6024 | 13.879 | 239.51 | 1.1294 |
--------------------------------------------------------------------------
Three simple I/O tests were used to write and read a 100 MB file
and a 1 GB file using 8 KB blocks on the 18 GB SCSI drive of the Sparc3.
                 100 MB             1 GB
 Test         Write   Read      Write   Read
 Bonnie        22.1  408.2       23.3   62.9
 iozone        19.5  405.9       23.8   70.5
 thdio*        18.6  374.4       19.8   67.6
 data rate (MB/s)     * thdio uses fsync() before close on write
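The thdio write test is essentially the loop below (a sketch, assuming a
100 MB file in 8 KB blocks; the fsync() before close is what keeps
buffered data from inflating the write rate):

    #include <fcntl.h>
    #include <unistd.h>

    #define BLK   8192
    #define NBLK  (100*1024*1024/BLK)   /* 100 MB in 8 KB blocks */

    void write_test(const char *path)
    {
        static char buf[BLK];
        int i, fd;

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (i = 0; i < NBLK; i++)
            write(fd, buf, BLK);
        fsync(fd);      /* flush to disk before the clock stops */
        close(fd);
    }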
SHARED-MEMORY BENCHMARKS
Both the Alpha and SP consist of a cluster of shared-memory nodes,
each node with four processors sharing a common memory.
We tested shared-memory performance with various
C programs with explicit thread calls and with FORTRAN OpenMP codes.
The following table shows the performance of thread create/join in C
for two processors.
The test repeatedly creates and joins threads.
Often it is more efficient to create the threads once and then
provide them work as needed; I suspect this is what FORTRAN OpenMP
does for "parallel do".
The table also includes the time for an iterative test of a FORTRAN
parallel do on two processors.
(A sketch of the create/join test follows the table.)
alpha SP sparc3
C threads 47.7 96 52
FORTRAN do 2.1 12.7 5.3
thread create/join time in microseconds
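The create/join test is just the loop below (a minimal sketch; each
iteration pays the full cost of creating and reaping a thread, which is
why a pre-created worker pool is cheaper):

    #include <pthread.h>

    static void *nullwork(void *arg) { return arg; }

    /* time this loop and divide by reps for the create/join cost */
    void create_join(int reps)
    {
        pthread_t tid;
        int i;
        for (i = 0; i < reps; i++) {
            pthread_create(&tid, NULL, nullwork, NULL);
            pthread_join(tid, NULL);
        }
    }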
The following table shows the time required to lock-unlock
using pthread_mutex_lock with varying numbers of threads.
threads alpha sp sparc3
1 0.26 0.57 0.22
2 1.5 1.7 1.6
time for lock/unlock (us)
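The lock test just times the pair of calls in a tight loop (sketch;
contention appears when more than one thread runs the loop):

    #include <pthread.h>

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

    void lock_unlock(int reps)
    {
        int i;
        for (i = 0; i < reps; i++) {
            pthread_mutex_lock(&mtx);
            pthread_mutex_unlock(&mtx);
        }
    }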
The following table compares the performance of a simple C barrier
program using a single lock and spinning on a shared variable
along with pthread_yield.
A version based on condition variables was an order of magnitude slower.
threads alpha sp sparc3
1 0.25 0.6 0.27
2 1.36 4.4 1.6
C barrier times (us)
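A minimal sketch of the lock-and-spin barrier follows (the generation
counter lets the barrier be reused; threads yield while spinning so
waiters give up the CPU, with the portable sched_yield() standing in here
for pthread_yield()):

    #include <pthread.h>
    #include <sched.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int count = 0;
    static volatile int generation = 0;
    static int nthreads = 2;

    void barrier(void)
    {
        int my_gen;

        pthread_mutex_lock(&lock);
        my_gen = generation;
        if (++count == nthreads) {      /* last arrival releases the rest */
            count = 0;
            generation++;
            pthread_mutex_unlock(&lock);
            return;
        }
        pthread_mutex_unlock(&lock);
        while (generation == my_gen)    /* spin on the shared variable */
            sched_yield();
    }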
The following table illustrates linear speedup for an embarrassingly
parallel integration.
A C code with explicit thread management is compared with FORTRAN
Open MP.
Both just used -O optimization.
                 FORTRAN                 C threads
 threads   alpha    SP   sparc3    alpha    SP   sparc3
    1       252    102     264      166     52      75
    2       502    204     526      331    104     149
 rectangle rule (Mflops), -O optimization
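For illustration, the explicit-thread version amounts to the following (a
minimal sketch integrating 4/(1+x^2) on [0,1], which gives pi; the actual
integrand and interval counts in our test may have differed):

    #include <pthread.h>
    #include <stdio.h>

    #define N        10000000
    #define NTHREADS 2

    static double partial[NTHREADS];

    static void *integrate(void *arg)
    {
        long id = (long)arg;
        long i;
        double h = 1.0 / N, sum = 0.0;

        for (i = id; i < N; i += NTHREADS) {    /* interleaved strips */
            double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        partial[id] = sum * h;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        long t;
        double pi = 0.0;

        for (t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, integrate, (void *)t);
        for (t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            pi += partial[t];
        }
        printf("pi = %.9f\n", pi);
        return 0;
    }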
The following table illustrates an explicit thread implementation
of Cholesky factorization of a 1000x1000 double precision
matrix in C (-O optimization).
threads alpha sp sparc3
1 150 125 84
2 269 238 159
cholp 1k matrix factor (mflops) -O optimization
The following table compares optimized FORTRAN OpenMP
doing a simple Jacobi iteration.
 problem size       100x100              1000x1000
 threads      alpha    sp  sparc3   alpha    sp  sparc3
    1          4308  3656    3071      27    17     20
    2          8262  5707    5278      42    27     28
 iterations per second
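The kernel being timed is a single Jacobi sweep; a C/OpenMP rendering is
sketched below (the measurements above used FORTRAN OpenMP, so this is
just an illustrative equivalent):

    #include <omp.h>

    #define N 1000      /* 100 for the small case */

    static double u[N][N], unew[N][N];

    void jacobi_sweep(void)
    {
        int i, j;
        /* each thread takes a block of rows; four-point stencil average */
        #pragma omp parallel for private(j)
        for (i = 1; i < N - 1; i++)
            for (j = 1; j < N - 1; j++)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                     u[i][j-1] + u[i][j+1]);
    }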
MESSAGE-PASSING BENCHMARKS
Internode communication can be accomplished with IP, PVM, or MPI.
We report MPI performance using shared-memory on the Sparc3 (MPICH)
and within a node on the SP and Alpha.
The following table summarizes the measured communication
characteristics of the three machines, using an 8-byte message for
latency and a 1 MB message for throughput.
We also include node-to-node times for the SP and Alpha and network (IP)
performance.
(A sketch of the ping-pong test follows the table.)
                     latency          bandwidth
                  (min 1-way, us)      (MB/s)
sparc3 cpu 1.5 521
sparc3 IP-gigE/1500 142 61 (measured with alpha)
alpha node 5.5 198
alpha cpu 5.8 623
alpha IP-sw 123 48
alpha IP-gigE/1500 76 44
alpha IP-100E 70 11
sp node 16.3 139
sp cpu 8.1 512
sp IP-sw 82 46
sp IP-gigE/1500 91 47
sp IP-gigE/9000 136 84
sp IP-100E 93 12
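The latency/bandwidth numbers come from a ping-pong test along these
lines (a minimal sketch: half the averaged round-trip time is the one-way
latency, and message bytes over one-way time gives the bandwidth):

    #include <mpi.h>
    #include <stdio.h>

    static char buf[1 << 20];   /* 8 bytes for latency, 1 MB for bandwidth */

    int main(int argc, char **argv)
    {
        int rank, i, reps = 1000, nbytes = 8;
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency %g us\n", 1.0e6 * (t1 - t0) / (2.0 * reps));
        MPI_Finalize();
        return 0;
    }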
The following graph shows the MPI bandwidth for communications between
two processors measured with EuroBen's mod1h.
Since we have only two Sparc3 processors, we can't say much about the
scaling of MPI communications, but the following table
shows the performance of aggregate communication operations
(barrier, broadcast, sum-reduction) using two processors on the three
architectures.
time (us)
Test sparc3 alpha SP
mpibarrier 8 11 10
mpibcast 3.1 12.5 6.7
mpireduce 4.3 11 9
Links/References
- Worley's Sparc3 PSTSWM results
- ORNL's Dunigan's Alpha/SP parallel performance
- ORNL's Pat Worley's Alpha evaluation
- ORNL CCS pages for the Alpha SC and the SP3
- UT student Jay Patel's results, July 2000
- Sparc3 announcement
- Sun Sparc 3 and Solaris
- Sun performance tuning and Sun HPC docs
- Sun Starfire Enterprise 10000, its performance, and Sun HPC servers
- Compaq Alpha ES40 cluster info, EV6 chip paper, Alpha 21264 hardware
  reference, and compiler writer's guide
- Compaq's AlphaServer performance info
- Alpha's Quadrics switch and the older Meiko fat-tree network
- IBM papers on POWER3, RS/6000 switch performance, the SP2 architecture
  paper, and other SP2 articles
- POWER3 tutorial and IBM SP scientific computing redbook
- IBM's ESSL scientific library, MASS intrinsics, and other optimization
  libraries (MASS, MPI, LAPI, ESSL)
- ParkBench, EuroBen, NAS parallel benchmarks, and hint
- stream benchmark, SPLASH, and lmbench
- PDS: The Performance Database Server (Linpack and such)
- benchmark papers
- ATLAS
- SPEC
- UT's PAPI performance counter API
- Heller's rabbit
- Monitoring Application Performance Using Hardware Counters
- CPU timers (Japanese)
- ANL's MPICH performance