ORNL Sun Excalibur UltraSPARC-III tests
This data was collected in the fall of 2000.
Oak Ridge National Laboratory (ORNL) is currently performing an in-depth
evaluation of the Compaq AlphaServer SC parallel architecture
as part of its
Evaluation of Early Systems project.
The primary tasks of the evaluation are to
- determine the most effective approaches to using the AlphaServer SC;
- evaluate benchmark and application performance, and compare with similar
systems from other vendors;
- evaluate system and system administration software reliability and
performance;
- predict scalability, both in terms of problem size and in number of
processors.
The emphasis of the evaluation is on application-relevant studies
for applications of importance to DOE.
However, standard benchmarks are still important for comparisons
with other systems.
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
A Compaq Alpha cluster and an IBM SP3 cluster at ORNL
were used for comparison with the Sun UltraSPARC-III
in the results presented below.
The results below are in the following categories: architecture,
benchmarks, low-level benchmarks (including memory performance and I/O),
shared-memory benchmarks, and message-passing benchmarks.
ARCHITECTURE
The Sparc III unit we tested had two processors sharing memory.
Both the Alpha and SP consist of four processors sharing memory
on a single node.
The following table summarizes the main characteristics of
the Alpha, SP, and Sparc3.
Specs:          Alpha SC    SP3        Sparc3
MHz               667        375        750
memory/node       2 GB       2 GB       1 GB
L1                64 KB      64 KB      64 KB
L2                8 MB       8 MB       8 MB
peak Mflops      2*MHz      4*MHz      2*MHz
peak mem BW     5.2 GB/s    1.6 GB/s   2.4 GB/s
The Sparc3 is running Solaris 8 (SunOS 5.8). We re-compiled and re-ran our
tests with Forte Developer 6 (update 1).
Sparc-III versions
SunOS roadrunner 5.8 Generic_108528-03 sun4u sparc SUNW,Sun-Blade-1000
cc: Sun WorkShop 6 update 1 C 5.2 2000/08/14
f90: Sun WorkShop 6 update 1 Fortran 95 6.1 2000/08/14
BENCHMARKS
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of the Alpha SC cluster.
Some of the older benchmarks may need to be modified for these
newer, faster machines: increasing repetitions to avoid zero elapsed times,
and increasing problem sizes to test out-of-cache performance.
Unless otherwise noted, the following compiler switches
were used:
Sparc3: -fast -dalign -xarch=native -xO5
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries,
sunperf (Sun), cxml (Alpha)
and essl (SP).
We used the following benchmarks in our tests:
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalue, FFT, QR).
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
The web site includes results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For the Sun, Alpha, and SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
All have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits, so it rolls over in less than 7 seconds.
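For illustration, the timing harness in our custom benchmarks amounts to
the following (a minimal sketch; the repetition count is what keeps the
elapsed time well above the clock resolution on these fast machines):

    #include <stdio.h>
    #include <sys/time.h>

    /* wall-clock seconds from gettimeofday() (microsecond resolution) */
    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    int main(void)
    {
        int i, reps = 1000000;      /* enough repetitions to avoid 0 elapsed times */
        volatile double x = 1.0;    /* volatile so the loop is not optimized away */
        double t0, t1;

        t0 = walltime();
        for (i = 0; i < reps; i++)
            x = x * 1.000001;       /* operation being timed */
        t1 = walltime();
        printf("%g us per operation\n", 1.0e6 * (t1 - t0) / reps);
        return 0;
    }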
LOW LEVEL BENCHMARKS
The following table compares the performance of the Alpha and SP
for basic CPU operations.
These numbers (peak average Mflops)
are from the first 14 kernels of EuroBen's mod1ac.
The 14th kernel (the 9th-degree polynomial) is a rough estimate of peak
FORTRAN performance since it has a high re-use of operands.
alpha sp sparc3
broadcast 516 368 310
copy 324 295 272
addition 285 186 212
subtraction 288 166 204
multiply 287 166 200
division 55 64 44
dotproduct 609 655 672
X=X+aY 526 497 448
Z=X+aY 477 331 350
y=x1x2+x3x4 433 371 353
1st ord rec. 110 107 45
2nd ord rec. 136 61 73
2nd diff 633 743 714
9th deg. poly 701 709 1393
basic operations (Mflops) euroben mod1ac
The following table compares the performance of various intrinsics
(EuroBen mod1f).
          alpha   sp(-O4)   sparc3
x**y 8.3 1.8 3.1
sin 13 34.8 23.4
cos 12.8 21.4 16.6
sqrt 45.7 52.1 27.1
exp 15.8 30.7 29.8
log 15.1 30.8 28.9
tan 9.9 18.9 5.9
asin 13.3 10.4 4.4
sinh 10.7 2.3 2.2
intrinsics (Mcalls/s) euroben mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix multiply (REAL*8, 400x400) with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lessl for the SP, -lsunperf for the Sparc).
The Mflops for the 1000x1000 Linpack benchmark, as reported
at netlib, are also included.
alpha sp sparc3
ftn 71.7 45.2 168
lib 1181.5 1320.5 640
linpack 1031 1236 -
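For reference, the library kernel is just a call to the standard BLAS
DGEMM; a minimal sketch from C follows (linked with -lsunperf, -lcxml, or
-lessl; the trailing-underscore name and column-major layout assume the
usual Fortran calling convention, which may vary by compiler):

    #define N 400

    /* Fortran BLAS: C = alpha*A*B + beta*C */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    static double a[N*N], b[N*N], c[N*N];

    void matmul(void)
    {
        int n = N;
        double one = 1.0, zero = 0.0;
        /* 2*N^3 flops; divide by the elapsed time for the Mflops rate */
        dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
    }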
In the following graph, the performance of the
ATLAS DGEMM is compared with the vendor libraries.
Notice that the ATLAS library outperforms the Sun sunperf library.
The ATLAS sparc3 build does not use -fast.
The following table compares
optimized FORTRAN performance for Euroben mod2a,
matrix-vector dot product and product.
------------------------------------------------------------------------
                    MxV-ddot (Mflops)          MxV-axpy (Mflops)
   m   |   n   |  alpha |   sp   | sparc3 |  alpha |   sp   | sparc3 |
------------------------------------------------------------------------
  100  |  100  |  411.7 |  423.9 |  332.9 |  101.9 |  401.9 |  359.4 |
  200  |  200  |  442.3 |  416.8 |  322.2 |  227.4 |  421.1 |  318.0 |
  500  |  500  |   66.1 |   18.7 |  306.4 |  205.4 |  411.9 |  299.0 |
 1000  | 1000  |   31.8 |   17.1 |  251.1 |  205.6 |  274.5 |  262.0 |
 2000  | 2000  |   27.5 |   16.1 |  136.0 |   66.9 |  207.9 |  139.5 |
------------------------------------------------------------------------
The following table compares the single
processor performance (Mflops) of the
Alpha, SP, and Sparc3 for EuroBen mod2g,
a 2-D Haar wavelet transform test.
|----------------------------------------------------
| Order | alpha | SP | sparc3 |
| n1 | n2 | (Mflop/s) | (Mflop/s) | (Mflop/s)|
|----------------------------------------------------
| 16 | 16 | 142.56 | 79.63 | 86.36 |
| 32 | 16 | 166.61 | 96.69 | 85.82 |
| 32 | 32 | 208.06 | 115.43 | 98.12 |
| 64 | 32 | 146.16 | 108.74 | 96.53 |
| 64 | 64 | 111.46 | 111.46 | 84.15 |
| 128 | 64 | 114.93 | 101.49 | 89.63 |
| 128 | 128 | 104.46 | 97.785 | 92.73 |
| 256 | 128 | 86.869 | 64.246 | 72.89 |
| 256 | 256 | 71.033 | 44.159 | 54.74 |
| 512 | 256 | 65.295 | 41.964 | 55.30 |
|----------------------------------------------------
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library
(cxml/essl/sunperf).
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
for both optimized FORTRAN and using the BLAS from the vendor library.
For the Alpha, -O4 optimization failed, so this data uses -O3.
For the Sun, -fast failed with the Forte update 1 compiler, so this
plot is without -fast (the older compiler version was faster with -fast).
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
Memory performance
The Sparc, SP, and Alpha have 64 KB L1 caches and 8 MB L2 caches.
The following figure shows the data rates for a simple
FORTRAN load loop (y = y + x(i)) for different vector sizes.
For the Sparc we compare with and without the -fast compiler
option.
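The loop itself is trivial; a C rendering of the load test is sketched
below (the returned sum keeps the loads live, and n is swept from
in-cache sizes to well beyond the 8 MB L2):

    /* load test: y = y + x(i); data rate is 8*n bytes over the elapsed time */
    double load_sum(const double *x, int n)
    {
        int i;
        double y = 0.0;
        for (i = 0; i < n; i++)
            y += x[i];
        return y;
    }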
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
The following table shows the memory data rates for a single
processor.
Stream 1 CPU       alpha        sp3       sparc3
Function                  Rate (MB/s)
Copy:           1090.6601   598.9804   418.7604
Scale:           997.5083   576.2223   526.0903
Add:            1058.0155   770.8110   406.0707
Triad:          1133.4106   780.0816   447.6192
stream (C) memory throughput
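The stream kernels are simple vector operations over arrays much larger
than the caches; the triad, for example, looks like this (a sketch; the
real benchmark times each kernel over several trials and reports the best):

    #define STREAM_N 2000000    /* 16 MB per array, well past the 8 MB L2 */
    static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

    void triad(double scalar)
    {
        int j;
        /* 24 bytes of memory traffic per iteration: two loads, one store */
        for (j = 0; j < STREAM_N; j++)
            a[j] = b[j] + scalar * c[j];
    }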
The aggregate data rate for multiple threads (f90/OpenMP) is reported
in the following table (input arguments: threads*2000000,0,10).
The last two columns are from an explicitly threaded C code.
copy scale add triad ddot x+y
sparc1 490 419 399 364 248 381
sparc2 780 670 659 614 466 639
alpha1 1339 1265 1273 1383 1376 1115
alpha2 1768 1711 1839 1886 1852 1729
SP 1 523 561 581 583 1080 729
SP 2 686 797 813 909 1262 923
stream (f90/omp) multiple threads (aggregate MB/sec)
The following figure shows the Mflops for one processor at various
problem sizes for EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache boundaries are still apparent.
The hint
benchmark measures computation and memory efficiency as
the problem size increases.
The following graph shows the performance of a single processor
for the Alpha (66.9 MQUIPS), Sun (42.7 MQUIPS), and SP (27.3 MQUIPS).
The L1 and L2 cache boundaries are visible.
The runtime for a FORTRAN molecular dynamics code,
mdbnch, on the
Sparc3 was 5.3 seconds (4.2 seconds on the SP, 3.6 seconds on the Alpha).
The
lmbench benchmark
measures various UNIX and system characteristics.
Here are some preliminary numbers
for runs on the sparc3, alpha, and SP.
The cache/memory latencies reported by lmbench are
alpha sp3 sparc3
L1 4 5 2
L2 27 32 17
memory 210 300 180
latency in nanoseconds
Open/close times in lmbench
are much slower for the Alpha, though file create/delete
are faster on the Alpha.
EuroBen's mod3a tests matrix computation with file I/O (out
of core).
The following tables compare mod3a performance for the Sparc3, Alpha, and SP.
No attempt was made to optimize I/O performance.
Mod3a: Out-of-core Matrix-vector multiplication
Alpha
--------------------------------------------------------------------------
Row | Column | Exec. time | Mflop rate | Read rate | Write rate |
(n) | (m) | (sec) | (Mflop/s) | (MB/s) | (MB/s) |
--------------------------------------------------------------------------
25000 | 20000 | 0.56200E-01| 17.793 | 153.63 | 33.945 |
50000 | 20000 | 0.13700 | 14.598 | 117.32 | 35.905 |
100000 | 100000 | 0.67409 | 29.668 | 141.19 | 35.884 |
250000 | 100000 | 2.6982 | 18.531 | 117.61 | 35.770 |
--------------------------------------------------------------------------
Sparc3
--------------------------------------------------------------------------
25000 | 20000 | 0.12826 | 7.7968 | 59.021 | 22.935 |
50000 | 20000 | 0.30209 | 6.6205 | 59.577 | 6.0399 |
100000 | 100000 | 1.4451 | 13.840 | 61.763 | 12.818 |
250000 | 100000 | 5.6117 | 8.9098 | 53.766 | 14.579 |
--------------------------------------------------------------------------
SP
--------------------------------------------------------------------------
25000 | 20000 | .81841 | 1.2219 | 244.76 | .27172 |
50000 | 20000 | 1.6479 | 1.2136 | 244.61 | .26217 |
100000 | 100000 | 1.4766 | 13.544 | 241.12 | .84673 |
250000 | 100000 | 3.6024 | 13.879 | 239.51 | 1.1294 |
--------------------------------------------------------------------------
Three simple I/O tests were used to write and read a 100 MB file
and a 1 GB file using 8 KB blocks on the 18 GB SCSI drive of the Sparc3.
                 100 MB             1 GB
 Test         Write   Read      Write   Read
 Bonnie        22.1  408.2       23.3   62.9
 iozone        19.5  405.9       23.8   70.5
 thdio*        18.6  374.4       19.8   67.6
 data rate (MB/s)     * thdio uses fsync() before close on write
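The thdio write test is essentially the loop below (a sketch, assuming a
100 MB file in 8 KB blocks; the fsync() before close is what keeps
buffered data from inflating the write rate):

    #include <fcntl.h>
    #include <unistd.h>

    #define BLK   8192
    #define NBLK  (100*1024*1024/BLK)   /* 100 MB in 8 KB blocks */

    void write_test(const char *path)
    {
        static char buf[BLK];
        int i, fd;

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (i = 0; i < NBLK; i++)
            write(fd, buf, BLK);
        fsync(fd);      /* flush to disk before the clock stops */
        close(fd);
    }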
SHARED-MEMORY BENCHMARKS
Both the Alpha and SP consist of a cluster of shared-memory nodes,
each node with four processors sharing a common memory.
We tested shared-memory performance with various
C programs with explicit thread calls and with FORTRAN OpenMP codes.
The following table shows the performance of thread create/join in C
for two processors.
The test repeatedly creates and joins threads.
Often it is more efficient to create the threads once and then
provide them work as needed; I suspect this is what FORTRAN OpenMP
does for "parallel do".
The table also includes the time for an iterative test of a FORTRAN
parallel do on two processors.
(A sketch of the create/join test follows the table.)
alpha SP sparc3
C threads 47.7 96 52
FORTRAN do 2.1 12.7 5.3
thread create/join time in microseconds
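The create/join test is just the loop below (a minimal sketch; each
iteration pays the full cost of creating and reaping a thread, which is
why a pre-created worker pool is cheaper):

    #include <pthread.h>

    static void *nullwork(void *arg) { return arg; }

    /* time this loop and divide by reps for the create/join cost */
    void create_join(int reps)
    {
        pthread_t tid;
        int i;
        for (i = 0; i < reps; i++) {
            pthread_create(&tid, NULL, nullwork, NULL);
            pthread_join(tid, NULL);
        }
    }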
The following table shows the time required to lock-unlock
using pthread_mutex_lock with varying numbers of threads.
threads alpha sp sparc3
1 0.26 0.57 0.22
2 1.5 1.7 1.6
time for lock/unlock (us)
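The lock test just times the pair of calls in a tight loop (sketch;
contention appears when more than one thread runs the loop):

    #include <pthread.h>

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

    void lock_unlock(int reps)
    {
        int i;
        for (i = 0; i < reps; i++) {
            pthread_mutex_lock(&mtx);
            pthread_mutex_unlock(&mtx);
        }
    }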
The following table compares the performance of a simple C barrier
program using a single lock and spinning on a shared variable
along with pthread_yield.
A version based on condition variables was an order of magnitude slower.
threads alpha sp sparc3
1 0.25 0.6 0.27
2 1.36 4.4 1.6
C barrier times (us)
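A minimal sketch of the lock-and-spin barrier follows (the generation
counter lets the barrier be reused; threads yield while spinning so
waiters give up the CPU, with the portable sched_yield() standing in here
for pthread_yield()):

    #include <pthread.h>
    #include <sched.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int count = 0;
    static volatile int generation = 0;
    static int nthreads = 2;

    void barrier(void)
    {
        int my_gen;

        pthread_mutex_lock(&lock);
        my_gen = generation;
        if (++count == nthreads) {      /* last arrival releases the rest */
            count = 0;
            generation++;
            pthread_mutex_unlock(&lock);
            return;
        }
        pthread_mutex_unlock(&lock);
        while (generation == my_gen)    /* spin on the shared variable */
            sched_yield();
    }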
The following table illustrates linear speedup for an embarrassingly
parallel integration.
A C code with explicit thread management is compared with FORTRAN
Open MP.
Both just used -O optimization.
                 FORTRAN                 C threads
 threads   alpha    SP   sparc3    alpha    SP   sparc3
    1       252    102     264      166     52      75
    2       502    204     526      331    104     149
 rectangle rule (Mflops), -O optimization
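For illustration, the explicit-thread version amounts to the following (a
minimal sketch integrating 4/(1+x^2) on [0,1], which gives pi; the actual
integrand and interval counts in our test may have differed):

    #include <pthread.h>
    #include <stdio.h>

    #define N        10000000
    #define NTHREADS 2

    static double partial[NTHREADS];

    static void *integrate(void *arg)
    {
        long id = (long)arg;
        long i;
        double h = 1.0 / N, sum = 0.0;

        for (i = id; i < N; i += NTHREADS) {    /* interleaved strips */
            double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        partial[id] = sum * h;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        long t;
        double pi = 0.0;

        for (t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, integrate, (void *)t);
        for (t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            pi += partial[t];
        }
        printf("pi = %.9f\n", pi);
        return 0;
    }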
The following table illustrates an explicit thread implementation
of Cholesky factorization of a 1000x1000 double precision
matrix in C (-O optimization).
threads alpha sp sparc3
1 150 125 84
2 269 238 159
cholp 1k matrix factor (mflops) -O optimization
The following table compares optimized FORTRAN OpenMP
doing a simple Jacobi iteration.
 problem size       100x100              1000x1000
 threads      alpha    sp  sparc3   alpha    sp  sparc3
    1          4308  3656    3071      27    17     20
    2          8262  5707    5278      42    27     28
 iterations per second
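The kernel being timed is a single Jacobi sweep; a C/OpenMP rendering is
sketched below (the measurements above used FORTRAN OpenMP, so this is
just an illustrative equivalent):

    #include <omp.h>

    #define N 1000      /* 100 for the small case */

    static double u[N][N], unew[N][N];

    void jacobi_sweep(void)
    {
        int i, j;
        /* each thread takes a block of rows; four-point stencil average */
        #pragma omp parallel for private(j)
        for (i = 1; i < N - 1; i++)
            for (j = 1; j < N - 1; j++)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                     u[i][j-1] + u[i][j+1]);
    }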
MESSAGE-PASSING BENCHMARKS
Internode communication can be accomplished with IP, PVM, or MPI.
We report MPI performance using shared-memory on the Sparc3 (MPICH)
and within a node on the SP and Alpha.
The following table summarizes the measured communication
characteristics of the three machines, using an 8-byte message for
latency and a 1 MB message for throughput.
We also include node-to-node times for the SP and Alpha and network (IP)
performance.
(A sketch of the ping-pong test follows the table.)
                     latency          bandwidth
                  (min 1-way, us)      (MB/s)
sparc3 cpu 1.5 521
sparc3 IP-gigE/1500 142 61 (measured with alpha)
alpha node 5.5 198
alpha cpu 5.8 623
alpha IP-sw 123 48
alpha IP-gigE/1500 76 44
alpha IP-100E 70 11
sp node 16.3 139
sp cpu 8.1 512
sp IP-sw 82 46
sp IP-gigE/1500 91 47
sp IP-gigE/9000 136 84
sp IP-100E 93 12
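The latency/bandwidth numbers come from a ping-pong test along these
lines (a minimal sketch: half the averaged round-trip time is the one-way
latency, and message bytes over one-way time gives the bandwidth):

    #include <mpi.h>
    #include <stdio.h>

    static char buf[1 << 20];   /* 8 bytes for latency, 1 MB for bandwidth */

    int main(int argc, char **argv)
    {
        int rank, i, reps = 1000, nbytes = 8;
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency %g us\n", 1.0e6 * (t1 - t0) / (2.0 * reps));
        MPI_Finalize();
        return 0;
    }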
The following graph shows the MPI bandwidth for communications between
two processors measured with EuroBen's mod1h.
Since we have only two Sparc3 processors, we can't say much about the
scaling of MPI communications, but the following table
shows the performance of aggregate communication operations
(barrier, broadcast, sum-reduction) using two processors on the three
architectures.
time (us)
Test sparc3 alpha SP
mpibarrier 8 11 10
mpibcast 3.1 12.5 6.7
mpireduce 4.3 11 9
Links/References
- Worley's Sparc3 PSTSWM results
- ORNL's Dunigan's Alpha/SP parallel performance
- ORNL's Pat Worley's Alpha evaluation
- ORNL CCS pages for the Alpha SC and the SP3
- UT student Jay Patel's results, July 2000
- Sparc3 announcement
- Sun Sparc 3 and Solaris
- Sun performance tuning and Sun HPC docs
- Sun Starfire Enterprise 10000, its performance, and Sun HPC servers
- Compaq Alpha ES40 cluster info, EV6 chip paper, Alpha 21264 hardware
  reference, and compiler writer's guide
- Compaq's AlphaServer performance info
- Alpha's Quadrics switch and the older Meiko fat-tree network
- IBM papers on POWER3, RS/6000 switch performance, the SP2 architecture
  paper, and other SP2 articles
- POWER3 tutorial and IBM SP scientific computing redbook
- IBM's ESSL scientific library, MASS intrinsics, and other optimization
  libraries (MASS, MPI, LAPI, ESSL)
- ParkBench, EuroBen, NAS parallel benchmarks, and hint
- stream benchmark, SPLASH, and lmbench
- PDS: The Performance Database Server (Linpack and such)
- benchmark papers
- ATLAS
- SPEC
- UT's PAPI performance counter API
- Heller's rabbit
- Monitoring Application Performance Using Hardware Counters
- CPU timers (Japanese)
- ANL's MPICH performance