Recent updates:
The results on this page are low-level benchmarks only.
NOTE: The tests on the SGI Altix use early versions of the compilers and libraries, and we expect performance to continue to improve with new releases. We are still discovering optimal compiler settings, etc.
Oak Ridge National Laboratory (ORNL) is currently performing an in-depth evaluation of the Cray X1 system as part of its Evaluation of Early Systems project. The primary tasks of the evaluation are to
ARCHITECTURE |
The Altix has 256 processors sharing memory. There are two system images (2x128), but the memory is globally addressable from all 256 processors (via OpenMP, SHMEM, or MPI). The Cray X1 at ORNL has 128 nodes as of August 2003. Each node has 4 MSPs, each MSP has 4 SSPs, and each SSP has two vector units. The Cray "processor/CPU" in the results below is one MSP. All 4 MSPs on a node share memory. The Power4 consists of one node with 32 processors (4 MCMs) sharing memory. Both the Alpha and SP3 consist of four processors sharing memory on a single node. The following table summarizes the main characteristics of the machines.
BENCHMARKS |
We have used widely available benchmarks in combination with our own custom benchmarks to characterize the performance of the X1. Some of the older benchmarks may need to be modified for these newer, faster machines -- increasing repetitions to avoid 0 elapsed times, and increasing problem sizes to test out-of-cache performance. Unless otherwise noted, the following compiler switches were used on the Alpha and SP.
MEMORY PERFORMANCE |
The X1 MSP has a 2 MB L2 cache shared among the four SSPs. Both the SP3 and the Alpha have 64 KB L1 caches and 8 MB L2 caches. The SP4 has a 32 KB L1 (FIFO), a 1.4 MB L2 (shared between 2 processors), and a 128 MB L3. The following figure shows the data rates for simple FORTRAN loops that load (y = y + x(i)), store (y(i) = 1), and copy (y(i) = x(i)), for different vector sizes. Data is also included for four threads. (Beware of the linear interpolation between data points, and note that we need to extend the test beyond 128 MB to get out of the SP4 L3 cache. It has been suggested that the SP4 "dcbz" instruction, which allocates the target cache line in the L2 without loading it from memory first, could further improve SP4 performance. Also see McCalpin's stream2 benchmark.) (Revised 12/1/03)
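A minimal C analogue of those loops (the originals are FORTRAN) looks roughly like the sketch below; the timing calls and the sweep over vector sizes are omitted, and the vector length and repetition count shown are arbitrary.

    /* Sketch of the load/store/copy kernels described above.             */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1 << 20;              /* vector length; varied in the real test */
        int rep = 100;                /* repetitions to avoid 0 elapsed times   */
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        double s = 0.0;

        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        for (int r = 0; r < rep; r++) {
            for (int i = 0; i < n; i++) s += x[i];       /* load:  y = y + x(i) */
            for (int i = 0; i < n; i++) y[i] = 1.0;      /* store: y(i) = 1     */
            for (int i = 0; i < n; i++) y[i] = x[i];     /* copy:  y(i) = x(i)  */
        }

        printf("%f %f\n", s, y[n - 1]);  /* keep the compiler from discarding work */
        free(x); free(y);
        return 0;
    }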
The MAPS benchmark also characterizes memory access performance. Plotted are load/store bandwidth for sequential (stride 1) and random access. Load is calculated from s = s + x(i)*y(i) and store from x(i) = s. Revised 10/20/03
The tabletoy benchmark (C) makes random writes of 64-bit integers in a shared-memory table; parallelization is permitted with possibly non-coherent updates. The X1 number is for vectorizing the inner loop (multistreaming was an order of magnitude slower, 88 MB/s). The data rate in the following table is for a 268 MB table. We include multi-threaded Altix, Opteron, SP3 (NERSC), and SP4 data as well. Revised 10/21/03
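For reference, the core of a tabletoy-style update loop can be sketched in C as below; the table size matches the 268 MB case, but the random-number recurrence and the update count are illustrative, not the benchmark's own.

    /* Random 64-bit updates scattered over a large table; lost (non-coherent)
       updates are tolerated when the loop is parallelized. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_WORDS (1UL << 25)       /* 2^25 64-bit words, about 268 MB */

    int main(void)
    {
        uint64_t *table = malloc(TABLE_WORDS * sizeof *table);
        uint64_t ran = 1;
        for (uint64_t i = 0; i < TABLE_WORDS; i++) table[i] = i;

        for (uint64_t i = 0; i < 4 * TABLE_WORDS; i++) {
            ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
            table[ran % TABLE_WORDS] ^= ran;     /* random 64-bit write */
        }

        printf("%llu\n", (unsigned long long)table[0]);
        free(table);
        return 0;
    }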
The stream benchmark is a program that measures main memory throughput for several simple operations. The aggregate data rate for multiple threads is reported in the following table. Recall that the "peak" memory data rate is 24 GB/s per MSP for the X1, 5.2 GB/s for the Alpha, 51 GB/s per MCM for the p690, and 1.6 GB/s for the SP3. Data for the 16-way SP3 (375 MHz, Nighthawk II) at NERSC is included too. Data for the Alpha ES45 (1 GHz, 8 GB/s memory bandwidth) is obtained from the streams database. Data for the p690/SP4 is with affinity enabled (6/1/02). The X1 uses (aprun -A), and we include the data rates for a single SSP as well as the aggregate rate for 4 SSPs running separate copies of the single-stream test. (Revised 10/20/03)
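The triad kernel at the heart of STREAM is simply a(i) = b(i) + q*c(i); a simplified OpenMP version in C is sketched below (the array size is a placeholder, and the real benchmark also times copy, scale, and add and repeats each kernel).

    /* Simplified STREAM triad with OpenMP threading. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 20000000                 /* large enough to exceed the caches */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        double q = 3.0;

        #pragma omp parallel for
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];    /* triad */
        t = omp_get_wtime() - t;

        /* three 8-byte arrays are moved per element */
        printf("triad: %.1f MB/s\n", 3.0 * 8.0 * N / t / 1.0e6);
        free(a); free(b); free(c);
        return 0;
    }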
We have included our X1 results for 16 MSPs in the following table from the stream top 20 (10/7/03); rates are in MB/s.
Machine ID ncpus COPY SCALE ADD TRIAD
------------------------------------------------------------------------
NEC_SX-7 32 876174.7 865144.1 869179.2 872259.1
NEC_SX-5-16A 16 607492.0 590390.0 607412.0 583069.0
SGI_Altix_3000 256 414573.0 412108.0 485323.0 488274.0
NEC_SX-4 32 434784.0 432886.0 437358.0 436954.0
HP_AlphaServer_GS1280 64 347712.3 341890.6 373126.5 377727.8
Cray_T932_321024-3E 32 310721.0 302182.0 359841.0 359270.0
CRAY X1 MSP 16 306891.9 296893.8 334403.7 311499.9
NEC_SX-6 8 202627.2 192306.2 190231.3 213024.3
Cray_C90 16 105497.0 104656.0 101736.0 103812.0
SGI_Origin3800-500 256 87019.5 85514.4 101695.6 99680.2
HP_Integrity_SuperDome 64 82695.0 82476.0 83013.0 84223.0
IBM_eServer_p690+ 32 51455.0 53425.0 58651.0 58891.0
Sun_F15K 72 54665.4 47703.7 46090.7 50724.3
SGI_Origin2000-250 256 42824.2 43213.5 48285.8 49275.5
Cray_SV1ex 32 42317.8 42237.9 47829.8 47821.9
HP_AlphaServer_ES80 8 39898.0 40532.0 44519.0 44467.0
IBM_eServer_p670+ 16 32947.0 33673.0 35925.0 36818.0
IBM_eServer_p690_Turbo 32 28611.0 28994.0 32222.0 32249.0
Cray_Y-MP 8 19291.6 19294.2 26588.9 26802.2
HP_SuperDome_750 64 25762.3 21769.9 25675.0 26549.2
IBM_eServer_p690_HPC 16 20267.0 20265.0 24706.0 25058.0
The following graph illustrates the effect of various strides on the memory bandwidth of triad.
The hint benchmark measures computation and memory efficiency as the problem size increases. (This is C hint version 1, 1994.) The following graph shows the performance of a single processor for the X1 (12.2 MQUIPS), Alpha (66.9 MQUIPS), Altix (102.6 MQUIPS), and SP4 (74.9 MQUIPS). The L1 and L2 cache boundaries are visible, as well as the Altix and SP4's L3. Revised 10/20/03
The lmbench benchmark measures various UNIX and system characteristics. Here are some preliminary numbers (revised 10/20/03) for runs on a service node and a compute node, as well as the Alpha and SP3/SP4 (lmbench version 2).
LOW LEVEL BENCHMARKS (single processor) |
The following table compares the performance of the X1, Alpha, and SP for basic CPU operations. These numbers are from the first few kernels of EuroBen's mod1ac. The 14th kernel (9th degree poly) is a rough estimate of peak FORTRAN performance since it has a high re-use of operands. Revised 10/20/03
The following table compares the performance of various intrinsics (EuroBen mod1f). Revised 10/20/03
Rice's libgoto_it2-r0.7 and the Intel Math Kernel Library (mkl) get higher DGEMM performance than -lscs, as illustrated in the following plot. Revised 10/20/03
The following graph compares the vendor library implementation of an LU factorization (DGETRF) using partial pivoting with row interchanges. Revised 10/20/03
The following plots the performance of DAXPY for the various architectures and using the various runtime libraries on the Altix. The effect of the cache is apparent for both the X1 (2 MB) and Altix (6 MB). Revised 2/9/04
The following graph compares optimized FORTRAN performance (no sci/essl/cxml) for Euroben mod2a, matrix-vector dot product and product. Revised 10/20/03
The following table compares the single processor performance (Mflops) of the Alpha and IBMs for the Euroben mod2g, a 2-D Haar wavelet transform test. (Revised 10/20/03)
The following plots the performance (Mflops) of Euroben mod2d, a dense eigenvalue test, for both optimized FORTRAN and using the BLAS from the vendor library. For the Alpha, -O4 optimization failed, so this data uses -O3. Revised 10/20/03
The following plots the performance (iterations/second) of Euroben mod2e, a sparse eigenvalue test (no vendor libraries). At this time, the Cray FORTRAN compiler seems unable to either vectorize or stream this code. Revised 10/20/03
The following figure shows the FORTRAN Mflops for one processor for various problem sizes for EuroBen mod2f, a 1-D FFT (complex to complex). The rate is for the transform only (no initialization time). Revised 10/20/03
The following compares a 1-D FFT using the FFTW benchmark. The Altix uses ecc -O3. We were unable to run FFTW successfully on the Cray X1; we suspect this is in part because FFTW is targeted toward non-vector architectures.
The following graph plots 1-D FFT performance using the vendor library (-lscs, -lsci, or -lessl); initialization time is not included. Revised 10/20/03
MESSAGE-PASSING BENCHMARKS |
Internode communication can be accomplished with IP, PVM, or MPI. We report MPI performance over the Alpha Quadrics network and the IBM SP. Each SP node (4 CPUs) shares a single network interface. However, each CPU is a unique MPI end point, so one can measure both inter-node and intra-node communication. The following table summarizes the measured communication characteristics between nodes of the X1, Alpha, SP3, and SP4. The SP4 is currently based on a Colony switch attached via PCI. Latency is for an 8-byte message. The Altix consists of two 128-CPU images (or "nodes"). Unless otherwise noted, Altix message times are within a "node".
The following graph shows bandwidth for communication between two processors on the same node using MPI from both EuroBen's mod1h and ParkBench comms1. Within a node, shared memory can be used by MPI. Revised 10/21/03
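For reference, a rough C sketch of such an MPI ping-pong (echo) measurement is shown below; the buffer size, repetition count, and output format here are arbitrary, not those of mod1h or comms1.

    /* Two-rank MPI ping-pong: rank 0 sends to rank 1 and waits for the echo;
       one-way bandwidth is derived from the round-trip time. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nbytes = 1 << 20, reps = 100;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = calloc(nbytes, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = MPI_Wtime() - t;
        if (rank == 0)
            printf("one-way bandwidth: %.1f MB/s\n", 2.0 * reps * nbytes / t / 1.0e6);
        free(buf);
        MPI_Finalize();
        return 0;
    }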
The SP4 is presently equipped with a Colony switch for inter-node communication but is limited by the PCI interface at this time (May, 2002). On the Altix, inter-node bandwidth (between the two 128-CPU OS images) is about half of the intra-node bandwidth (see the shared-memory plot below).
The following graph shows the minimum latency (one-way, i.e., half of the round-trip time) for an 8-byte message from CPU 0 to the other CPUs. The red is for our older 64-processor configuration, the green is for 128 CPUs, and the blue is our 2x128 configuration. The Altix MPI uses the distributed shared memory even across the two (128-CPU) system images. Revised 2/14/05
The following figure compares the effect of dplace when running the same latency test.
The HALO benchmark is a synthetic benchmark that simulates the nearest-neighbour exchange of a 1-2 row/column "halo" from a 2-D array. This is a common operation when using domain decomposition to parallelize (say) a finite-difference ocean model. There are no actual 2-D arrays used; instead, the copying of data from an array to a local buffer is simulated and this buffer is transferred between nodes. The following compares the performance of MPI and OpenMP using 9 and 16 processors on the Altix. Revised 10/8/03
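A rough C/MPI sketch of this kind of halo exchange on a 2-D process grid is shown below; it is only illustrative (the benchmark itself simulates the pack/unpack step and times several MPI, SHMEM, and co-array variants), and the buffer length is a placeholder.

    /* Nearest-neighbour halo exchange on a 2-D periodic process grid. */
    #include <mpi.h>

    #define HALO 1024                        /* words in one halo edge buffer */

    int main(int argc, char **argv)
    {
        int np, dims[2] = {0, 0}, periods[2] = {1, 1};
        int left, right, up, down;
        double sendbuf[HALO], recvbuf[HALO];
        MPI_Comm cart;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        MPI_Dims_create(np, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
        MPI_Cart_shift(cart, 0, 1, &left, &right);
        MPI_Cart_shift(cart, 1, 1, &down, &up);

        for (int i = 0; i < HALO; i++) sendbuf[i] = i;   /* "packed" halo data */

        /* exchange with the four neighbours */
        MPI_Sendrecv(sendbuf, HALO, MPI_DOUBLE, right, 0,
                     recvbuf, HALO, MPI_DOUBLE, left,  0, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sendbuf, HALO, MPI_DOUBLE, left,  1,
                     recvbuf, HALO, MPI_DOUBLE, right, 1, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sendbuf, HALO, MPI_DOUBLE, up,    2,
                     recvbuf, HALO, MPI_DOUBLE, down,  2, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sendbuf, HALO, MPI_DOUBLE, down,  3,
                     recvbuf, HALO, MPI_DOUBLE, up,    3, cart, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }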
For comparison, we have included the HALO results for the X1 and ORNL's SP4 in the following table from Wallcraft ('98). (Revised 7/7/03)
The following graph plots the bandwidth for doing an exchange of messages for various message sizes. The MPI implementation uses a repetition of IRECV/SEND/WAIT for each message size; the SHMEM/co-array versions do a repetition of PUTs and then a sync. The Altix MPI run sets MPI_BUFFER_MAX to 2048. Revised 10/21/03.
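A bare-bones C sketch of that MPI exchange loop (post the IRECV, SEND to the partner, then WAIT) might look like the following; the buffer size and repetition count are placeholders.

    /* Pairwise exchange: processor i exchanges with processor i+n/2. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, np, nbytes = 1 << 20, reps = 100;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        int partner = (rank + np / 2) % np;
        char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);
        memset(sbuf, 0, nbytes);
        MPI_Request req;

        for (int r = 0; r < reps; r++) {
            MPI_Irecv(rbuf, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD, &req);
            MPI_Send (sbuf, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }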
The following pair of graphs shows aggregate exchange bandwidth when 1 pair of processors and 64 pairs of processors do an exchange (processor i exchanges with i+n/2). The Altix SHMEM test is probably unrealistic in that data is not invalidated in the cache when the PUTs are repeated in the timing loop. Revised 4/4/04
The following graph compares MPI for the HALO exchange on 4 and 16 processors. Revised 10/8/03
The following table shows the performance of aggregate communication operations (barrier, broadcast, sum-reduction) using one processor per node (N) and all processors on each node (n). Recall that the SP4 has 32 processors per node (the SP3 and Alpha, 4 per node). Communication is between MSPs on the X1 except for the UPC data. Times are in microseconds. (Revised 10/7/03)
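A simple C/MPI sketch of how such collectives can be timed is given below; it uses MPI_Allreduce for the sum-reduction, and the repetition count and output format are our own choices.

    /* Time barrier, an 8-byte broadcast, and an 8-byte sum reduction. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, reps = 1000;
        double x = 1.0, sum, t;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        for (int r = 0; r < reps; r++) MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) printf("barrier   %8.2f us\n", (MPI_Wtime() - t) / reps * 1e6);

        t = MPI_Wtime();
        for (int r = 0; r < reps; r++)
            MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("broadcast %8.2f us\n", (MPI_Wtime() - t) / reps * 1e6);

        t = MPI_Wtime();
        for (int r = 0; r < reps; r++)
            MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0) printf("sum       %8.2f us\n", (MPI_Wtime() - t) / reps * 1e6);

        MPI_Finalize();
        return 0;
    }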
A simple bisection bandwidth test has N/2 processors sending 1 MB messages to the other N/2. (Revised 2/13/05).
The following compares the aggregate MPI bandwidth for processor pairs doing an exchange, where node i exchanges with node i+n/2. For smaller messages, the Altix outperforms the X1. (For X1-n, n represents the number of processors.) The second figure shows the effective per-pair exchange data rate. Revised 3/8/04
Preliminary testing of TCP/IP performance over the local LAN showed that the Altix GigE interfaces could run TCP at 570 Mb/s. We have experimented with Web100/Net100 modifications to the Altix Linux kernel to accelerate wide-area TCP performance.
SHARED-MEMORY BENCHMARKS |
The following table shows the performance of thread create/join in C as the master thread creates two, three, and four threads. The test repeatedly creates and joins the threads.
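The test is essentially the following C sketch; the thread count and repetition count here are illustrative.

    /* Master thread repeatedly creates k threads and joins them. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    static void *work(void *arg) { return arg; }   /* threads do no real work */

    int main(void)
    {
        int k = 4, reps = 1000;
        pthread_t tid[4];
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int r = 0; r < reps; r++) {
            for (int i = 0; i < k; i++) pthread_create(&tid[i], NULL, work, NULL);
            for (int i = 0; i < k; i++) pthread_join(tid[i], NULL);
        }
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("create/join of %d threads: %.1f us\n", k, us / reps);
        return 0;
    }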
The following table shows the time required to lock/unlock using pthread_mutex_lock with various numbers of threads. For the IBMs we use setenv SPINLOOPTIME 5000. Revised 10/20/03
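A stripped-down C version of that lock/unlock loop is sketched below; the timing harness and the SPINLOOPTIME handling are omitted, and the thread and repetition counts are placeholders.

    /* Each thread repeatedly acquires and releases a shared mutex. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define REPS 100000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    static void *locker(void *arg)
    {
        for (int i = 0; i < REPS; i++) {
            pthread_mutex_lock(&lock);
            counter++;                      /* tiny critical section */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&tid[i], NULL, locker, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
        printf("counter = %ld\n", counter);
        return 0;
    }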
The following table compares the performance of a simple C barrier program using a single lock and spinning on a shared variable along with pthread_yield. Revised 10/20/03
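The barrier itself can be sketched in C roughly as follows, assuming a counter and a generation flag protected by a single mutex; sched_yield() stands in here for the non-portable pthread_yield().

    /* Simple spin barrier: one lock plus spinning on a shared variable. */
    #include <pthread.h>
    #include <sched.h>

    #define NTHREADS 4

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int count = 0;
    static volatile int generation = 0;

    static void barrier(void)
    {
        int my_gen;
        pthread_mutex_lock(&lock);
        my_gen = generation;
        if (++count == NTHREADS) {          /* last thread releases the others */
            count = 0;
            generation++;
            pthread_mutex_unlock(&lock);
            return;
        }
        pthread_mutex_unlock(&lock);
        while (generation == my_gen)        /* spin on the shared variable */
            sched_yield();
    }

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000; i++) barrier();
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
        return 0;
    }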
The following graph compares the echo (ping-pong) bandwidth of SHMEM (puts) and MPI (ParkBench comms1) between two Altix CPUs. MPI performance is improved with setenv MPI_BUFFER_MAX 2048. The graph shows performance within a node and between nodes. Revised 10/20/03
The following graph illustrates the aggregate SHMEM bandwidth for a 1 MB put and get to processor 0, which is doing a sleep(). The graph also illustrates the memory contention when one processor is running STREAM (triad data plotted) at the same time as 0 or more processors are doing continuous 1 MB SHMEM puts or gets with the STREAM processor. With previous software, the behavior was quite strange, but now little interference is exhibited. Contrast this with our Cray X1 results. Revised 10/21/03
The following plot illustrates a SHMEM GET hotspot, where one or more processors are all trying to fetch the same 64-bit word from processor 0. The Y-axis is the average time (microseconds) for the SHMEM GET. Updated 7/9/04
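A sketch of the GET hotspot loop using OpenSHMEM-style calls is shown below; the original presumably used the SGI/Cray SHMEM of that era (e.g., start_pes()), and the repetition count is arbitrary.

    /* Every PE except 0 repeatedly fetches the same 64-bit word from PE 0. */
    #include <shmem.h>
    #include <stdio.h>

    long target = 42;                 /* symmetric 64-bit word owned by PE 0 */

    int main(void)
    {
        long local = 0;
        int reps = 10000;

        shmem_init();
        int me = shmem_my_pe();

        shmem_barrier_all();
        if (me != 0)
            for (int r = 0; r < reps; r++)
                shmem_get64(&local, &target, 1, 0);   /* hotspot fetch from PE 0 */
        shmem_barrier_all();

        if (me != 0) printf("PE %d done, value %ld\n", me, local);
        shmem_finalize();
        return 0;
    }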
The following illustrates the time to pass a 64-bit "hot potato" from one processor to the next using SHMEM PUT with each processor spinning on the "volatile shared" variable. The Y-axis is the average time for a single revolution. Updated 7/9/04
The following graph compares the HALO exchange times on 4 and 16 processors. The IBM is using OpenMP and the X1 co-arrays. Revised 10/8/03
HALO performance on 16 processors is illustrated in the next plot.
The following table compares FORTRAN OpenMP for the Altix, Alpha, and SP with co-arrays on the X1 when doing a simple, double-precision Jacobi iteration (1000x1000, tolerance = 10^-6). Revised 10/20/03
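For reference, a plain OpenMP version of such a Jacobi sweep in C might look like the sketch below; the boundary values, the max-norm convergence test, and the iteration cap are our own choices, not those of the benchmark.

    /* OpenMP Jacobi iteration on a 1000x1000 double-precision grid. */
    #include <math.h>
    #include <stdio.h>

    #define N 1000
    #define TOL 1.0e-6

    static double u[N][N], unew[N][N];

    int main(void)
    {
        double err = 1.0;
        for (int j = 0; j < N; j++) u[0][j] = unew[0][j] = 1.0;  /* top edge = 1 */

        for (int it = 0; it < 100000 && err > TOL; it++) {
            err = 0.0;
            #pragma omp parallel for reduction(max:err)
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++) {
                    unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
                    double d = fabs(unew[i][j] - u[i][j]);
                    if (d > err) err = d;
                }
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    u[i][j] = unew[i][j];
        }
        printf("converged, err = %g\n", err);
        return 0;
    }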
PARALLEL KERNEL BENCHMARKS |
Both ParkBench and EuroBen (euroben-dm) have MPI-based parallel kernels. However, the euroben-dm communication model was to have the processes do all of their sends before issuing receives. On the SP, this model resulted in deadlock for the larger problem sizes. The EAGER_LIMIT can be adjusted to make some progress on the SP3, but the deadlocks could not be completely eliminated, so we report only ParkBench MPI results.
The following table shows the MPI parallel performance of the ParkBench LU (64x64x64) and FT (256x256x128) benchmarks for the Altix, X1, Alpha, and SP. This is a small problem size and doesn't permit the X1 to fill the vector pipes. These tests used standard FORTRAN (no vendor libraries). Revised 10/20/03
The following plots the aggregate Mflop performance for ParkBench QR factorization (MPI) of a 1000x1000 double-precision matrix using the vendor scientific libraries (scs/essl/cxml/sci). This benchmark uses BLACS (SCALAPACK). The small problem size results in small vectors and poor X1 performance. These results use the ParkBench version of BLACS. (Revised 10/20/03)
As a further test of SCALAPACK performance, we compare the vendor libraries for matrix multiply (pdgemm) and LU factorization (pdgetrf) of 8000x8000 double-precision matrices using a block size of 32. The Cray X1 does well on the distributed matrix multiply, but not on the LU factorization. The Altix uses -lscs and netlib.org's scalapack library with BLACS/MPI. Revised 10/20/03
The following plot shows the performance of high-performance Linpack (HPL) on 16 processors for the Cray X1 and IBM p690 with MPI and the vendor BLAS. HPL solves a (random) dense linear system in double precision (64 bits) using a two-dimensional block-cyclic data distribution, a right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths, recursive panel factorization with pivot search and column broadcast combined, various virtual panel broadcast topologies, a bandwidth-reducing swap-broadcast algorithm, and backward substitution with look-ahead of depth 1. Cray has reported 90% of peak using SHMEM instead of MPI. Revised 10/20/03
The following graph shows the aggregate Mflops for a multi-grid (MG) kernel from the ParkBench/NAS Parallel Benchmarks. This is for a 256x256x256 doubleword grid with MPI and Wallcraft's co-array version, and also OpenMP on the IBM. Revised 10/20/03
The following graph shows the aggregate Mflops for a conjugate gradient (CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP. Revised 10/20/03
We also ran the OpenMP version of the NAS Parallel Benchmarks (PBN-O-3.0b4). The following table compares the performance of three of those benchmarks on the Power4 to the NERSC Power3 (seaborg, 16-way shared memory, 375 MHz). Revised 10/20/03
links |
Sponsored by the Mathematical, Information, and Computational Sciences Division, within the Office of Advanced Scientific Computing Research of the Office of Science, Department of Energy. The application-specific evaluations are also supported by the sponsors of the individual applications research areas.