Recent updates:
The results on this page are just low-level benchmarks; see Worley's Cray X1 evaluation page for higher-level benchmarks and application results.
NOTE: The tests on the Cray X1 use early versions of the compilers and libraries, and we expect performance to continue to improve with new releases. We are still discovering optimal compiler settings. Results below are for 1 MSP, the minimum addressable MPI unit, which also implies that 4 SSPs are being used for "streaming", since the compiler does streaming by default.
Oak Ridge National Laboratory (ORNL) is currently performing an in-depth evaluation of the Cray X1 system as part of its evaluation of early systems project. The primary tasks of the evaluation are to
ARCHITECTURE |
The Cray X1 at ORNL has 128 nodes as of August 2003. Each node has 4 MSPs, each MSP has 4 SSPs, and each SSP has two vector units. The Cray "processor/CPU" in the results below is one MSP. All 4 MSPs on a node coherently share memory, and all memory is global (shmem). The Power4 consists of one node with 32 processors (4 MCMs) sharing memory. Both the Alpha and SP3 consist of four processors sharing memory on a single node. The following table summarizes the main characteristics of the machines.
BENCHMARKS |
We have used widely available benchmarks in combination with our own custom benchmarks to characterize the performance of the X1. Some of the older benchmarks may need to be modified for these newer, faster machines, for example by increasing repetitions to avoid 0 elapsed times or increasing problem sizes to test out-of-cache performance. Unless otherwise noted, the following compiler switches were used on the Alpha and SP.
MEMORY PERFORMANCE |
The X1 MSP has a 2 MB L2 cache shared among the four SSPs. Both the SP3 and the Alpha have 64 KB L1 caches and 8 MB L2 caches. The SP4 has a 32 KB L1 (FIFO), a 1.4 MB L2 (shared between 2 processors), and a 128 MB L3. The following figure shows the data rates for a simple FORTRAN loop to load (y = y+x(i)), store (y(i)=1), and copy (y(i)=x(i)), for different vector sizes. Data is also included for four threads. (Beware of the linear interpolation between data points, and note that we need to extend the test beyond 128 MB to get out of the SP4 L3 cache. It has been suggested that the SP4 "dcbz" instruction, which allocates the target cache line in the L2 without first loading it from memory, could further improve SP4 performance. Also see McCalpin's stream2 benchmark.) (Revised 4/10/03)
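As a rough illustration of what the three loops touch and how a data rate is derived from them, here is a minimal Python sketch. The function names and the 8-byte element size are assumptions made for the example, and interpreted Python will come nowhere near the rates in the figure.

```python
import time
from array import array

BYTES_PER_ELEMENT = 8  # assumption: 64-bit floating-point elements


def load(x):
    """y = y + x(i): reads every element of x into a scalar sum."""
    y = 0.0
    for value in x:
        y = y + value
    return y


def store(y):
    """y(i) = 1: writes every element of y."""
    for i in range(len(y)):
        y[i] = 1.0


def copy(x, y):
    """y(i) = x(i): one read and one write per element."""
    for i in range(len(x)):
        y[i] = x[i]


def data_rate_mb_s(bytes_moved, seconds):
    """Convert bytes moved in a timed region to MB/s (1 MB = 10**6 bytes)."""
    return bytes_moved / seconds / 1.0e6


def time_copy(n):
    """Time one copy of n elements and return the rate in MB/s."""
    x = array("d", [2.0]) * n
    y = array("d", [0.0]) * n
    start = time.perf_counter()
    copy(x, y)
    elapsed = time.perf_counter() - start
    # copy moves 2 * n elements: n loads plus n stores
    return data_rate_mb_s(2 * n * BYTES_PER_ELEMENT, elapsed)
```

Calling time_copy over a range of n values gives one curve of rate against vector size; a real test would repeat each size several times to avoid zero elapsed times.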
The MAPS benchmark also characterizes memory access performance. Plotted are load/store bandwidth for sequential (stride 1) and random access. Load is calculated from s=s+x(i)*y(i) and store from x(i)=s. For comparison we include the NEC SX-6 data from the MAPS website. (Revised 4/8/03).
The tabletoy benchmark (C) makes random writes of 64-bit integers; parallelization is permitted, with possibly non-coherent updates. The X1 number is for vectorizing the inner loop (multistreaming was an order of magnitude slower, at 88 MB/s). The data rate in the following table is for a 268 MB table. We include multi-threaded Altix, Opteron, SP3 (NERSC), and SP4 data as well. Revised 10/21/03
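The access pattern described above can be sketched as follows. This is not the tabletoy source: the function names, the seeded generator, and the choice to count 8 bytes per update are assumptions for illustration only.

```python
import random


def random_table_update(table_size, n_updates, seed=1):
    """Write n_updates random 64-bit integers at random table indices."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    table = [0] * table_size
    for _ in range(n_updates):
        index = rng.randrange(table_size)     # random location
        table[index] = rng.getrandbits(64)    # 64-bit value
    return table


def update_rate_mb_s(n_updates, seconds):
    """Data rate in MB/s, counting 8 bytes written per update."""
    return n_updates * 8 / seconds / 1.0e6
```

Because consecutive updates land on unrelated addresses, a test of this kind stresses memory latency and gather/scatter hardware rather than stride-1 bandwidth.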
The stream benchmark is a program that measures main memory throughput for several simple operations. The aggregate data rate for multiple threads is reported in the following table. Recall that the "peak" memory data rate is 24 GB/s per MSP for the X1, 5.2 GB/s for the Alpha, 51 GB/s per MCM for the p690, and 1.6 GB/s for the SP3. Data for the 16-way SP3 (375 MHz, Nighthawk II) at NERSC is included too. Data for the Alpha ES45 (1 GHz, 8 GB/s memory bandwidth) is obtained from the stream database. Data for the p690/SP4 is with affinity enabled (6/1/02). The X1 uses (aprun -A), and we include the data rates for a single SSP as well as the aggregate rate for 4 SSPs running separate copies of the single stream test. All tests use stride 1. (Revised 8/13/03)
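For readers unfamiliar with stream, the four operations and the way bytes are counted for each can be sketched as below. The kernel definitions and the words-per-iteration counts follow the standard stream conventions; the Python function names are assumptions, and the sketch only shows the accounting, not a usable measurement.

```python
BYTES_PER_WORD = 8  # stream operates on 64-bit words

# Words moved per loop iteration (reads plus writes) for each kernel.
WORDS_PER_ITERATION = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_copy(a, c):
    for j in range(len(a)):
        c[j] = a[j]


def stream_scale(c, b, q):
    for j in range(len(c)):
        b[j] = q * c[j]


def stream_add(a, b, c):
    for j in range(len(a)):
        c[j] = a[j] + b[j]


def stream_triad(b, c, a, q):
    for j in range(len(b)):
        a[j] = b[j] + q * c[j]


def stream_rate_mb_s(kernel, n, seconds):
    """Rate in MB/s for one pass of a kernel over n elements."""
    return WORDS_PER_ITERATION[kernel] * BYTES_PER_WORD * n / seconds / 1.0e6


def aggregate_rate_mb_s(per_thread_rates):
    """Aggregate rate for several threads or copies running concurrently."""
    return sum(per_thread_rates)
```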
We have included our X1 results for 16 MSPs in the following table from the stream top 20 (5/27/03). Rates are in MB/s.
Machine ID              ncpus       COPY      SCALE        ADD      TRIAD
-------------------------------------------------------------------------
NEC_SX-7                   32   876174.7   865144.1   869179.2   872259.1
NEC_SX-5-16A               16   607492.0   590390.0   607412.0   583069.0
SGI_Altix_3000            256   414573.0   412108.0   485323.0   488274.0
NEC_SX-4                   32   434784.0   432886.0   437358.0   436954.0
HP_AlphaServer_GS1280      64   347712.3   341890.6   373126.5   377727.8
Cray_T932_321024-3E        32   310721.0   302182.0   359841.0   359270.0
CRAY X1 MSP                16   306891.9   296893.8   334403.7   311499.9  <---
NEC_SX-6                    8   202627.2   192306.2   190231.3   213024.3
Cray_C90                   16   105497.0   104656.0   101736.0   103812.0
SGI_Origin3800-500        256    87019.5    85514.4   101695.6    99680.2
HP_Integrity_SuperDome     64    82695.0    82476.0    83013.0    84223.0
IBM_eServer_p690+          32    51455.0    53425.0    58651.0    58891.0
Sun_F15K                   72    54665.4    47703.7    46090.7    50724.3
SGI_Origin2000-250        256    42824.2    43213.5    48285.8    49275.5
Cray_SV1ex                 32    42317.8    42237.9    47829.8    47821.9
HP_AlphaServer_ES80         8    39898.0    40532.0    44519.0    44467.0
IBM_eServer_p670+          16    32947.0    33673.0    35925.0    36818.0
IBM_eServer_p690_Turbo     32    28611.0    28994.0    32222.0    32249.0
Cray_Y-MP                   8    19291.6    19294.2    26588.9    26802.2
HP_SuperDome_750           64    25762.3    21769.9    25675.0    26549.2
IBM_eServer_p690_HPC       16    20267.0    20265.0    24706.0    25058.0
The following plot illustrates the triad memory bandwidth as a function of stride. On the X1, bandwidth improves with no-caching ( !dir$ no_cache_alloc a,b,c ).
The hint benchmark measures computation and memory efficiency as the problem size increases. (This is C hint version 1, 1994.) The following graph shows the performance of a single processor for the X1 (12.2 MQUIPS), Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS). The L1 and L2 cache boundaries are visible, as well as the Altix and SP4's L3.
The lmbench benchmark measures various UNIX and system characteristics. Here are some preliminary numbers for runs on a service and compute node of the Alpha and SP3/4 (version 2) and an X1 MSP. Some of the X1 numbers are relatively slow or may not be accurate.
LOW LEVEL BENCHMARKS (single processor) |
The following table compares the performance of the X1, Alpha, and SP for basic CPU operations. These numbers are from the first few kernels of EuroBen's mod1ac. The 14th kernel (9th degree poly) is a rough estimate of peak FORTRAN performance since it has a high re-use of operands. We supply both SSP and MSP results for the X1. (Revised 7/30/03)
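To see why a 9th-degree polynomial approximates peak performance, consider the sketch below. It assumes the polynomial is evaluated with Horner's rule, which is an assumption about the kernel rather than something stated here; the function names are also illustrative. Each point needs one load and one store but 18 floating-point operations, so the operands stay in registers and the arithmetic units dominate.

```python
def horner(coefficients, x):
    """Evaluate a polynomial; coefficients are ordered highest degree first."""
    result = coefficients[0]
    for c in coefficients[1:]:
        result = result * x + c  # one multiply and one add per degree
    return result


def flops_per_point(degree):
    """Horner's rule costs one multiply and one add per degree."""
    return 2 * degree


def mflops(degree, n_points, seconds):
    """Mflops for evaluating the polynomial at n_points values."""
    return flops_per_point(degree) * n_points / seconds / 1.0e6
```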
The following table compares the performance of various intrinsics (EuroBen mod1f). For the SP, it also shows the effect of -O4 optimization versus -O3. (Revised 7/30/03)
The following graph compares the vendor library implementation of an LU factorization (DGETRF) using partial pivoting with row interchanges.
The following graph compares the vendor library implementation of DAXPY. The effect of the cache is apparent for both the X1 (2 MB) and Altix (6 MB). Revised 2/9/04
The following graph compares optimized FORTRAN performance (no sci/essl/cxml) for Euroben mod2a, matrix-vector dot product and product. (Revised 4/8/03)
The following table compares the single processor performance (Mflops) of the Alpha and IBMs for the Euroben mod2g, a 2-D Haar wavelet transform test. (Revised 4/8/03)
The following plots the performance (Mflops) of Euroben mod2d, a dense eigenvalue test, for both optimized FORTRAN and using the BLAS from the vendor library. For the Alpha, -O4 optimization failed, so this data uses -O3. (Revised 5/14/04)
The following plots the performance (iterations/second) of Euroben mod2e, a sparse eigenvalue test (no vendor libraries). At this time, the Cray FORTRAN compiler seems unable to either vectorize or stream this code. (Revised 4/9/03)
The following figures show the FORTRAN Mflops for one processor for various problem sizes for the EuroBen mod2f, a 1-D FFT (complex to complex). The first plot includes initialization in the Mflops; the second plot is for the transform only. (Revised 8/1/03).
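The difference between the two plots can be illustrated with a small calculation. The sketch assumes the common convention of counting 5 N log2(N) floating-point operations for a complex transform of length N; whether mod2f uses exactly this count is an assumption, as are the function and parameter names.

```python
import math


def fft_flops(n):
    """Nominal operation count for a complex 1-D FFT of length n (assumed)."""
    return 5.0 * n * math.log2(n)


def fft_mflops(n, transform_seconds, init_seconds=0.0):
    """Mflops for one transform, optionally charging initialization time."""
    return fft_flops(n) / (transform_seconds + init_seconds) / 1.0e6
```

Passing a nonzero init_seconds lowers the reported rate, and the penalty is largest for small transforms where setup is a big share of the total time.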
The following compares a 1-D FFT using the FFTW benchmark. We were unable to run FFTW successfully on the Cray X1, in part, we suspect, because FFTW is targeted toward non-vector architectures.
The following graph plots 1-D FFT performance using the vendor library (-lsci or -lessl); initialization time is not included. Revised 5/14/04
MESSAGE-PASSING BENCHMARKS |
Internode communication can be accomplished with IP, PVM, or MPI. We report MPI performance over the Alpha Quadrics network and the IBM SP. Each SP node (4 CPUs) shares a single network interface. However, each CPU is a unique MPI end point, so one can measure both inter-node and intra-node communication. The following table summarizes the measured communication characteristics between nodes of the X1, Alpha, SP3, and the SP4. The SP4 is currently based on the Colony switch via PCI. Latency is for an 8-byte message. (Revised 11/26/03)
The following graph shows bandwidth for communication between two processors on the same node using MPI from both EuroBen's mod1h and ParkBench comms1. Within a node, shared memory can be used by MPI. (Revised 4/9/03).
The SP4 is presently equipped with a Colony switch for inter-node communication but is limited by the PCI interface at this time (May, 2002). (Revised 4/10/03)
The following graph shows the minimum latency (one-way, i.e., half of RTT) for an 8-byte message from MSP 0 to more distant MSPs. The red is our previous data with 64 MSPs, the green is data from our 128 MSP configuration, the blue is from 11/26/03, and the light blue is on our 512 MSP configuration (6/2/04).
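The latency definition used here can be written down directly. This is a minimal sketch of the arithmetic only; the function names are assumptions, and the round-trip times would come from a ping-pong loop between two processors.

```python
def one_way_latency_us(round_trip_times_us):
    """Minimum one-way latency: half of the smallest measured round trip."""
    return min(round_trip_times_us) / 2.0


def ping_pong_bandwidth_mb_s(message_bytes, round_trip_us):
    """Bandwidth from a ping-pong test; bytes per microsecond equals MB/s."""
    return message_bytes / (round_trip_us / 2.0)
```

Taking the minimum over many repetitions filters out interference from the operating system and other traffic, which is why it is the usual choice for small-message latency.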
The HALO benchmark is a synthetic benchmark that simulates the nearest neighbour exchange of a 1-2 row/column "halo" from a 2-D array. This is a common operation when using domain decomposition to parallelize (say) a finite difference ocean model. There are no actual 2-D arrays used; instead, the copying of data from an array to a local buffer is simulated, and this buffer is transferred between nodes. The following compares the performance of MPI, co-arrays, and SHMEM using 9 and 16 X1 MSPs and OpenMP on 9 and 16 SSPs. Revised 12/17/03.
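The operation that HALO simulates can be pictured with a real 2-D array, even though the benchmark itself does not use one. The following sketch is illustrative only; the function names and the dictionary of four buffers are assumptions, and the actual transfer between neighbours is left out.

```python
def pack_halo(grid, width=1):
    """Copy the edge rows and columns of a 2-D grid into send buffers."""
    north = [row[:] for row in grid[:width]]
    south = [row[:] for row in grid[-width:]]
    west = [row[:width] for row in grid]
    east = [row[-width:] for row in grid]
    return {"north": north, "south": south, "west": west, "east": east}


def halo_words(n_rows, n_cols, width=1):
    """Number of array elements sent to the four neighbours."""
    return 2 * width * n_cols + 2 * width * n_rows
```

For a square subdomain the message size grows with the edge length while the work grows with the area, so small subdomains make the exchange cost dominated by latency.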
For comparison, we have included the HALO results for the X1 and ORNL's SP4 in the following table from Wallcraft ('98). (Revised 12/17/03)
The following graph compares MPI for the HALO exchange on 4 and 16 processors. For smaller message sizes, the IBM outperforms the X1. It is interesting that the X1 times are much higher than its 8-byte message latency. Revised 11/26/03
The following table shows the performance of aggregate communication operations (barrier, broadcast, sum-reduction) using one processor per node (N) and all processors on each node (n). Recall that the SP4 has 32 processors per node (SP3 and Alpha, 4 per node). Communication is between MSPs on the X1 except for the UPC data. Times are in microseconds. (Revised 7/7/03)
A simple bisection bandwidth test has N/2 processors sending 1 MB messages to the other N/2. (Revised 6/2/04).
The following compares the aggregate MPI bandwidth for processor pairs doing an exchange, where node i exchanges with node i+n/2. For smaller messages, the Altix outperforms the X1. (For X1-n, n represents the number of processors.) The second figure shows the effective per-pair bandwidth exchange data rate. Revised 6/2/04
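The pairing rule and the two bandwidth figures can be sketched as follows. The modulo form of the partner rule, the function names, and the convention that every one of the n processors sends one message per timed exchange are assumptions made for the illustration.

```python
def exchange_partner(i, n):
    """Partner of processor i when i exchanges with i + n/2 (n even)."""
    return (i + n // 2) % n


def aggregate_bandwidth_mb_s(n, message_bytes, seconds):
    """Total bytes sent by all n processors divided by elapsed time."""
    return n * message_bytes / seconds / 1.0e6


def per_pair_bandwidth_mb_s(n, message_bytes, seconds):
    """Aggregate bandwidth divided by the number of pairs (n/2)."""
    return aggregate_bandwidth_mb_s(n, message_bytes, seconds) / (n // 2)
```

With this pairing every message crosses the middle of the machine, which is what makes the test a measure of bisection bandwidth.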
External networking
The Cray X1 uses a Linux network service processor connected by IP over Fibre Channel to the X1 and by GigE (jumbo) to the local network. ORNL's Steven Carter has provided the following TCP performance results for the X1 and its network front end. When the X1 does a TCP connect, the window-scale is set to 4, which implies a maximum socket buffer size of 1 MB. For a listening TCP connection, the X1 advertises a window-scale of only 3. UDP data rates are poor for the X1. With 1460-byte datagrams the Cray can only send/receive at about 8 mbs. With 8192-byte datagrams (the network front end is attached to a jumbo-frame GigE net), the X1 UDP rate is only 50 mbs. It almost appears that the X1 has a packets-per-second limit, so the bigger the frames the better.
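The arithmetic behind these observations can be checked with a short sketch. The window formula is the standard TCP window-scale rule (a 16-bit window shifted left by the scale factor); reading "mbs" as megabits per second is an assumption, as are the function names.

```python
def max_window_bytes(window_scale):
    """Largest TCP window for a given window-scale shift count."""
    return 65535 << window_scale


def packets_per_second(rate_bits_per_s, datagram_bytes):
    """Datagram rate implied by a data rate and a payload size."""
    return rate_bits_per_s / (8 * datagram_bytes)


print(max_window_bytes(4))               # 1048560 bytes, about 1 MB
print(max_window_bytes(3))               # 524280 bytes, about 512 KB
print(packets_per_second(8.0e6, 1460))   # roughly 685 datagrams/s
print(packets_per_second(50.0e6, 8192))  # roughly 763 datagrams/s
```

Under that reading of the units, both UDP measurements work out to a similar datagram rate despite the very different payload sizes, which is consistent with a packets-per-second limit.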
Carter combined the Net100/Web100 kernel with Cray's Linux kernel on our secondary X1 network servers (CNS1). Even without using the Net100 tuning daemon, the autotuning feature and the larger windows of the Net100 kernel improved wide area network TCP transfers from the X1 to LBL by a factor of 4.
The best end-user data rates for file transfers (wide-area) are about 400 mbs using bbcp (parallel TCP streams).
SHARED-MEMORY BENCHMARKS |
The X1 pthreads model permits up to 16 SSP threads (-h ssp) or 4 MSP threads where each thread can also be multithreaded on each MSP. The following table shows the performance of thread/join in C as the master thread creates two, three, and four threads. The test repeatedly creates and joins threads. Revised 12/17/03.
The following table compares the performance of a simple C barrier program using a single lock and spinning on a shared variable along with pthread_yield. Revised 12/17/03
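A barrier of this general kind can be sketched in Python threads as below. This is not the C program that was timed: the generation counter, the class and function names, and the use of time.sleep(0) as a stand-in for pthread_yield are all assumptions chosen to show the idea of one lock plus a spin on a shared variable.

```python
import threading
import time


class SpinBarrier:
    """Barrier built from one lock plus a shared generation counter."""

    def __init__(self, n_threads):
        self.n_threads = n_threads
        self.lock = threading.Lock()
        self.waiting = 0
        self.generation = 0  # the shared variable that waiters spin on

    def wait(self):
        with self.lock:
            my_generation = self.generation
            self.waiting += 1
            if self.waiting == self.n_threads:
                # Last arrival: reset the count and release everyone.
                self.waiting = 0
                self.generation += 1
                return
        while self.generation == my_generation:
            time.sleep(0)  # give up the processor while spinning


def run_barrier_test(n_threads, n_rounds):
    """Each thread logs its round number, then waits at the barrier."""
    barrier = SpinBarrier(n_threads)
    log = []
    log_lock = threading.Lock()

    def worker():
        for round_number in range(n_rounds):
            with log_lock:
                log.append(round_number)
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

If the barrier works, no thread can log round r+1 until every thread has logged round r, so the returned log is in nondecreasing order.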
The following graph plots the ping-pong bandwidth between two MSP nodes on the X1 using MPI, co-arrays, and SHMEM. A synch is done after each put and get.
The following plots the aggregate memory bandwidth for different number of MSP's doing a put or get of repeated 1 MB messages to the same target node which is doing a sleep(). Each MSP has its own array on the target. Revised 2/27/04.
The following graphs illustrate the effect of "distance" on gets and puts for both co-arrays and SHMEM. Node 0 is doing puts to and gets from node n-1 in the following plots. Revised 2/27/04
The following plot illustrates the latency in doing a "put" or "get" of a double-word (8 bytes) with coarrays. Note the slower time when the target processor is on the same node -- MSP's share a cache within a node. Revised 3/15/04
The following plot illustrates a SHMEM GET hotspot, where one or more processors are all trying to fetch the same 64-bit word from processor 0. The Y-axis is the average time (microseconds) for the SHMEM GET. Updated 7/9/04
The following illustrates the time to pass a 64-bit "hot potato" from one processor to the next using SHMEM PUT with each processor spinning on the "volatile shared" variable. The Y-axis is the average time for a single revolution. Updated 7/9/04
The following graph plots the bandwidth for doing an exchange of messages for various message sizes. The MPI implementation uses a repetition of IRECV/SEND/IWAIT for each message size; the SHMEM/co-array versions do a repetition of PUTs and then a synch. The X1 MPI is slower than the IBM for small messages. Revised 9/3/03.
The following pair of graphs shows aggregate exchange bandwidth when 1 pair of processors and 64 pairs of processors do an exchange (processor i exchanges with i+n/2). The Altix SHMEM test is probably unrealistic in that data is not invalidated in the cache as the PUTs are repeated in the timing loop. Revised 4/9/04.
The following graph compares the HALO exchange times on 4 and 16 processors. The IBM is using OpenMP and the X1 co-arrays. Revised 11/26/03
HALO performance on 16 processors is illustrated in the next plot.
We have implemented the halo exchange in C (UPC) using subscripting and upc_memget. The following graph compares the performance of halo on 16 X1 MSPs.
The following graph illustrates the memory contention when one X1 processor (MSP) is running STREAM (triad data plotted) at the same time as 0 or more MSPs are doing continuous 1 MB SHMEM puts or gets with the STREAM MSP. Each MSP has its "own" 1 MB area on MSP 0. SHMEM data rate is aggregate. Revised 9/2/03. The data point with 0 processors is for a stand-alone triad. We also provide stand-alone SHMEM put/get aggregate bandwidth from one to N MSPs to an idle MSP (sleep()) using repeated 1 MB messages.
We run the same test using co-arrays in place of SHMEM. Co-arrays provide higher aggregate throughput and seem to show slightly less interference with STREAM. Revised 11/29/04.
The following table compares FORTRAN OpenMP for the Alpha and SP with co-arrays on the X1 (MSP) and OpenMP on 4 MSPs and 4 SSPs on the X1 when doing a simple Jacobi iteration on a 1000x1000 double-precision array (tolerance = 10^-6). Note that the SP3 slows for 4 threads. Revised 12/17/03
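For reference, a serial Jacobi iteration of the kind being parallelized looks roughly like the sketch below. The 5-point stencil, the fixed boundary values, and the stopping test on the largest pointwise change are assumptions; the benchmark's exact stencil and convergence criterion are not given here.

```python
def jacobi(grid, tolerance=1.0e-6, max_iterations=10000):
    """Relax interior points until the largest change is below tolerance."""
    n_rows, n_cols = len(grid), len(grid[0])
    for iteration in range(1, max_iterations + 1):
        new = [row[:] for row in grid]  # boundary values are carried over
        max_change = 0.0
        for i in range(1, n_rows - 1):
            for j in range(1, n_cols - 1):
                # Each interior point becomes the mean of its 4 neighbours.
                new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                    + grid[i][j - 1] + grid[i][j + 1])
                max_change = max(max_change, abs(new[i][j] - grid[i][j]))
        grid = new
        if max_change < tolerance:
            return grid, iteration
    return grid, max_iterations
```

Every interior update within a sweep is independent of the others, so the rows can be divided among threads or images; only the convergence test needs a reduction across them.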
The X1 supports OpenMP on up to 16 SSPs within a node (4 MSPs) when compiled with -O ssp -O task1, or you can compile in MSP mode and use up to 4 OpenMP threads (with OMP_NUM_THREADS or aprun -d 4) and get both streaming and threading (i.e., again up to 16 SSPs). The IBM p690 OpenMP and the SMP version of essl ( -lpessl) support all 32 processors on a p690 node. All of the Altix processors can be used with OpenMP. We've done some testing with the OpenMP microbenchmarks. The following compares OpenMP performance between the X1 (SSPs), SP4, and the SGI Altix. Revised 12/23/03
The following graph compares the library DGEMM using 1 and 4 processors.
PARALLEL KERNEL BENCHMARKS |
Both ParkBench and EuroBen (euroben-dm) had MPI-based parallel kernels. However, the euroben-dm communication model was to have the processes do all of their send's before issuing receive's. On the SP, this model resulted in deadlock for the larger problem sizes. The EAGER_LIMIT can be adjusted to make some progress on the SP3 but the deadlocks could not be completely eliminated, so we report only ParkBench MPI results.
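The failure mode can be captured in a deliberately simplified model. It assumes that a message no larger than the eager limit is buffered by the MPI library so the send returns immediately, while a larger message makes the send wait for a matching receive. The function name and the exact boundary condition are assumptions, and real MPI implementations have further buffering details that this ignores.

```python
def exchange_completes(message_bytes, eager_limit_bytes):
    """Model two processes that each send before posting a receive."""
    # At or below the eager limit the library buffers the message, the
    # send returns, and each process goes on to post its receive.
    # Above the limit each send waits for a receive that the peer,
    # itself blocked in a send, never posts: deadlock.
    return message_bytes <= eager_limit_bytes


def largest_safe_message(message_sizes, eager_limit_bytes):
    """Largest message size in the list that completes, or None."""
    safe = [m for m in message_sizes if exchange_completes(m, eager_limit_bytes)]
    return max(safe) if safe else None
```

In this model, raising the eager limit only moves the threshold, which matches the experience above: larger problems eventually exceed any fixed limit.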
The following table shows MPI parallel performance of the LU benchmark (64x64x64) for the X1, Alpha and SP. This is a small problem size, and it doesn't permit the X1 to fill the vector pipes. These tests used standard FORTRAN (no vendor libraries). (Revised 4/8/03)
The following graph shows the aggregate Mflops for a conjugate gradient (CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP. Revised 9/22/03
The following graph illustrates how longer vectors can improve X1 performance. The NPB FT (A) benchmark (1-D double complex FFT w/MPI) uses a default blocking factor of 16. The graph shows that increasing the blocking to 64 improves X1 performance by a factor of three. For comparison, the effect of the blocksize on scalar multiprocessors is illustrated as well. Tests do not use vendor FFT libs. Revised 12/18/03.
The following plots the aggregate Mflop performance for ParkBench QR factorization (MPI) of a 1000x1000 double-precision matrix using the vendor scientific libraries (essl/cxml/sci). This benchmark uses BLACS (SCALAPACK). Recall that the X1 and SP4 have 16 CPUs sharing memory, so we have included data (sp3-16) from the NERSC 16-way SP3 (375 MHz). The small problem size results in small vectors and poor X1 performance. These results are using the ParkBench version of BLACS. We saw little or no difference using Cray's -lsci BLACS. (Revised 4/10/03)
As a further test of SCALAPACK performance, we compare the vendor libraries for matrix multiply (pdgemm) and LU factorization (pdgetrf) of 8000x8000 double-precision matrices using a blocksize of 32. The Cray X1 does well on the distributed matrix multiply, but not on the LU factorization (D'Azevedo suspects pivoting). Revised 9/21/03.
For comparison, the vendor library single-processor LU performance on a 1000x1000 matrix is 1431 Mflops for the IBM SP4 (-lessl), 1995 Mflops for the Altix (-lscs), and 3543 Mflops for the X1 (-lsci).
A test of SCALAPACK's (MPI) pzswap (double complex) doing both
row and col swaps
on the X1, Altix (libscalapack), and an SP4 (16 processors), shows
In contrast, the following plot shows the performance of high-performance Linpack (HPL) on 16 processors for the Cray X1 and IBM p690 with MPI and the vendor BLAS. HPL solves a (random) dense linear system in double precision (64 bits) using: a two-dimensional block-cyclic data distribution; a right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths; recursive panel factorization with pivot search and column broadcast combined; various virtual panel broadcast topologies; a bandwidth-reducing swap-broadcast algorithm; and backward substitution with look-ahead of depth 1. Cray has reported 90% of peak using SHMEM instead of MPI on the X1. (Revised 7/10/03)
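The two-dimensional block-cyclic distribution that HPL (and SCALAPACK) rely on can be illustrated with a small mapping function. The sketch assumes the first block lives on process row and column 0 and uses hypothetical function names; the process grid shape is a parameter, not something taken from the runs above.

```python
def block_cyclic_owner(global_index, block_size, n_procs):
    """Process coordinate that owns a global row or column index."""
    return (global_index // block_size) % n_procs


def owner_2d(row, col, block_size, proc_rows, proc_cols):
    """(process row, process column) that owns matrix element (row, col)."""
    return (block_cyclic_owner(row, block_size, proc_rows),
            block_cyclic_owner(col, block_size, proc_cols))


# Example: 16 processes arranged as a 4x4 grid with 32x32 blocks.
print(owner_2d(0, 0, 32, 4, 4))      # (0, 0)
print(owner_2d(32, 64, 32, 4, 4))    # (1, 2)
print(owner_2d(128, 160, 32, 4, 4))  # (0, 1)
```

Dealing blocks out cyclically keeps every process busy as the factorization shrinks the active part of the matrix, at the cost of more, smaller messages than a plain block distribution.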
links |
Mathematical, Information, and Computational Sciences Division, within the Office of Advanced Scientific Computing Research of the Office of Science, Department of Energy. The application-specific evaluations are also supported by the sponsors of the individual applications research areas.