Dunigan's Cray X1 Testing

ORNL Cray X1 Evaluation (Dunigan)

.... this is work in progress.... last revised

Recent updates:

10/7 scalapack matrix multiply Mflops was factor of 2 too small
7/9 shmem get hotspot and put hotpotato
6/2 512 MSP tests: MPI latency, bisection, exchange, allreduce, barrier
5/14 re-test with 5.2 ftn and lsci, dgemm mod2d mod2b fft
4/29 halo in upc tests
4/21 upc tests (supported in 5.2), MG tests, allreduce
3/15 coarray latency plot
3/10 pzswap update, column order now supported by -lsci, coarray get/put
3/8 aggregate MPI exchange
2/27 coarray and shmem get/put to distant processors, corrected coarray data in other plots
2/9/04 daxpy graph revised (used repetitions)
1/15/04 daxpy graph
12/26 SysV semaphore times
12/23 correct openmpbench times
12/18 NPB FT effects of vector length
12/17 pthreads tests (slow), openmp (ssp) halo, jacobi, and openmpbench
12/16 triad stride plot, openmp 16 ssp tests
12/2 reran jitter tests; jitter has been greatly reduced since July, so I have removed the plots (here are the old jitter plots)
allreduce up to 250 nodes
11/25 256 nodes, over 8 cabinets (half populated)
mpi test mpilat mpivd halo (lower latency), allreduce, bisection
latency down to 7.3 from 8 us
10/17 pzswap update (bug in -lsci)
9/22 -lsci retests, NAS CG added, LU with -lsci, adjust LU flop count, pzswap update
9/3 revised exchange data, mg data
9/2 revised SHMEM/STREAM contention graph, coarray updates (faster)
8/28 mpi allreduce
8/27 added some Altix results
8/13 stream triad for 32 MSP's, STREAM for 1 and 4 SSPs
8/6 2nd DGEMM graph in shared-memory section, MG update
8/5 128 MSPs, update reduce,bcast, barrier, and min latency plots updated
7/31 vendor lib 1-D FFT
7/30 update mod1ac and mod1f ssp and msp
7/10 high performance linpack (mpi) hpl
7/8 MPI jitter tests
7/7 MPI results for > 32 MSPs (64 MSPs active)
6/10 Carter ports Net100 kernel to X1 network frontend (CNS1), dramatic WAN improvements!
5/16 Carter's TCP results and some UDP numbers
5/13 LMbench results
5/9 upc barrier
5/8 coarray/shmem performance and -p, ed's pzswap scalapack, exchange
5/7 co-array/stream contention
5/6 mod1ac for SSP, shmem/stream contention
4/23 Maps and MG update
4/22 coarray/shmem ping-pong, jacobi revision, scalapack
4/16 MG results, jacobi
4/15 HALO results
4/11 re-run tests after major upgrade 4/8
4/8 MAPS results

The results on this page are low-level benchmarks only; see Worley's Cray X1 evaluation page for higher-level benchmarks and application results.

NOTE: The tests on the Cray X1 use early versions of the compilers and libraries, and we expect performance to continue to improve with new releases. We are still discovering optimal compiler settings. Results below are for 1 MSP, the minimum addressable MPI unit, which also implies that 4 SSPs are being used for "streaming" -- the compiler does streaming by default.

The IBM and HP/Compaq data are from runs in 2001-2002.

Oak Ridge National Laboratory (ORNL) is currently performing an in-depth evaluation of the Cray X1 system as part of its evaluation of early systems project. The primary tasks of the evaluation are to

understand the memory hierarchy and vector capabilities of the X1;
determine the most effective approaches to using the X1;
evaluate benchmark and application performance, and compare with similar systems from other vendors;
evaluate system and system administration software reliability and performance;
predict scalability, both in terms of problem size and in number of processors.

The emphasis of the evaluation is on application-relevant studies for applications of importance to DOE. However, standard benchmarks are still important for comparisons with other systems. The results presented here are from standard benchmarks and some custom benchmarks and, as such, represent only one part of the evaluation. An IBM SP4, IBM Winterhawk II (noted as SP3 in the following tables and graphs) and Compaq Alpha ES40 at ORNL were used for comparison with the X1. The results below are in the following categories:

architecture -- configuration summaries
benchmarks -- benchmark descriptions
memory performance -- memory performance
low-level results -- base CPU and intrinsics results
shared-memory results -- single node/thread performance
message passing results -- latency/bandwidth, collectives
kernels -- application kernel results

ARCHITECTURE

The Cray X1 at ORNL has 128 nodes as of August 2003. Each node has 4 MSPs, each MSP has 4 SSPs, and each SSP has two vector units. The Cray "processor/CPU" in the results below is one MSP. All 4 MSPs on a node coherently share memory, and all memory is global (shmem). The Power4 consists of one node with 32 processors (4 MCMs) sharing memory. Both the Alpha and SP3 consist of four processors sharing memory on a single node. The following table summarizes the main characteristics of the machines.

Specs:          Altix   Alpha SC      SP3          SP4             X1
------------------------------------------------------------------------
MHz              1500        667      375         1300            800
memory/node     512GB        2GB      2GB         32GB           16GB
L1                32K        64K      64K          32K   16K (scalar)
L2               256K        8MB      8MB        1.5MB  2MB (per MSP)
L3                6MB                            128MB
peak Mflops     4*MHz      2*MHz    4*MHz        4*MHz    12.8 Gflops
peak mem BW    6.4GBs     5.2GBs   1.6GBs   51 GBs/MCM     26 GBs/MSP

Alpha: 2 buses @ 2.6 GBs each (Alpha ES45: 8 GBs).

The X1 MSP can fetch from memory at 25.6 GBs and store at 20.5 GBs. The p690 memory subsystem can provide 205 GBs per node, and the X1 memory can provide 204 GBs per node.

For the Alpha, nodes are interconnected with a Quadrics switch organized as a fat tree. The SP nodes are interconnected with cross-bar switches in an Omega-like network. The X1 uses a modified 2-D torus utilizing (I think) the SGI NUMAlink router chips (the same as the Altix, though the Altix uses a dual fat tree and full cache coherency).
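The "peak Mflops" row expresses peak rate as floating-point operations per cycle times the clock rate. As a quick check of that arithmetic, here is a minimal Python sketch that evaluates the row for each machine. The dictionary and function names are illustrative only; the X1 multiplier of 16 is not in the table but is derived by dividing the quoted 12.8 Gflops by 800 MHz.

```python
# Peak-rate arithmetic for the specs table: peak Mflops = flops/cycle * MHz.
# Names are illustrative; clock rates and multipliers come from the table.

MACHINES = {
    "Altix":    {"mhz": 1500, "flops_per_cycle": 4},
    "Alpha SC": {"mhz": 667,  "flops_per_cycle": 2},
    "SP3":      {"mhz": 375,  "flops_per_cycle": 4},
    "SP4":      {"mhz": 1300, "flops_per_cycle": 4},
    # Derived, not in the table: 12.8 Gflops / 800 MHz = 16 flops/cycle per MSP.
    "X1 MSP":   {"mhz": 800,  "flops_per_cycle": 16},
}


def peak_mflops(name):
    """Return the peak Mflops of one processor of the named machine."""
    machine = MACHINES[name]
    return machine["mhz"] * machine["flops_per_cycle"]


if __name__ == "__main__":
    for name in MACHINES:
        print(f"{name:9s} {peak_mflops(name):6d} Mflops")
```

These per-processor peaks are the denominators to use when reading the Mflops tables further down as a fraction of peak.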

BENCHMARKS

We have used widely available benchmarks in combination with our own custom benchmarks to characterize the performance of the X1. Some of the older benchmarks may need to be modified for these newer, faster machines: increasing repetitions to avoid 0 elapsed times, and increasing problem sizes to test out-of-cache performance. Unless otherwise noted, the following compiler switches were used on the X1, Alpha, and SP.

X1:    -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha: -O4 -fast -arch ev6
SP:    -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000

Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP. We also compared performance with the vendor runtime libraries: sci (X1), cxml (Alpha), and essl (SP). We used the following benchmarks in our tests:

ParkBench 2.1 -- provides low-level sequential and communication benchmarks, parallel linear algebra benchmarks, NAS parallel benchmarks, and compact application codes. Here is a summary of the benchmark modules. Codes are in FORTRAN. Results are often reported as least-squares fit of data. We report actual performance numbers.
EuroBen 3.9 -- provides serial benchmarks for low-level performance and application kernels (linear algebra, eigenvalue, FFT, QR). Here is a summary of the benchmark modules. euroben-dm provides some communication and parallel (MPI) benchmarks. The web site includes results from other systems.
lmbench -- provides insight into OS (UNIX) performance and memory latencies. The web site includes results from other systems.
stream -- measures memory bandwidth for both serial and parallel configurations. Also we use the MAPS memory benchmark. The web sites include results from other systems.
Custom low-level benchmarks that we have used over the years in evaluating memory and communication performance.

For both the Alpha and the SP, gettimeofday() provides microsecond wall-clock time (though one has to be sure the MICROTIME option is set in the Alpha OS kernel). Both have high-resolution cycle counters as well, but the Alpha cycle counter is only 32 bits, so it rolls over in less than 7 seconds. For distributed benchmarks (MPI), the IBM and Alpha systems provide a hardware-synchronized MPI_Wtime() with microsecond resolution. On the Alpha, MPI_Wtime is frequency synchronized, but initial offsets are only approximate. (On the Alpha, it appears MPI_Init tries to provide an initial zero offset to the Elan counters on each node when an MPI job starts. On the SP3, we discovered several nodes that were not synchronized; a patch was eventually provided.) Time is not synchronized on the X1.
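The "less than 7 seconds" figure for the 32-bit Alpha cycle counter follows from the 667 MHz clock in the specs table. The sketch below works through that arithmetic and shows one common way to difference a wrapping counter. The function names are illustrative, and the differencing is only valid when the timed interval is shorter than one rollover period.

```python
# Rollover arithmetic for a fixed-width cycle counter (illustrative names).

def rollover_seconds(counter_bits, clock_mhz):
    """Seconds until a counter of the given width wraps at the given clock."""
    return (2 ** counter_bits) / (clock_mhz * 1e6)


def elapsed_cycles(start, end, counter_bits=32):
    """Cycles between two readings, allowing for at most one wraparound."""
    return (end - start) % (2 ** counter_bits)


if __name__ == "__main__":
    # 2^32 cycles at 667 MHz: about 6.44 seconds.
    print(f"{rollover_seconds(32, 667):.2f} s")
    # A reading taken just before the wrap and one just after it.
    print(elapsed_cycles(0xFFFFFFF0, 0x00000010))
```

Any single timed region longer than the rollover period has to fall back on gettimeofday() or MPI_Wtime().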

MEMORY PERFORMANCE

The X1 MSP has a 2 MB L2 cache shared among the four SSPs. Both the SP3 and the Alpha have 64 KB L1 caches and 8 MB L2 caches. The SP4 has a 32 KB L1 (FIFO), a 1.4 MB L2 (shared between 2 processors), and a 128 MB L3. The following figure shows the data rates for a simple FORTRAN loop to load (y = y+x(i)), store (y(i)=1), and copy (y(i)=x(i)) for different vector sizes. Data is also included for four threads. (Beware of the linear interpolation between data points, and note that we need to extend the test beyond 128 MB to get out of the SP4 L3 cache. It has been suggested that the "dcbz" SP4 instruction, which allocates the target cache line in the L2 without loading it from memory first, could further improve SP4 performance. Also see McCalpin's stream2 benchmark.) (Revised 4/10/03)

The MAPS benchmark also characterizes memory access performance. Plotted are load/store bandwidth for sequential (stride 1) and random access. Load is calculated from s=s+x(i)*y(i) and store from x(i)= s. For comparison we include the NEC SX-6 data from the MAPS website. (Revised 4/8/03).

The tabletoy benchmark (C) makes random writes of 64-bit integers; parallelization is permitted with possibly non-coherent updates. The X1 number is for vectorizing the inner loop (multistreaming was an order of magnitude slower, at 88 MBs). The data rate in the following table is for a 268 MB table. We include multi-threaded Altix, Opteron, SP3 (NERSC), and SP4 data as well. Revised 10/21/03

MBs (using wallclock time)

threads    sp4   altix   X1-msp   opteron   sp3
------------------------------------------------
      1     26      42     1190        33     8
      2     47      45                 64    26
      4     98      62                       53
      8    174      86                       90
     16    266      69                      139
     32    322      77

The stream benchmark is a program that measures main memory throughput for several simple operations. The aggregate data rate for multiple threads is reported in the following table. Recall that the "peak" memory data rate for the X1 is 24 GBs/MSP, for the Alpha is 5.2 GBs, for the p690 is 51 GBs/MCM, and for the SP3 is 1.6 GBs. Data for the 16-way SP3 (375 MHz, Nighthawk II) at NERSC is included too. Data for the Alpha ES45 (1 GHz, 8 GBs memory bw) is obtained from the STREAM database. Data for p690/sp4 is with affinity enabled (6/1/02). The X1 uses (aprun -A), and we include the data rates for a single SSP as well as the aggregate rate for 4 SSPs running separate copies of the single stream test. All tests use stride 1. (Revised 8/13/03)

MBs          copy   scale     add   triad
------------------------------------------
X1          22111   21634   23658   23752
X1-ssp-1     8167    8002   11061   10763
X1-ssp-4    30437   30431   41465   40715
alpha-1      1339    1265    1273    1383
es45-1       1946    1941    1978    1978
SP3-1         523     561     581     583
SP3/16-1      486     494     601     601
SP4-1        1774    1860    2098    2119

copy:  x(i)=y(i)
scale: x(i)=A*y(i)
add:   z(i)=x(i)+y(i)
triad: z(i)=x(i)+A*y(i)

The following log-log plot shows the aggregate shared-memory performance for the STREAM triad operation. We plot both X1 MSP and SSP performance. The SX-6 data is from the STREAM website.
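The four STREAM kernels move different numbers of words per iteration, which is why add and triad can report higher MBs than copy and scale for the same loop time. The sketch below shows the byte accounting the STREAM benchmark uses (two 8-byte words per iteration for copy and scale, three for add and triad, with 1 MB = 10^6 bytes). The function names are illustrative, and the pure-Python triad is only there to make the kernel concrete, not to measure anything.

```python
# STREAM-style bandwidth accounting (illustrative names, not the STREAM source).

BYTES_PER_WORD = 8
# Words read plus words written per loop iteration.
WORDS_PER_ITERATION = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_mbs(kernel, n, seconds):
    """Data rate in MB/s (10^6 bytes) for n iterations taking `seconds`."""
    bytes_moved = WORDS_PER_ITERATION[kernel] * BYTES_PER_WORD * n
    return bytes_moved / seconds / 1e6


def triad(x, y, a):
    """The triad kernel z(i) = x(i) + A*y(i) on plain Python lists."""
    return [xi + a * yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    # One million elements in one millisecond (made-up timing).
    print(stream_mbs("copy", 1_000_000, 0.001))
    print(stream_mbs("triad", 1_000_000, 0.001))
    print(triad([1.0, 2.0], [3.0, 4.0], 2.0))
```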

We have included our X1 results for 16 MSPs in the following table from the stream top 20 (5/27/03).

Machine ID                     ncpus   COPY    SCALE    ADD    TRIAD
------------------------------------------------------------------------
NEC_SX-7                          32 876174.7 865144.1 869179.2 872259.1
NEC_SX-5-16A                      16 607492.0 590390.0 607412.0 583069.0
SGI_Altix_3000                   256 414573.0 412108.0 485323.0 488274.0
NEC_SX-4                          32 434784.0 432886.0 437358.0 436954.0
HP_AlphaServer_GS1280             64 347712.3 341890.6 373126.5 377727.8
Cray_T932_321024-3E               32 310721.0 302182.0 359841.0 359270.0
CRAY X1 MSP                       16 306891.9 296893.8 334403.7 311499.9  <---
NEC_SX-6                           8 202627.2 192306.2 190231.3 213024.3
Cray_C90                          16 105497.0 104656.0 101736.0 103812.0
SGI_Origin3800-500               256  87019.5  85514.4 101695.6  99680.2
HP_Integrity_SuperDome            64  82695.0  82476.0  83013.0  84223.0
IBM_eServer_p690+                 32  51455.0  53425.0  58651.0  58891.0
Sun_F15K                          72  54665.4  47703.7  46090.7  50724.3
SGI_Origin2000-250               256  42824.2  43213.5  48285.8  49275.5
Cray_SV1ex                        32  42317.8  42237.9  47829.8  47821.9
HP_AlphaServer_ES80                8  39898.0  40532.0  44519.0  44467.0
IBM_eServer_p670+                 16  32947.0  33673.0  35925.0  36818.0
IBM_eServer_p690_Turbo            32  28611.0  28994.0  32222.0  32249.0
Cray_Y-MP                          8  19291.6  19294.2  26588.9  26802.2
HP_SuperDome_750                  64  25762.3  21769.9  25675.0  26549.2
IBM_eServer_p690_HPC              16  20267.0  20265.0  24706.0  25058.0

The following plot illustrates the triad memory bandwidth as a function of stride. On the X1, bandwidth improves with no-caching ( !dir$ no_cache_alloc a,b,c ).

The hint benchmark measures computation and memory efficiency as the problem size increases. (This is C hint version 1, 1994.) The following graph shows the performance of a single processor for the X1 (12.2 MQUIPS), Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS). The L1 and L2 cache boundaries are visible, as well as the Altix and SP4's L3.

The lmbench benchmark measures various UNIX and system characteristics. Here are some preliminary numbers for runs on a service and compute node of alpha and SP3/4 (version 2) and X1 MSP. Some of the X1 numbers are relatively slow or may not be accurate.

LOW LEVEL BENCHMARKS (single processor)

The following table compares the performance of the X1, Alpha, and SP for basic CPU operations. These numbers are from the first few kernels of EuroBen's mod1ac. The 14th kernel (9th degree poly) is a rough estimate of peak FORTRAN performance since it has a high re-use of operands. We supply both SSP and MSP results for the X1. (Revised 7/30/03)

                alpha    sp3    sp4   X1-MSP   X1-SSP
------------------------------------------------------
broadcast         516    368   1946     3773     1275
copy              324    295    991     2841      909
addition          285    186    942     2322      792
subtraction       288    166    968     2881      803
multiply          287    166    935     2887      793
division           55     64     90      612      160
dotproduct        609    655   2059     6250     1379
X=X+aY            526    497   1622     4662     1590
Z=X+aY            477    331   1938     4731     1588
y=x1x2+x3x4       433    371   2215     3879     1198
1st ord rec.      110    107    215       56       53
2nd ord rec.      136     61    268       50       47
2nd diff          633    743   1780     6949     1564
9th deg. poly     701    709   2729    10655     2896

basic operations (Mflops), euroben mod1ac

The following table compares the performance of various intrinsics (EuroBen mod1f). For the SP, it also shows the effect of -O4 optimization versus -O3. (Revised 7/30/03)

        alpha   sp3 -O4   sp3 -O3   sp4 -O4   X1-msp   X1-ssp
---------------------------------------------------------------
x**y      8.3       1.8       1.6       7.1     45.1     12.3
sin        13      34.8       8.9      64.1     94.4     23.5
cos      12.8      21.4       7.1      39.6     95.1     23.6
sqrt     45.7      52.1      34.1      93.9      648      153
exp      15.8      30.7       5.7      64.3      250     63.8
log      15.1      30.8       5.2      59.8      183     45.4
tan       9.9      18.9       5.5      35.7     80.8     19.2
asin     13.3      10.4      10.2      26.6      110     27.3
sinh     10.7       2.3       2.3      19.5     86.2     21.1

intrinsics (Mcalls/s), euroben mod1f (N=10000)

The following table compares the performance (Mflops) of a simple FORTRAN matrix multiply (REAL*8 400x400) with the performance of DGEMM from the vendor math library (-lcxml for the Alpha, -lsci for the X1, -lessl for the SP). Note that the SP4 -lessl (3.3) is tuned for the Power4. Also, the Mflops for 1000x1000 Linpack are reported from netlib, except that the sp4 number is from IBM. (Revised 4/8/03)

          alpha    sp3    sp4     X1
-------------------------------------
ftn          72     45    220   7562
lib        1182   1321   3174   9482
linpack    1031   1236   2894   3955

In the following graph, the performance of the vendor library version of DGEMM is illustrated. The plot includes data from the new Compaq ES45 (1 GHz). The p690 achieves only 65% of peak because of insufficient rename registers. The X1, Alpha, and sp3 get a much higher percentage of peak. Also see another DGEMM plot in the shared-memory section below. Revised 5/14/04.
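To relate the DGEMM numbers to the "percentage of peak" remark, the following sketch divides the "lib" row of the 400x400 table by the per-processor peaks from the architecture table (2*MHz for the Alpha, 4*MHz for the SP3 and SP4, 12.8 Gflops for the X1 MSP). It also shows the conventional 2n^3 operation count for an n x n matrix multiply; that this is the count used for the table is an assumption. The table covers a single matrix size, so these fractions need not match values read off the graph.

```python
# Fraction-of-peak arithmetic for the DGEMM "lib" row (illustrative names).

PEAK_MFLOPS = {"alpha": 2 * 667, "sp3": 4 * 375, "sp4": 4 * 1300, "X1": 12800}
DGEMM_LIB_MFLOPS = {"alpha": 1182, "sp3": 1321, "sp4": 3174, "X1": 9482}


def matmul_mflops(n, seconds):
    """Mflops for an n x n matrix multiply, assuming 2*n^3 operations."""
    return 2.0 * n ** 3 / seconds / 1e6


def fraction_of_peak(machine):
    """Library DGEMM rate divided by the machine's peak rate."""
    return DGEMM_LIB_MFLOPS[machine] / PEAK_MFLOPS[machine]


if __name__ == "__main__":
    for machine in PEAK_MFLOPS:
        print(f"{machine:6s} {100 * fraction_of_peak(machine):5.1f}% of peak")
```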

The following graph compares the vendor library implementation of an LU factorization (DGETRF) using partial pivoting with row interchanges.

The following graph compares the vendor library implementation of DAXPY. The effect of the cache is apparent for both the X1 (2 MB) and Altix (6 MB). Revised 2/9/04

The following graph compares optimized FORTRAN performance (no sci/essl/cxml) for Euroben mod2a, matrix-vector dot product and product. (Revised 4/8/03)

The following table compares the single processor performance (Mflops) of the Alpha, IBMs, and X1 for the Euroben mod2g, a 2-D Haar wavelet transform test. (Revised 4/8/03)

|-----------------------------------------------------------------
|    Order    |   alpha   |    SP3    |    SP4    |    X1    |
|  n1  |  n2  | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s)|
|-----------------------------------------------------------------
|   16 |   16 |  142.56   |  79.629   |  126.42   |   10.5   |
|   32 |   16 |  166.61   |  96.690   |  251.93   |   13.8   |
|   32 |   32 |  208.06   |  115.43   |  301.15   |   20.0   |
|   64 |   32 |  146.16   |  108.74   |  297.26   |   22.7   |
|   64 |   64 |  111.46   |  111.46   |  278.45   |   25.9   |
|  128 |   64 |  114.93   |  101.49   |  251.90   |   33.3   |
|  128 |  128 |  104.46   |  97.785   |  244.45   |   48.5   |
|  256 |  128 |  86.869   |  64.246   |  179.43   |   45.8   |
|  256 |  256 |  71.033   |  44.159   |  103.52   |   46.7   |
|  512 |  256 |  65.295   |  41.964   |  78.435   |   52.1   |
|-----------------------------------------------------------------

The following plots the performance (Mflops) of Euroben mod2b, a dense linear system test, for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl). (Revised 5/14/04)

The following plots the performance (Mflops) of Euroben mod2d, a dense eigenvalue test, for both optimized FORTRAN and using the BLAS from the vendor library. For the Alpha, -O4 optimization failed, so this data uses -O3. (Revised 5/14/04)

The following plots the performance (iterations/second) of Euroben mod2e, a sparse eigenvalue test (no vendor libraries). At this time, the Cray FORTRAN compiler seems unable to either vectorize or stream this code. (Revised 4/9/03)

The following figures show the FORTRAN Mflops for one processor for various problem sizes for the EuroBen mod2f, a 1-D FFT (complex to complex). The first plot includes initialization in the Mflops; the second plot is for the transform only. (Revised 8/1/03)

The following compares a 1-D FFT using the FFTW benchmark. We were unable to run FFTW successfully on the Cray X1, in part, we suspect, because FFTW is targeted toward non-vector architectures.

The following graph plots 1-D FFT performance using the vendor library (-lsci or -lessl); initialization time is not included. Revised 5/14/04

MESSAGE-PASSING BENCHMARKS

Internode communication can be accomplished with IP, PVM, or MPI. We report MPI performance over the Alpha Quadrics network and the IBM SP. Each SP node (4 CPUs) shares a single network interface. However, each CPU is a unique MPI end point, so one can measure both inter-node and intra-node communication. The following table summarizes the measured communication characteristics between nodes of the X1, Alpha, SP3, and SP4. The SP4 is currently based on the Colony switch via PCI. Latency is for an 8-byte message. (Revised 11/26/03)

                        alpha    sp3    sp4      X1
-----------------------------------------------------
latency (1 way, us)       5.4   16.3     17     7.3  (3.8 SHMEM, 3.9 coarray)
bandwidth (echo, MBs)     199    139    174   12125
MPI within a node         622    512   2186

latency (min, 1 way, us) and bandwidth (MBs)

                      latency (us)   bandwidth (MBs)
-----------------------------------------------------
X1 node                       7.3             11776
X1 MSP                        7.3             12125
altix cpu                     1.1              1968
altix node                    1.1              1955
alpha node                    5.5               198
alpha cpu                     5.8               623
alpha IP-sw                   123                77
alpha IP-gigE/1500             76                44
alpha IP-100E                  70                11
sp3 node                     16.3               139
sp3 cpu                       8.1               512
sp4 node                        7               975  (Federation)
sp4 node                        6              1702  (Federation dual rail)
sp4 node                       17               174  (PCI/Colony)
sp4 cpu                         3              2186
sp3 IP-sw                      82                46
sp3 IP-gigE/1500               91                47
sp3 IP-gigE/9000              136                84
sp3 IP-100E                    93                12

The following graph shows bandwidth for communication between two processors on the same node using MPI from both EuroBen's mod1h and ParkBench comms1. Within a node, shared memory can be used by MPI. (Revised 4/9/03).

The sp4 is presently equipped with a Colony switch for inter-node communication but is limited by the PCI interface at this time (May, 2002). (Revised 4/10/03)

The following graph shows the minimum latency (one-way, e.g., half of RTT) for an 8 byte message from MSP 0 to more distant MSPs. The red is our previous data with 64 MSPs, the green is data from our 128 MSP configuration, the blue is from 11/26/03, and the light blue is on our 512 MSP configuration (6/2/04).

The HALO benchmark is a synthetic benchmark that simulates the nearest neighbour exchange of a 1-2 row/column "halo" from a 2-D array. This is a common operation when using domain decomposition to parallelize (say) a finite difference ocean model. There are no actual 2-D arrays used; instead, the copying of data from an array to a local buffer is simulated, and this buffer is transferred between nodes. The following compares the performance of MPI, co-arrays, and SHMEM using 9 and 16 X1 MSPs and OpenMP on 9 and 16 SSPs. Revised 12/17/03.

For comparison, we have included the Halo results for the X1 and ORNL's SP4 in the following table from Wallcraft ('98). (Revised 12/17/03)

                                   LATENCY (us)
MACHINE         CPUs   METHOD       N=2   N=128
------------------------------------------------
Cray X1           16   co-array      36      31
IBM SP4           16   MPI           27      32
SGI Altix         16   SHMEM         14      40
Cray X1           16   OpenMP        13      13  (SSP)
Cray X1           16   SHMEM         35      47
SGI Altix         16   OpenMP        15      48
Cray T3E-900      16   SHMEM         20      68
SGI Altix         16   MPI           19      72
SUN E10000        16   OpenMP        24     102
Cray X1           16   MPI           91     116
SGI O2K           16   SHMEM         36     113
SGI O2K           16   OpenMP        33     119
IBM SP4           16   OpenMP        58     126
HP SPP2000        16   MPI           88     209
IBM SP            16   MPI          137     222
SGI O2K           16   MPI          145     247

The Halo benchmark also compares various algorithms within a given paradigm. The following compares the performance using various MPI methods on 16 MSPs for different problem sizes. Revised 11/26/03

The following graph compares MPI for the HALO exchange on 4 and 16 processors. For smaller message sizes, the IBM outperforms the X1. It is interesting that the X1 times are much higher than its 8-byte message latency. Revised 11/26/03

The following table shows the performance of aggregate communication operations (barrier, broadcast, sum-reduction) using one processor per node (N) and all processors on each node (n). Recall that the sp4 has 32 processors per node (sp3 and alpha, 4 per node). Communication is between MSPs on the X1 except for the UPC data. Times are in microseconds. (Revised 7/7/03)

mpibarrier (average us)

cpus  alpha-N  alpha-n  sp3-N  sp3-n  sp4-n  X1-mpi  X1-shmem  X1-coarray  X1-upc
----------------------------------------------------------------------------------
   2        7       11     22     10      3       3       3.0         3.2     6.1
   4        7       16     45     20      5       3       3.2         3.4     7.1
   8        8       18     69    157      7       5       4.8         4.9     8.5
  16        9       21     93    230      9       6       5.6         5.8     7.3
  32       11       28    118    329     10       5       6.3         6.6    11.0
  64                37    145    419              5       7.1         7.2    12.1
 128                                              6      10.0         9.9
 300                                              9      19.9        24.3
 504                                             10      19.0        17.7

mpibcast (8 bytes)

cpus  alpha-N  alpha-n  sp3-N  sp3-n  sp4-n  X1-mpi  X1-shmem  X1-coarray  X1-upc
----------------------------------------------------------------------------------
   2      9.6     12.5    5.4    6.7    3.2     5.9       1.4          .3       0
   4     10.4     20.3    9.4    9.4    6.2     7.2       4.1          .8     0.5
   8     11.4     28.5   13.4   17.5    8.4    10.5      10.0         1.2     1.0
  16     12.5     32.9   17.0   20.9    9.8    16.3      20.4         1.9     1.2
  32     13.8     41.4   19.3   24.1   11.3    27.5      41.6         4.0     1.5
  64              48.7   23.6   30.8           48.1        83         7.9     2.7

mpireduce (SUM, doubleword)

cpus  alpha-N  alpha-n  sp3-N  sp3-n  sp4-n    X1
--------------------------------------------------
   2        9       11      8      9      6     8
   4      190      207     29    133      9    11
   8      623      350    271    484     13    15
  16     1117      604    683   1132     18    19
  32     3176     1991   1613   2193     29    23
  64              5921   2841   3449           31

mpiallreduce (SUM, doubleword)

cpus  altix     X1
-------------------
   2    5.7   11.4
   4   14.5   16.3
   8     22   24.9
  16   30.3   31.6
  32   39.3   45.5
  48   47.4   53.2
  64   58.9   66.1

96 cpus: 80.0 (only one of the two columns is reported)

The following compares the time for collective MPI_allreduce (doubleword sum). Also included are results for SHMEM, co-array, and upc. The X1 upc code uses a single lock for updating the shared global sum. Revised 6/12/04.

A simple bisection bandwidth test has N/2 processors sending 1 MB messages to the other N/2. (Revised 6/2/04).

Aggregate datarate (MBs)

cpus    sp4   alpha       X1   Altix
-------------------------------------
   2    138     195    12412    1074   (X1: half populated cabinets 11/26/03)
   4    276     388    16245    1150
   8    552     752    15872    1304
  16   1040    1400    32626    2608
  32      3510 (*)     29516    2608
  48                   35505    5064
  64                   55553    5120
  96                   44222    7632
 128                   59292   10170
 200                  139536
 252                  168107
 256                   49664           (X1: full cabinets 6/2/04)
 300                   68595
 400                  120060
 500                  158350
 504                  167832

(*) The 32-cpu row reports 3510 for only one of sp4 and alpha.
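As a rough illustration of how an aggregate figure like those in the bisection table can be computed, the sketch below pairs each of the first N/2 processors with a partner in the other half and sums the bytes moved. The pairing of i with i+N/2, the repetition count, and taking 1 MB as 10^6 bytes are assumptions made for the example; the names are illustrative and the timing is made up.

```python
# Bisection test bookkeeping (illustrative; pairing and units are assumptions).

def bisection_pairs(n):
    """Pair processor i in the lower half with processor i + n/2."""
    if n % 2:
        raise ValueError("n must be even")
    half = n // 2
    return [(i, i + half) for i in range(half)]


def aggregate_mbs(n, msg_bytes, reps, seconds):
    """Aggregate MB/s when every pair sends `reps` messages in `seconds`."""
    total_bytes = len(bisection_pairs(n)) * msg_bytes * reps
    return total_bytes / seconds / 1e6


if __name__ == "__main__":
    print(bisection_pairs(8))
    # Hypothetical run: 8 processors, 1 MB messages, 10 repetitions in 10 ms.
    print(aggregate_mbs(8, 1_000_000, 10, 0.01))
```

Dividing the aggregate rate by the number of pairs gives a per-pair rate, which is a convenient way to see where the aggregate stops scaling with processor count.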

The following compares the aggregate MPI bandwidth for processor pairs doing an exchange, where node i exchanges with node i+n/2. For smaller messages, the Altix outperforms the X1. (For X1-n, n represents the number of processors.) The second figure shows the effective per-pair exchange data rate. Revised 6/2/04

External networking

The Cray X1 uses a Linux network service processor connected by IP over Fibre Channel to the X1 and by GigE (jumbo) to the local network. ORNL's Steven Carter has provided the following TCP performance results for the X1 and its network frontend. When the X1 does a TCP connect, the window-scale is set to 4, which implies a maximum socket buffer size of 1 MB. For a listening TCP connection, the X1 advertises a window-scale of only 3. UDP data rates are poor for the X1. With 1460-byte datagrams the Cray can only send/receive at about 8 mbs. With 8192-byte datagrams (the network frontend is attached to a jumbo-frame GigE net), the X1 UDP rate is only 50 mbs. It almost appears that the X1 has a packets-per-second limit, so the bigger the frames the better.
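The link between the window-scale option and the 1 MB buffer limit is simple arithmetic: the TCP header carries a 16-bit window, and the window-scale option shifts it left by the advertised amount. The sketch below works that out for scale factors 4 and 3, and bounds single-stream throughput by window/RTT. The function names are illustrative, and the 70 ms round-trip time is a hypothetical value chosen only to make the example concrete.

```python
# TCP window-scale arithmetic (illustrative names; the RTT is hypothetical).

def max_window_bytes(window_scale):
    """Largest window a 16-bit window field can advertise at a given scale."""
    return 65535 << window_scale


def max_throughput_mbps(window_bytes, rtt_seconds):
    """Upper bound on one stream's throughput in Mbit/s: window / RTT."""
    return window_bytes * 8 / rtt_seconds / 1e6


if __name__ == "__main__":
    for scale in (4, 3):
        window = max_window_bytes(scale)
        print(f"scale {scale}: {window} bytes, "
              f"{max_throughput_mbps(window, 0.070):.1f} Mbit/s at 70 ms RTT")
```

A window-scale of 3 halves the ceiling compared with 4, so on a long path a connection accepted by the X1 is more tightly window-limited than one it initiates.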

Carter combined the Net100/Web100 kernel with Cray's Linux kernel on our secondary X1 network servers (CNS1). Even without using the Net100 tuning daemon, the autotuning feature and the larger windows of the Net100 kernel improved wide area network TCP transfers from the X1 to LBL by a factor of 4.

The best end-user data rates for file transfers (wide-area) are about 400 mbs using bbcp (parallel TCP streams).

SHARED-MEMORY BENCHMARKS

Both the Alpha and IBMs consist of a cluster of shared-memory nodes, each node with four processors sharing a common memory (16 for X1 and 32 for sp4). The X1 is cache-coherent within a node, but the memory space is global across all nodes. We tested the performance of a shared-memory node with various C programs with explicit thread calls and with FORTRAN OpenMP codes. We also evaluate the performance of SHMEM and co-arrays. We have yet to test the X1 explicit (pthread) thread management.

The X1 pthreads model permits up to 16 SSP threads (-h ssp) or 4 MSP threads where each thread can also be multithreaded on each MSP. The following table shows the performance of thread/join in C as the master thread creates two, three, and four threads. The test repeatedly creates and joins threads. Revised 12/17/03.

threads   alpha   sp3   sp4   altix      x1
--------------------------------------------
      2    47.7    96    44     399   29695
      3     165   152    68     842   55439
      4     251   222    97    1241   79180

thread create/join time in microseconds (C)

Often, it is more efficient to create the threads once and then provide them work as needed. I suspect this is what FORTRAN OpenMP is doing for "parallel do". The following table shows the performance of parallel do. Revised 12/17/03

threads   alpha    sp3   sp4   altix     x1
--------------------------------------------
      2     2.1   12.7   6.3     4.8   12.1
      3     3.4   15.3   8.4     6.3   13.2
      4     5.2   19.5   9.5     6.5   17.4

OpenMP parallel DO (us)

The following table shows the time required to lock-unlock using pthread_mutex_lock with various numbers of threads. For the IBMs we use setenv SPINLOOPTIME 5000. Revised 12/17/03

threads   alpha   sp3   sp4   altix     x1
-------------------------------------------
      1    0.26   0.6   0.3    0.07    4.9
      2     1.5   1.4   1.3     2.6    295
      3    17.8   2.1   1.6    41.5   1317
      4    29.6   2.9   3.8    73.2   1703

time for lock/unlock (us)

The accompanying graph shows the time to lock/unlock a System V semaphore when competing with other processors.

The following table compares the performance of simple C barrier program using a single lock and spinning on a shared variable along with pthread_yield. Revised 12/17/03

threads   alpha    sp3   sp4   altix     x1
--------------------------------------------
      1    0.25    0.6   0.3    0.08    4.9
      2    1.36    4.4   1.9     0.5   97.7
      3     9.9   20.5   3.1    18.1   95.2
      4      65   34.6   3.7    53.2   99.3

C barrier times (us)

The following table illustrates linear speedup for an embarrassingly parallel integration. A C code with explicit thread management is compared with FORTRAN OpenMP. Both just used -O optimization. The Cray C pthread implementation does not scale well for SSPs or MSPs. Revised 4/21/03

         FORTRAN OpenMP                      | C threads                                 | UPC
threads  alpha    sp3    sp4  altix x1-ssp |  alpha    sp3    sp4  altix     x1    msp |    msp
------------------------------------------------------------------------------------------------
      1    252    102    251    891    676 |    166     52    216    558    669   2480 |   2650
      2    502    204    501   1775   1354 |    331    104    432   1114   1154   2957 |   5053
      3    748    306    752   2312   2026 |    496    157    648   1668   1513   2884 |   7340
      4    990    408   1002   3519   2695 |    657    206    864   2221   1732   2805 |  10031
      8                 1999   6815   5336 |                 1725   4410   1789        |  18038
     16                 3565  12039  10470 |                 3429   8580   1254        |  43160

rectangle rule (Mflops), -O optimization

The following table illustrates an explicit thread implementation of Cholesky factorization of a 1000x1000 double precision matrix in C (-O optimization). Revised 12/17/03

threads   alpha   sp3    sp4   altix     x1    msp
---------------------------------------------------
      1     150   125    350     196    525    476
      2     269   238    631     341    848    733
      3     369   353   1007     512   1096    942
      4     435   390   1306     621    797   1087

cholp 1k matrix factor (mflops), -O optimization
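The rectangle-rule integration measured above is embarrassingly parallel because each worker can sum its own block of rectangles with no communication until the final add. The sketch below shows that decomposition in Python. The integrand 4/(1+x^2) on [0,1] (whose integral is pi) is an assumption, since the page does not say which function was integrated, and all names are illustrative. In standard CPython the threads will not produce a speedup because of the global interpreter lock, so this shows the structure of the computation rather than its performance.

```python
# Rectangle (midpoint) rule split across workers; integrand is an assumption.
import math
from concurrent.futures import ThreadPoolExecutor


def f(x):
    # Assumed integrand: 4/(1+x^2), which integrates to pi over [0, 1].
    return 4.0 / (1.0 + x * x)


def partial_sum(first, last, n):
    """Area of rectangles first..last-1 out of n equal-width rectangles."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(first, last)) * h


def rectangle_rule(n, workers):
    """Integrate f over [0, 1] with n rectangles divided among workers."""
    # Block boundaries: worker w handles rectangles bounds[w]..bounds[w+1]-1.
    bounds = [n * w // workers for w in range(workers + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(partial_sum, bounds[:-1], bounds[1:], [n] * workers)
        return sum(parts)


if __name__ == "__main__":
    approx = rectangle_rule(100_000, 4)
    print(approx, abs(approx - math.pi))
```

The only shared step is the final sum of the partial results, which is why a well-behaved thread implementation should show near-linear speedup on this test.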

The following graph plots the ping-pong bandwidth between two MSP nodes on the X1 using MPI, co-arrays, and SHMEM. A synch is done after each put and get.

The following plots the aggregate memory bandwidth for different number of MSP's doing a put or get of repeated 1 MB messages to the same target node which is doing a sleep(). Each MSP has its own array on the target. Revised 2/27/04.

The following graphs illustrate the effect of "distance" on get's and put's for both co-arrays and shmem. Node 0 is put/get'ing from node n-1 in the following plots. Revised 2/27/04

The following plot illustrates the latency in doing a "put" or "get" of a double-word (8 bytes) with coarrays. Note the slower time when the target processor is on the same node -- MSP's share a cache within a node. Revised 3/15/04

The following plot illustrates a SHMEM GET hotspot, where one or more processors are all trying to fetch the same 64-bit word from processor 0. The Y-axis is the average time (microseconds) for the SHMEM GET. Updated 7/9/04

The following illustrates the time to pass a 64-bit "hot potato" from one processor to the next using SHMEM PUT with each processor spinning on the "volatile shared" variable. The Y-axis is the average time for a single revolution. Updated 7/9/04

The following graph plots the bandwidth for doing an exchange of messages for various message sizes. The MPI implementation uses a repetition of IRECV/SEND/IWAIT for each message size; the SHMEM/co-array versions do a repetition of PUTs and then a synch. The X1 MPI is slower than the IBM for small messages. Revised 9/3/03.

The following pair of graphs shows aggregate exchange bandwidth when 1 pair of processors and 64 pairs of processors do an exchange (processor i exchanges with i+n/2). The Altix SHMEM test is probably unrealistic in that data is not invalidated in the cache as the PUTs are repeated in the timing loop. Revised 4/9/04.

The following graph compares the HALO exchange times on 4 and 16 processors. The IBM is using OpenMP and the X1 co-arrays. Revised 11/26/03

HALO performance on 16 processors is illustrated in the next plot.

We have implemented the halo exchange in C (UPC) using subscripting and upc_memget. The following graph compares the performance of halo on 16 X1 MSPs.

The following graph illustrates the memory contention when one X1 processor (MSP) is running STREAM (triad data plotted) at the same time as 0 or more MSP's are either doing continuous 1 MB SHMEM put's or get's with the STREAM MSP. Each MSP has its "own" 1MB area on MSP 0. SHMEM data rate is aggregate. Revised 9/2/03.

The data point with 0 processors is for a stand-alone triad. We also provide stand-alone SHMEM put/get aggregate bandwidth from one to N MSP's to an idle MSP (sleep()) using repeated 1 MB messages.

We run the same test using co-arrays in place of SHMEM. Co-arrays provide higher aggregate throughput and seem to show slightly less interference with STREAM. Revised 11/29/04.

The following table compares FORTRAN OpenMP for the Alpha and SP with co-arrays on the X1 (MSP) and OpenMP on 4 MSP's and 4 SSPs on the X1 when doing a simple, double-precision Jacobi iteration on a 1000x1000 double precision array (tolerance = 10^-6). Note that the SP3 slows for 4 threads. Revised 12/17/03

Jacobi iterations per second

                               X1 openmp
CPUs  alpha  sp3  sp4  ca     msp   ssp   Altix
1     27     17   62   217    232   68    52
2     42     27   117  421    305   132   104
3     50     41   160  -      304   193   164
4     61     41   198  724    302   246   285
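The exact Jacobi kernel is not listed here, so the following is only a sketch of the kind of loop being timed: a five-point Jacobi sweep repeated until the largest change falls below the tolerance. The boundary condition (top edge held at 1.0), the function name, and the stopping rule are assumptions made for illustration.

```python
def jacobi(n, tol=1e-6, max_iters=100000):
    """Five-point Jacobi iteration on an n x n grid; returns (grid, iterations)."""
    # Assumed boundary condition: top edge fixed at 1.0, other edges at 0.0.
    u = [[0.0] * n for _ in range(n)]
    for j in range(n):
        u[0][j] = 1.0

    for it in range(1, max_iters + 1):
        new = [row[:] for row in u]
        diff = 0.0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # Each interior point becomes the average of its four neighbours.
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
                diff = max(diff, abs(new[i][j] - u[i][j]))
        u = new
        if diff < tol:
            return u, it
    return u, max_iters
```

In the OpenMP and co-array versions it is the outer loop over rows that would be split among threads or images; "iterations per second" in the table counts sweeps like the one above.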

The X1 supports OpenMP on up to 16 SSP's within a node (4 MSP's) when compiled with -O ssp -O task1, or you can compile in MSP mode and use up to 4 OpenMP threads (with OMP_NUM_THREADS or aprun -d 4) and get both streaming and threading (again using up to 16 SSP's). The IBM p690 OpenMP and their SMP version of essl (-lpessl) support all 32 processors on a p690 node. All of the Altix processors can be used with OpenMP. We've done some testing with the OpenMP microbenchmarks. The following compares OpenMP performance between the X1 (SSPs), sp4, and the SGI Altix. Revised 12/23/03

The following graph compares the library DGEMM using 1 and 4 processors.

PARALLEL KERNEL BENCHMARKS

Both ParkBench and EuroBen (euroben-dm) had MPI-based parallel kernels. However, the euroben-dm communication model was to have the processes do all of their send's before issuing receive's. On the SP, this model resulted in deadlock for the larger problem sizes. The EAGER_LIMIT can be adjusted to make some progress on the SP3 but the deadlocks could not be completely eliminated, so we report only ParkBench MPI results.

The following table shows MPI parallel performance of the LU benchmark (64x64x64) for the X1, Alpha, and SP. This is a small problem size, and doesn't permit the X1 to fill the vector pipes. These tests used standard FORTRAN (no vendor libraries). (Revised 4/8/03)

LU aggregate Mflops

CPUs  alpha  sp3   sp4   X1    Altix
2     762    588   1377  639   976
4     1604   1188  2660  1339  1694
8     3265   2473  5310  2394  3275
16    5556   4771  9531  3884  5389

The following graph shows the aggregate Mflops for a multi-grid (MG) kernel from ParkBench/NAS Parallel Benchmark. This is for a 256x256x256 doubleword grid with MPI and Wallcraft's co-array version, and also OpenMP on the IBM. A Cray UPC (-h upc -O3) version of MG runs much slower; the difference is due to the C implementation versus FORTRAN and has little to do with UPC. Revised 4/22/04.
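Returning to the LU table, one way to read the aggregate Mflops is as scaling efficiency relative to the 2-processor run. The sketch below does that arithmetic on the tabulated values; the choice of 2 processors as the baseline and the function name are assumptions made for illustration.

```python
# Aggregate Mflops copied from the LU table (processor count -> Mflops).
LU_MFLOPS = {
    "alpha": {2: 762, 4: 1604, 8: 3265, 16: 5556},
    "sp3": {2: 588, 4: 1188, 8: 2473, 16: 4771},
    "sp4": {2: 1377, 4: 2660, 8: 5310, 16: 9531},
    "X1": {2: 639, 4: 1339, 8: 2394, 16: 3884},
    "Altix": {2: 976, 4: 1694, 8: 3275, 16: 5389},
}


def scaling_efficiency(system, nprocs, base=2):
    """Measured speedup over the `base` run divided by the ideal speedup."""
    rates = LU_MFLOPS[system]
    speedup = rates[nprocs] / rates[base]
    ideal = nprocs / base
    return speedup / ideal
```

For example, `scaling_efficiency("X1", 16)` compares the 16-processor X1 rate with eight times the 2-processor rate.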

The following graph shows the aggregate Mflops for a conjugate gradient (CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP. Revised 9/22/03

The following graph illustrates how longer vectors can improve X1 performance. The NPB FT (A) benchmark (1-D double complex FFT w/MPI) uses a default blocking factor of 16. The graph shows that by increasing the blocking to 64, X1 performance is improved by a factor of three. For comparisons, the effect of the blocksize on scalar multiprocessors is illustrated as well. Tests do not use vendor FFT libs. Revised 12/18/03.

The following plots the aggregate Mflop performance for ParkBench QR factorization (MPI) of a 1000x1000 double precision matrix using the vendor scientific libraries (essl/cxml/sci). This benchmark uses BLACS (SCALAPACK). Recall that the X1 and SP4 have 16 CPUs sharing memory, so we have included data (sp3-16) from the NERSC 16-way SP3 (375 MHz). The small problem size results in small vectors and poor X1 performance. These results are using the ParkBench version of BLACS. We saw little or no difference using CRAY's -lsci BLACS. (Revised 4/10/03)

As a further test of SCALAPACK performance, we compare the vendor libraries for matrix multiply (pdgemm) and LU factorization (pdgetrf) of 8000x8000 double precision matrices using a blocksize of 32. The Cray X1 does well on the distributed matrix multiply, but not on the LU factorization (D'Azevedo suspects pivoting). Revised 9/21/03.
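Mflops figures for tests like these come from dividing a nominal operation count by the measured time. The sketch below uses the standard leading-order counts, 2n^3 for an n x n matrix multiply and (2/3)n^3 for LU factorization; whether the plotted numbers used exactly these counts is an assumption, and the function names are hypothetical.

```python
def gemm_flops(n):
    """Leading-order flop count for an n x n by n x n matrix multiply."""
    return 2 * n**3


def lu_flops(n):
    """Leading-order flop count for LU factorization of an n x n matrix."""
    return 2 * n**3 / 3


def mflops(flops, seconds):
    """Convert an operation count and elapsed time to Mflops (10^6 flops/s)."""
    return flops / seconds / 1e6
```

Counting only n^3 operations for the multiply, rather than 2n^3, would report a rate half as large.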

For comparison, the vendor library single-processor LU performance on a 1000x1000 matrix is 1431 Mflops for the IBM SP4 (-lessl), 1995 Mflops for the Altix (-lscs), and 3543 Mflops for the X1 (-lsci).

A test of SCALAPACK's (MPI) pzswap (double complex) doing both row and col swaps on the X1, Altix (libscalapack), and an SP4 (16 processors) shows

                     swaptime (secs)
                     X1
matrix order  swap   lsci   scala   SP4    Altix
4000          row    2.7    19.7    2.2    2.4
              col    2.4    0.4     0.3    0.12
8000          row    12.7   80.7    9.2    9.5
              col    9.3    0.8     0.8    0.7
20000         row    21.7   472.1   36.0   31.3
              col    15.5   2.9     3.5    3.8

block = 64
16 processors distributed in ROW major order so col swaps are memory-to-memory
Revised 3/10/04

The X1 swap times are dominated by bcopy and MPI pack/unpack.

In contrast, the following plot shows the performance of high-performance Linpack (HPL) on 16 processors for the Cray X1 and IBM p690 with MPI and the vendor BLAS. HPL solves a (random) dense linear system in double precision (64 bits) using:

- Two-dimensional block-cyclic data distribution
- Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths
- Recursive panel factorization with pivot search and column broadcast combined
- Various virtual panel broadcast topologies
- Bandwidth reducing swap-broadcast algorithm
- Backward substitution with look-ahead of depth 1

Cray has reported 90% of peak using SHMEM instead of MPI on the X1. (Revised 7/10/03)
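The two-dimensional block-cyclic distribution mentioned above maps each matrix element to a process in a P x Q grid by dealing out blocks in round-robin fashion along each dimension. The sketch below shows that mapping; the 4 x 4 grid for 16 processors, the block size of 64, and the function names are illustrative assumptions rather than the settings used in these runs.

```python
def block_cyclic_owner(i, j, nb, prows, pcols):
    """Grid coordinates (pr, pc) of the process owning global element (i, j).

    Blocks of nb x nb elements are dealt cyclically over a prows x pcols grid.
    """
    return (i // nb) % prows, (j // nb) % pcols


def row_major_rank(pr, pc, pcols):
    """Linear rank of grid position (pr, pc) when ranks are laid out row-major."""
    return pr * pcols + pc
```

For example, with `nb = 64` on a 4 x 4 grid, element (256, 130) falls in block (4, 2), which wraps around to grid position (0, 2).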

links

Worley's Cray X1 evaluation results
LBNL preliminary X1 evaluation '03 draft and SC04 paper
x1 mm5 performance
origin 3000 specs, x1 uses same router chips ? and here
scalapack
NAS parallel benchmarks
mac vector benchmarks
cray lc benchmark
Wallcraft's HALO results and info
NAS MG Benchmark OpenMP, MPI, SHMEM, co-array comparison
FFT info and blacs fft and C fftw.org
new hpcc benchmarks
llcbench combines blasbench mpbench and cachebench
DGEMM on sp3, NERSC
co-array.org
upc gwu's upc lbl's upc and upc examples
netpipe net ping pong perf evaluator, shmem, mpi
CUG Cray Users group

See SGI Altix or Cray XD1 or Power 4 results.

Research Sponsors

Mathematical, Information, and Computational Sciences Division, within the Office of Advanced Scientific Computing Research of the Office of Science, Department of Energy. The application-specific evaluations are also supported by the sponsors of the individual applications research areas.