Dunigan's SGI Altix Testing

ORNL SGI Altix Evaluation (Dunigan)

.... this is work in progress.... last revised

Recent updates:

1/25/05 MPI retest 1.1 us 1.9 GBs
10/7 scalapack matrix multiply Mflops was factor of 2 too small
7/9/04 SHMEM GET hotspot and PUT hotpotato
3/9/04 aggregate exchange bandwidth plots
3/2/04 allreduce update
2/9/04 daxpy graphs revised (used repetitions)
1/14/04 DAXPY plot
12/26 SysV semaphore times
12/23 openmpbench update
12/16 triad bw for diff. strides
12/1 load/store/copy fortran data graphs
11/25 FFTW
10/21 exchange MPI with MPI_BUFFER_MAX 2048, fix MPI bandwidth, retest SHMEM/STREAM contention (none now), fix tabletoy
10/20 retest -- 1.5 GHz 6 MB L3
10/7 MPI updates with mpirun + dplace (2x improvement)
10/6 DGEMM and LU with Intel MKL
9/22 NAS CG added, vendor lib LU, adjust LU flop count scalapack
9/19 256 node latency and dplace test
9/18 256 node latency test; steven carter adds Net100 to altix kernel
9/17 redo tests with new compilers Version 7.1, Build 20030814
9/15 128 CPUs, scalapack tests
9/8 C thread/lock tests
9/5 OpenMP NPB lu ft sp, MPI QR
9/3 OpenMP tests
9/2 SHMEM/STREAM contention
8/27 initial tests on 64p 1.3 GHz

The results on this page are just low-level benchmarks.

NOTE: The tests on the SGI Altix use early versions of the compilers and libraries, and we expect performance to continue to improve with new releases. We are still discovering optimal compiler settings.

Check our Cray X1 results for latest Cray numbers. The IBM and HP/Compaq data are from runs in 2001-2002.

Oak Ridge National Laboratory (ORNL) is currently performing an in-depth evaluation of the SGI Altix system as part of its evaluation of early systems project. The primary tasks of the evaluation are to

understand the memory hierarchy of the Altix;
determine the most effective approaches to using the Altix;
evaluate benchmark and application performance, and compare with similar systems from other vendors;
evaluate system and system administration software reliability and performance;
predict scalability, both in terms of problem size and in number of processors.

The emphasis of the evaluation is on application-relevant studies for applications of importance to DOE. However, standard benchmarks are still important for comparisons with other systems. The results presented here are from standard benchmarks and some custom benchmarks and, as such, represent only one part of the evaluation. An IBM SP4, IBM Winterhawk II (noted as SP3 in the following tables and graphs) and Compaq Alpha ES40 at ORNL were used for comparison with the X1 and Altix. The results below are in the following categories:

architecture -- configuration summaries
benchmarks -- benchmark descriptions
memory performance -- memory performance
low-level results -- base CPU and intrinsics results
shared-memory results -- single node/thread performance
message passing results -- latency/bandwidth, collectives
kernels -- application kernel results

ARCHITECTURE

The Altix has 256 processors sharing memory. There are two system images (2x128), but the memory is global to all 256 processors (OpenMP, SHMEM, and MPI). The Cray X1 at ORNL has 128 nodes as of August 2003. Each node has 4 MSPs, each MSP has 4 SSPs, and each SSP has two vector units. The Cray "processor/CPU" in the results below is one MSP. All 4 MSPs on a node share memory. The Power4 consists of one node with 32 processors (4 MCMs) sharing memory. Both the Alpha and SP3 consist of four processors sharing memory on a single node. The following table summarizes the main characteristics of the machines.

Specs:        Altix    Alpha SC   SP3      SP4          X1
-----------------------------------------------------------------------
MHz           1500     667        375      1300         800
memory/node   512GB    2GB        2GB      32GB         16GB
L1            32K      64K        64K      32K          16K (scalar)
L2            256K     8MB        8MB      1.5MB        2MB (per MSP)
L3            6MB      -          -        128MB        -
peak Mflops   4*MHz    2*MHz      4*MHz    4*MHz        12.8 Gflops
peak mem BW   6.4GBs   5.2GBs     1.6GBs   51 GBs/MCM   26 GBs/MSP

Alpha: 2 buses @ 2.6 GBs each (Alpha ES45: 8 GBs).

The p690 memory subsystem can provide 205 GBs per node, and the X1 memory can provide 204 GBs per node. For the Alpha, nodes are interconnected with a Quadrics switch organized as a fat tree. The SP nodes are interconnected with cross-bar switches in an Omega-like network. The X1 uses a modified 2-D torus. The Altix uses a dual fat tree interconnect topology based on NUMALINK fabric.
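The peak Mflops entries in the architecture table are simply clock rate times floating-point results per cycle. The following is a minimal sketch of that arithmetic, using the clock rates and multipliers from the table; the dictionary layout and the function name are illustrative only and are not part of the evaluation codes.

```python
# Peak Mflops = clock (MHz) x floating-point results per cycle.
# Clock rates and multipliers are copied from the architecture table.
MACHINES = {
    "Altix":    (1500, 4),
    "Alpha SC": (667, 2),
    "SP3":      (375, 4),
    "SP4":      (1300, 4),
}


def peak_mflops(mhz, flops_per_cycle):
    """Theoretical peak Mflops for one processor."""
    return mhz * flops_per_cycle


PEAK = {name: peak_mflops(*spec) for name, spec in MACHINES.items()}
PEAK["X1"] = 12800  # the table quotes the X1 MSP peak directly as 12.8 Gflops

for name, mflops in PEAK.items():
    print(f"{name:9s} {mflops:6d} Mflops")
```

Dividing a measured Mflops figure by the corresponding entry gives the fraction of peak quoted later on this page.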

BENCHMARKS

We have used widely available benchmarks in combination with our own custom benchmarks to characterize the performance of the X1. Some of the older benchmarks may need to be modified for these newer, faster machines -- increasing repetitions to avoid 0 elapsed times, increasing problem sizes to test out-of-cache performance. Unless otherwise noted, the following compiler switches were used on the X1, Alpha, and SP.

X1:    -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha: -O4 -fast -arch ev6
SP:    -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000

Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP. We also compared performance with the vendor runtime libraries, sci (X1), cxml (Alpha), scs (Altix), and essl (SP). We used the following benchmarks in our tests:

ParkBench 2.1 -- provides low-level sequential and communication benchmarks, parallel linear algebra benchmarks, NAS parallel benchmarks, and compact application codes. Here is a summary of the benchmark modules. Codes are in FORTRAN. Results are often reported as least-squares fit of data. We report actual performance numbers.
EuroBen 3.9 -- provides serial benchmarks for low-level performance and application kernels (linear algebra, eigenvalue, FFT, QR). Here is a summary of the benchmark modules. euroben-dm provides some communication and parallel (MPI) benchmarks. The web site includes results from other systems.
lmbench -- provides insight into OS (UNIX) performance and memory latencies. The web site includes results from other systems.
stream -- measures memory bandwidth for both serial and parallel configurations. We also use the MAPS memory benchmark. The web sites include results from other systems.
Custom low-level benchmarks that we have used over the years in evaluating memory and communication performance.

For both the Alpha and the SP, gettimeofday() provides microsecond wall-clock time (though one has to be sure the MICROTIME option is set in the Alpha OS kernel). Both have high-resolution cycle counters as well, but the Alpha cycle counter is only 32 bits, so it rolls over in less than 7 seconds. For distributed benchmarks (MPI), the IBM and Alpha systems provide a hardware-synchronized MPI_Wtime() with microsecond resolution. On the Alpha, MPI_Wtime is frequency synchronized, but initial offsets are only approximate. (On the Alpha, it appears MPI_Init tries to provide an initial zero offset to the Elan counters on each node when an MPI job starts. On the SP3, we discovered several nodes that were not synchronized; a patch was eventually provided.) Time is not synchronized on the X1.
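The rollover figure for the Alpha cycle counter can be checked with one line of arithmetic. This sketch assumes the counter ticks once per CPU clock at the 667 MHz rate listed in the architecture table; the function name is hypothetical.

```python
def rollover_seconds(counter_bits, clock_hz):
    """Seconds until a free-running cycle counter of the given width wraps."""
    return 2 ** counter_bits / clock_hz


ALPHA_CLOCK_HZ = 667e6  # assumption: counter increments at the 667 MHz CPU clock

print(f"32-bit counter wraps after {rollover_seconds(32, ALPHA_CLOCK_HZ):.2f} s")
```

Any timed region longer than this interval needs either a wider counter or gettimeofday().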

MEMORY PERFORMANCE

The X1 MSP has a 2 MB L2 cache shared among the four SSPs. Both the SP3 and the Alpha have 64 KB L1 caches and 8 MB L2 caches. The SP4 has a 32 KB L1 (FIFO), a 1.4 MB L2 (shared between 2 processors), and a 128 MB L3. The following figure shows the data rates for a simple FORTRAN loop to load (y = y+x(i)), store (y(i)=1), and copy (y(i)=x(i)), for different vector sizes. Data is also included for four threads. (Beware of the linear interpolation between data points, and note we need to extend the test beyond 128 MB to get out of the SP4 L3 cache. It has been suggested that the "dcbz" SP4 instruction, which allocates the target cache line in the L2 without loading it from memory first, could further improve SP4 performance. Also see McCalpin's stream2 benchmark.) (Revised 12/1/03)

The MAPS benchmark also characterizes memory access performance. Plotted are load/store bandwidth for sequential (stride 1) and random access. Load is calculated from s=s+x(i)*y(i) and store from x(i)= s. Revised 10/20/03

The tabletoy benchmark (C) makes random writes of 64-bit integers in a shared memory. Parallelization is permitted, with possibly non-coherent updates. The X1 number is for vectorizing the inner loop (multistreaming was an order of magnitude slower, 88 MBs). The data rate in the following table is for a 268 MB table. We include multi-threaded Altix, Opteron, SP3 (NERSC), and SP4 data as well. Revised 10/21/03

MBs (using wallclock time)
threads    sp4  altix  X1-msp  opteron   sp3
--------------------------------------------
      1     26     42    1190       36     8
      2     47     45       -       65    26
      4     98     62       -      102    53
      8    174     86       -        -    90
     16    266     69       -        -   139
     32    322     77       -        -     -
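For readers unfamiliar with this style of test, the following is a rough single-threaded sketch of a random-update loop over a table of 64-bit words. It is not the tabletoy source: the XOR update, the table size, and counting 8 bytes per update are all assumptions made for illustration.

```python
import random
import time
from array import array


def random_update_rate(table_words, updates, seed=1):
    """Do `updates` random 64-bit writes into a table; return (table, MB/s)."""
    rng = random.Random(seed)
    table = array("Q", range(table_words))  # 64-bit unsigned entries
    start = time.perf_counter()
    for _ in range(updates):
        value = rng.getrandbits(64)
        # Assumed update rule: XOR the random value into a random slot.
        table[value % table_words] ^= value
    elapsed = time.perf_counter() - start
    mbytes = updates * 8 / 1e6  # assumption: count 8 bytes per update
    return table, mbytes / elapsed


table, rate = random_update_rate(1 << 16, 100_000)
print(f"{rate:.1f} MB/s (interpreted Python, not comparable to the table)")
```

Because successive indices are unrelated, almost every update misses cache once the table is much larger than the last-level cache, which is what makes the test sensitive to memory latency rather than bandwidth.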

The stream benchmark is a program that measures main memory throughput for several simple operations. The aggregate data rate for multiple threads is reported in the following table. Recall that the "peak" memory data rate for the X1 is 24 GBs/MSP, for the Alpha is 5.2 GBs, for the p690 is 51 GBs/MCM, and for the SP3 is 1.6 GBs. Data for the 16-way SP3 (375 MHz, Nighthawk II) at NERSC is included too. Data for the Alpha ES45 (1 GHz, 8 GBs memory bw) is obtained from the streams data base. Data for p690/sp4 is with affinity enabled (6/1/02). The X1 uses (aprun -A), and we include the data rates for a single SSP as well as the aggregate rate for 4 SSPs running separate copies of the single stream test. (Revised 10/20/03)

MBs        copy   scale     add   triad
----------------------------------------
altix      3183    3118    3795    3807
X1        22111   21634   23658   23752
alpha      1339    1265    1273    1383
es45       1946    1941    1978    1978
SP3         523     561     581     583
SP3/16      486     494     601     601
SP4        1774    1860    2098    2119

copy:  x(i)=y(i)
scale: x(i)=A*y(i)
add:   z(i)=x(i)+y(i)
triad: z(i)=x(i)+A*y(i)

The following log-log plot shows the aggregate shared memory performance for the STREAM triad operation. The SX-6 data is from the STREAM website. (Revised 10/20/03)
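The four STREAM kernels move different amounts of data per loop iteration, which is why the add and triad rates are not directly comparable to copy and scale on a per-iteration basis. The sketch below shows the usual STREAM byte accounting for 8-byte words (one count per array read or written); the function name and the sample numbers are hypothetical.

```python
# Bytes moved per loop iteration for 8-byte words, one count per array
# read or written (the conventional STREAM accounting).
BYTES_PER_ITERATION = {
    "copy":  16,  # x(i) = y(i)          : 1 read + 1 write
    "scale": 16,  # x(i) = A*y(i)        : 1 read + 1 write
    "add":   24,  # z(i) = x(i) + y(i)   : 2 reads + 1 write
    "triad": 24,  # z(i) = x(i) + A*y(i) : 2 reads + 1 write
}


def stream_mbs(kernel, n, seconds):
    """Data rate in MB/s (1 MB = 10**6 bytes) for n iterations of a kernel."""
    return BYTES_PER_ITERATION[kernel] * n / seconds / 1e6


# Hypothetical example: a 1,000,000-element triad that takes 0.5 s.
print(stream_mbs("triad", 1_000_000, 0.5), "MB/s")
```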

We have included our X1 results for 16 MSPs in the following table from the stream top 20 (10/7/03).

Machine ID                     ncpus   COPY    SCALE    ADD    TRIAD
------------------------------------------------------------------------
NEC_SX-7                          32 876174.7 865144.1 869179.2 872259.1
NEC_SX-5-16A                      16 607492.0 590390.0 607412.0 583069.0
SGI_Altix_3000                   256 414573.0 412108.0 485323.0 488274.0
NEC_SX-4                          32 434784.0 432886.0 437358.0 436954.0
HP_AlphaServer_GS1280             64 347712.3 341890.6 373126.5 377727.8
Cray_T932_321024-3E               32 310721.0 302182.0 359841.0 359270.0
CRAY X1 MSP                       16 306891.9 296893.8 334403.7 311499.9
NEC_SX-6                           8 202627.2 192306.2 190231.3 213024.3
Cray_C90                          16 105497.0 104656.0 101736.0 103812.0
SGI_Origin3800-500               256  87019.5  85514.4 101695.6  99680.2
HP_Integrity_SuperDome            64  82695.0  82476.0  83013.0  84223.0
IBM_eServer_p690+                 32  51455.0  53425.0  58651.0  58891.0
Sun_F15K                          72  54665.4  47703.7  46090.7  50724.3
SGI_Origin2000-250               256  42824.2  43213.5  48285.8  49275.5
Cray_SV1ex                        32  42317.8  42237.9  47829.8  47821.9
HP_AlphaServer_ES80                8  39898.0  40532.0  44519.0  44467.0
IBM_eServer_p670+                 16  32947.0  33673.0  35925.0  36818.0
IBM_eServer_p690_Turbo            32  28611.0  28994.0  32222.0  32249.0
Cray_Y-MP                          8  19291.6  19294.2  26588.9  26802.2
HP_SuperDome_750                  64  25762.3  21769.9  25675.0  26549.2
IBM_eServer_p690_HPC              16  20267.0  20265.0  24706.0  25058.0

The following graph illustrates the effect of various strides on memory bandwidth of triad.

The hint benchmark measures computation and memory efficiency as the problem size increases. (This is C hint version 1, 1994.) The following graph shows the performance of a single processor for the X1 (12.2 MQUIPS), Alpha (66.9 MQUIPS), Altix (102.6 MQUIPS), and SP4 (74.9 MQUIPS). The L1 and L2 cache boundaries are visible, as well as the Altix and SP4's L3. Revised 10/20/03

The lmbench benchmark measures various UNIX and system characteristics. Here are some preliminary numbers (revised 10/20/03) for runs on a service and compute node of alpha and SP3/4 (version 2).

LOW LEVEL BENCHMARKS (single processor)

The following table compares the performance of the X1, Alpha, and SP for basic CPU operations. These numbers are from the first few kernels of EuroBen's mod1ac. The 14th kernel (9th degree poly) is a rough estimate of peak FORTRAN performance since it has a high re-use of operands. Revised 10/20/03

basic operations (Mflops) euroben mod1ac

                alpha    sp3    sp4     X1  Altix
broadcast         516    368   1946   3773   2553
copy              324    295    991   2841   1758
addition          285    186    942   2322   1271
subtraction       288    166    968   2881   1307
multiply          287    166    935   2887   1310
division           55     64     90    612    213
dotproduct        609    655   2059   6250    724
X=X+aY            526    497   1622   4662   2707
Z=X+aY            477    331   1938   4731   2632
y=x1x2+x3x4       433    371   2215   3879   2407
1st ord rec.      110    107    215     56    142
2nd ord rec.      136     61    268     50    206
2nd diff          633    743   1780   6949   2963
9th deg. poly     701    709   2729  10655   5967
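The high operand re-use of the 9th-degree polynomial kernel is easy to see when the polynomial is written in Horner form: each point is loaded once and then used in nine multiply-add steps. The sketch below assumes Horner evaluation; the actual mod1ac kernel may be coded differently.

```python
def horner9(coeffs, x):
    """Evaluate a 9th-degree polynomial at x; coeffs[0] is the leading term."""
    if len(coeffs) != 10:
        raise ValueError("a 9th-degree polynomial has 10 coefficients")
    result = coeffs[0]
    for c in coeffs[1:]:
        result = result * x + c  # one multiply and one add per coefficient
    return result


# 9 multiplies + 9 adds for every value of x that is loaded.
FLOPS_PER_POINT = 2 * 9

print(horner9([1.0] * 10, 2.0))  # 2**9 + 2**8 + ... + 1
```

With the coefficients held in registers, the loop performs 18 flops for each load and store, so it is limited by the floating-point units rather than by memory.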

The following table compares the performance of various intrinsics (EuroBen mod1f). Revised 10/20/03

intrinsics (Mcalls/s) euroben mod1f (N=10000)

         alpha  sp3 -O4  sp3 -O3  sp4 -O4     X1  Altix
x**y       8.3      1.8      1.6      7.1   45.1   13.2
sin         13     34.8      8.9     64.1   94.4   22.9
cos       12.8     21.4      7.1     39.6   95.1   22.9
sqrt      45.7     52.1     34.1     93.9    648    107
exp       15.8     30.7      5.7     64.3    250    137
log       15.1     30.8      5.2     59.8    183   88.5
tan        9.9     18.9      5.5     35.7   80.8   21.1
asin      13.3     10.4     10.2     26.6    110   29.2
sinh      10.7      2.3      2.3     19.5   86.2   19.1

The following table compares the performance (Mflops) of a simple FORTRAN matrix (REAL*8 400x400) multiply with the performance of DGEMM from the vendor math library (-lcxml for the Alpha, -lsci for the X1, -lessl for the SP). Note that the SP4 -lessl (3.3) is tuned for the Power4. Also, the Mflops for 1000x1000 Linpack are reported from netlib, except the sp4 number, which is from IBM. The Altix number is with libgoto (not scs). (Revised 10/20/03)

          altix  alpha    sp3    sp4     X1
ftn         228     72     45    220   7562
lib        5222   1182   1321   3174   9482
linpack       -   1031   1236   2894   3955

In the following graph, the performance of the vendor library version of DGEMM is illustrated. The plot includes data from the new Compaq ES45 (1 GHz). The p690 achieves only 65% of peak because of insufficient rename registers. The X1, Alpha, and sp3 get a much higher percentage of peak. Revised 10/20/03
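The Mflops figures for matrix multiply depend on the flop count that is assumed. The sketch below uses the common 2n^3 convention for an n x n multiply and a plain triple loop; it illustrates the accounting only and is neither the FORTRAN test code nor a stand-in for DGEMM.

```python
import time


def matmul(a, b):
    """Naive n x n matrix multiply on lists of lists (i-k-j loop order)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for k in range(n):
            aik = a[i][k]
            row_b = b[k]
            for j in range(n):
                row_c[j] += aik * row_b[j]
    return c


def mflops(n, seconds):
    """Mflops for an n x n multiply, assuming the 2*n**3 flop convention."""
    return 2.0 * n ** 3 / seconds / 1e6


n = 60  # kept small because this is interpreted Python
a = [[1.0] * n for _ in range(n)]
start = time.perf_counter()
matmul(a, a)
print(f"{mflops(n, time.perf_counter() - start):.1f} Mflops")
```

Under this convention a 400x400 multiply is 128 million flops, so a rate of 228 Mflops corresponds to roughly 0.56 seconds per multiply.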

Rice's libgoto_it2-r0.7 and the Intel Math Kernel Library (mkl) get higher DGEMM performance than -lscs, as illustrated in the following plot. Revised 10/20/03

The following graph compares the vendor library implementation of an LU factorization (DGETRF) using partial pivoting with row interchanges. Revised 10/20/03

The following plots the performance of DAXPY for the various architectures and using the various runtime libraries on the Altix. The effect of the cache is apparent for both the X1 (2 MB) and Altix (6 MB). Revised 2/9/04

The following graph compares optimized FORTRAN performance (no sci/essl/cxml) for Euroben mod2a, matrix-vector dot product and product. Revised 10/20/03

The following table compares the single processor performance (Mflops) of the Alpha and IBMs for the Euroben mod2g, a 2-D Haar wavelet transform test. (Revised 10/20/03)

|-------------------------------------------------------------------------|
|     Order     |   alpha   |    SP3    |    SP4    |    X1    |  Altix   |
|  n1   |  n2   | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s)| (Mflop/s)|
|-------------------------------------------------------------------------|
|   16  |   16  |  142.56   |  79.629   |  126.42   |   10.5   |  150.4   |
|   32  |   16  |  166.61   |  96.690   |  251.93   |   13.8   |  192.1   |
|   32  |   32  |  208.06   |  115.43   |  301.15   |   20.0   |  262.3   |
|   64  |   32  |  146.16   |  108.74   |  297.26   |   22.7   |  252.7   |
|   64  |   64  |  111.46   |  111.46   |  278.45   |   25.9   |  242.5   |
|  128  |   64  |  114.93   |  101.49   |  251.90   |   33.3   |  295.6   |
|  128  |  128  |  104.46   |  97.785   |  244.45   |   48.5   |  350.2   |
|  256  |  128  |  86.869   |  64.246   |  179.43   |   45.8   |  211.2   |
|  256  |  256  |  71.033   |  44.159   |  103.52   |   46.7   |  133.3   |
|  512  |  256  |  65.295   |  41.964   |  78.435   |   52.1   |  168.7   |
|-------------------------------------------------------------------------|

The following plots the performance (Mflops) of Euroben mod2b, a dense linear system test, for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl). Revised 10/20/03

The following plots the performance (Mflops) of Euroben mod2d, a dense eigenvalue test, for both optimized FORTRAN and using the BLAS from the vendor library. For the Alpha, -O4 optimization failed, so this data uses -O3. Revised 10/20/03

The following plots the performance (iterations/second) of Euroben mod2e, a sparse eigenvalue test (no vendor libraries). At this time, the Cray FORTRAN compiler seems unable to either vectorize or stream this code. Revised 10/20/03

The following figure shows the FORTRAN Mflops for one processor for various problem sizes for the EuroBen mod2f, a 1-D FFT (complex to complex). The rate is for the transform only (no initialization time). Revised 10/20/03

The following compares a 1-D FFT using the FFTW benchmark. Altix uses ecc -O3. We were unable to run FFTW successfully on the Cray X1, in part, we suspect, because FFTW is targeted toward non-vector architectures.

The following graph plots 1-D FFT performance using the vendor library (-lscs, -lsci or -lessl), initialization time is not included. Revised 10/20/03

MESSAGE-PASSING BENCHMARKS

Internode communication can be accomplished with IP, PVM, or MPI. We report MPI performance over the Alpha Quadrics network and the IBM SP. Each SP node (4 CPUs) shares a single network interface. However, each CPU is a unique MPI end point, so one can measure both inter-node and intra-node communication. The following table summarizes the measured communication characteristics between nodes of the X1, Alpha, SP3, and the SP4. SP4 is currently based on Colony switch via PCI. Latency is for an 8-byte message. Altix consists of two 128-cpu images (or "nodes"). Unless otherwise noted, Altix message times are within a "node".

                         altix   alpha     sp3     sp4      X1
latency (1 way, us)        1.1     5.4    16.3      17     7.3  (3.8 SHMEM, 3.9 coarray)
bandwidth (echo, MBs)     1955     199     139     174   12125
MPI within a node         1968     622     512    2186       -

latency (min, 1 way, us) and bandwidth (MBs)

                      latency   bandwidth
altix cpu                 1.1        1968
altix node                1.1        1955
X1 node                   7.3       11776
X1 MSP                    7.3       12125
alpha node                5.5         198
alpha cpu                 5.8         623
alpha IP-sw               123          77
alpha IP-gigE/1500         76          44
alpha IP-100E              70          11
sp3 node                 16.3         139
sp3 cpu                   8.1         512
sp4 node                    7        1400  (Federation)
sp4 node                    6        1702  (Federation dual rail)
sp4 node                   17         174  (PCI/Colony)
sp4 cpu                     3        2186
sp3 IP-sw                  82          46
sp3 IP-gigE/1500           91          47
sp3 IP-gigE/9000          136          84
sp3 IP-100E                93          12
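A latency and a bandwidth figure can be combined into a rough estimate of transfer time for any message size. The sketch below assumes the simple linear model (time = latency + bytes/bandwidth); real curves have protocol switch points that this model ignores, and the function names are hypothetical.

```python
def transfer_time_us(nbytes, latency_us, bandwidth_mbs):
    """Estimated one-way time in microseconds under a linear cost model.

    With 1 MB = 10**6 bytes, a rate in MB/s equals bytes per microsecond.
    """
    return latency_us + nbytes / bandwidth_mbs


def half_performance_bytes(latency_us, bandwidth_mbs):
    """Message size at which effective bandwidth is half the asymptotic rate."""
    return latency_us * bandwidth_mbs


# Altix and SP3 node-to-node figures from the table above.
for name, lat, bw in [("altix", 1.1, 1955), ("sp3", 16.3, 139)]:
    n_half = half_performance_bytes(lat, bw)
    print(f"{name}: n_1/2 = {n_half:.0f} bytes, "
          f"1 MB takes {transfer_time_us(1_000_000, lat, bw):.0f} us")
```

Messages much smaller than the half-performance size are dominated by latency; messages much larger are dominated by bandwidth.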

The following graph shows bandwidth for communication between two processors on the same node using MPI from both EuroBen's mod1h and ParkBench comms1. Within a node, shared memory can be used by MPI. Revised 10/21/03

The sp4 is presently equipped with a Colony switch for inter-node communication but is limited by the PCI interface at this time (May, 2002). Altix inter-node bandwidth (2x128-cpu OS images) is about half intra-node (see shared-memory plot below).

The following graph shows the minimum latency (one-way, i.e., half of RTT) for an 8-byte message from CPU 0 to the other CPUs. The red is for our older 64-processor configuration, the green is for 128 CPUs, and the blue is our 2x128 configuration. The Altix MPI uses the distributed memory for MPI even across the two (128 CPU) system images. Revised 2/14/05

The following figure compares the effect of dplace when running the same latency test.

The HALO benchmark is a synthetic benchmark that simulates the nearest neighbour exchange of a 1-2 row/column "halo" from a 2-D array. This is a common operation when using domain decomposition to parallelize (say) a finite difference ocean model. There are no actual 2-D arrays used, but instead the copying of data from an array to a local buffer is simulated and this buffer is transferred between nodes. The following compares the performance of MPI and OpenMP using 9 and 16 processors on the Altix. Revised 10/8/03
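The pack, transfer, and unpack pattern that HALO times can be sketched in a single process. Unlike the benchmark, this illustration does use small real arrays, and it assumes a 1-D decomposition with periodic neighbours and a 1-row halo; the list copies stand in for the messages.

```python
def exchange_halos(tiles):
    """Periodic 1-row halo exchange between vertically stacked tiles.

    Each tile is a list of rows: row 0 and row -1 are halo rows and the
    rows in between are interior. Tiles are modified in place.
    """
    p = len(tiles)
    # Pack: copy each tile's first and last interior rows into send buffers.
    send_up = [list(t[1]) for t in tiles]
    send_down = [list(t[-2]) for t in tiles]
    # "Transfer" and unpack into the neighbours' halo rows.
    for rank, tile in enumerate(tiles):
        up = (rank - 1) % p
        down = (rank + 1) % p
        tile[0] = send_down[up]   # last interior row of the tile above
        tile[-1] = send_up[down]  # first interior row of the tile below


tiles = [
    [[0, 0], [1, 1], [2, 2], [0, 0]],
    [[0, 0], [3, 3], [4, 4], [0, 0]],
]
exchange_halos(tiles)
print(tiles)
```

In an MPI version the two buffer lists become send and receive calls to the neighbouring ranks, and the pack and unpack copies are the part that HALO simulates.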

For comparison, we have included the Halo result for the X1 and ORNL's SP4 in the following table from Wallcraft ('98). (Revised 7/7/03)

LATENCY (us)
MACHINE        CPUs  METHOD     N=2  N=128
------------------------------------------
Cray X1          16  co-array    36     31
IBM SP4          16  MPI         27     32
SGI Altix        16  SHMEM       14     40
Cray X1          16  SHMEM       35     47
SGI Altix        16  OpenMP      15     48
Cray T3E-900     16  SHMEM       20     68
SGI Altix        16  MPI         19     72
SUN E10000       16  OpenMP      24    102
Cray X1          16  MPI         91    116
SGI O2K          16  SHMEM       36    113
SGI O2K          16  OpenMP      33    119
IBM SP4          16  OpenMP      58    126
HP SPP2000       16  MPI         88    209
IBM SP           16  MPI        137    222
SGI O2K          16  MPI        145    247

The Halo benchmark also compares various algorithms within a given paradigm. The following compares the performance using various MPI methods on 16 MSPs for different problem sizes. Revised 10/7/03.

The following graph plots the bandwidth for doing an exchange of messages for various message sizes. The MPI implementation uses a repetition of IRECV/SEND/IWAIT for each message size; the SHMEM and co-array versions do a repetition of PUTs and then a synch. The Altix MPI uses MPI_BUFFER_MAX 2048. Revised 10/21/03.

The following pair of graphs shows aggregate exchange bandwidth when 1 pair of processors and 64 pairs of processors do an exchange (processor i exchanges with i+n/2). The Altix SHMEM test is probably unrealistic in that data is not invalidated in the cache as the PUTs are repeated in the timing loop. Revised 4/4/04

The following graph compares MPI for the HALO exchange on 4 and 16 processors. Revised 10/8/03

The following table shows the performance of aggregate communication operations (barrier, broadcast, sum-reduction) using one processor per node (N) and all processors on each node (n). Recall that the sp4 has 32 processors per node (the sp3 and alpha, 4 per node). Communication is between MSPs on the X1 except for the UPC data. Times are in microseconds. (Revised 10/7/03)

mpibarrier (average us)
                                                 ------------ X1 -----------
cpus alpha-N alpha-n   sp3-N   sp3-n   sp4-n     mpi   shmem coarray     upc   Altix
   2       7      11      22      10       3       3     3.0     3.2     6.1       2
   4       7      16      45      20       5       3     3.2     3.4     7.1       4
   8       8      18      69     157       7       5     4.8     4.9     8.5       7
  16       9      21      93     230       9       6     5.6     5.8     7.3       9
  32      11      28     118     329      10       5     8.4     9.2    11.0      12
  64       -      37     145     419       -       5     9.7    10.0    12.1      14

mpibcast (8 bytes)
                                                 ------------ X1 -----------
cpus alpha-N alpha-n   sp3-N   sp3-n   sp4-n     mpi   shmem coarray     upc   Altix
   2     9.6    12.5     5.4     6.7     3.2     5.9     1.4     0.3       0       2
   4    10.4    20.3     9.4     9.4     6.2     7.2     4.1     0.8     0.5     2.8
   8    11.4    28.5    13.4    17.5     8.4    10.5    10.0     1.2     1.0     3.5
  16    12.5    32.9    17.0    20.9     9.8    16.3    20.4     1.9     1.2     4.6
  32    13.8    41.4    19.3    24.1    11.3    27.5    41.6     4.0     1.5     5.7
  64       -    48.7    23.6    30.8       -    48.1      83     7.9     2.7     7.3

mpireduce (SUM, doubleword)
cpus alpha-N alpha-n   sp3-N   sp3-n   sp4-n      X1   Altix
   2       9      11       8       9       6       8       2
   4     190     207      29     133       9      11       4
   8     623     350     271     484      13      15       5
  16    1117     604     683    1132      18      19       7
  32    3176    1991    1613    2193      29      23       9
  64       -    5921    2841    3449       -      31      12

The following compares the times for the collective MPI_allreduce (doubleword sum). Revised 10/7/03

A simple bisection bandwidth test has N/2 processors sending 1 MB messages to the other N/2. (Revised 2/13/05).

Aggregate datarate (MBs)
cpus     sp4   alpha      X1   Altix
   2     138     195   12412    7990
   4     276     388   16245   15986
   8     552     752   15872   31740
  16    1040    1400   32626   62472
  32       -    3510   29516  124064
  48       -       -   35505       -
  64       -       -   55553  242848
  96       -       -   44222       -
 128       -       -   59292  473600
 200       -       -  139536       -
 252       -       -  168107       -
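One way to read the aggregate figures is to divide by the number of communicating pairs and compare against the single-pair rate. The sketch below does this for the Altix values in the table; the function names and the notion of "scaling efficiency" used here are illustrative choices, not part of the original test.

```python
# Altix aggregate rates (MB/s) copied from the bisection table.
ALTIX_AGGREGATE = {2: 7990, 4: 15986, 8: 31740, 16: 62472,
                   32: 124064, 64: 242848, 128: 473600}


def per_pair_mbs(cpus, aggregate_mbs):
    """Average rate per sender/receiver pair (N/2 pairs for N cpus)."""
    return aggregate_mbs / (cpus // 2)


def scaling_efficiency(table, cpus):
    """Per-pair rate at `cpus` relative to the single-pair (2-cpu) rate."""
    return per_pair_mbs(cpus, table[cpus]) / per_pair_mbs(2, table[2])


for cpus in sorted(ALTIX_AGGREGATE):
    print(f"{cpus:4d} cpus: {per_pair_mbs(cpus, ALTIX_AGGREGATE[cpus]):7.0f} MB/s per pair, "
          f"efficiency {scaling_efficiency(ALTIX_AGGREGATE, cpus):.2f}")
```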

The following compares the aggregate MPI bandwidth for processor pairs doing an exchange, where node i exchanges with node i+n/2. For smaller messages, the Altix outperforms the X1. (For X1-n, n represents the number of processors.) The second figure shows the effective per-pair bandwidth exchange data rate. Revised 3/8/04

Preliminary testing of TCP/IP performance over the local LAN showed that the Altix GigE interfaces could run TCP at 570 Mbs. We have experimented with Web100/net100 modifications to the Altix Linux kernel to accelerate wide area TCP performance.

SHARED-MEMORY BENCHMARKS

Both the Alpha and IBMs consist of a cluster of shared-memory nodes, each node with four processors sharing a common memory (16 for X1 and 32 for sp4). The X1 is cache-coherent within a node, but the memory space is global across all nodes. The Altix has a global shared memory. We tested the performance of a shared-memory node with various C programs with explicit thread calls and with FORTRAN OpenMP codes.

The following table shows the performance of thread/join in C as the master thread creates two, three, and four threads. The test repeatedly creates and joins threads.

thread create/join time in microseconds (C)
threads   alpha    sp3    sp4  altix
      2    47.7     96     44    399
      3     165    152     68    842
      4     251    222     97   1241

Often, it is more efficient to create the threads once, and then provide them work as needed. I suspect this is what FORTRAN OpenMP is doing for "parallel do". The following table is the performance of parallel do. Revised 10/20/03

OPEN MP parallel DO (us)
threads   alpha    sp3    sp4  altix
      2     2.1   12.7    6.3    4.8
      3     3.4   15.3    8.4    6.3
      4     5.2   19.5    9.5    6.5

Also see the OpenMP microbenchmarks at the end of this section.

The following table shows the time required to lock-unlock using pthread_mutex_lock with various number of threads. For the IBMs we use setenv SPINLOOPTIME 5000. Revised 10/20/03

time for lock/unlock (us)
threads   alpha    sp3    sp4  altix
      1    0.26    0.6    0.3   0.07
      2     1.5    1.4    1.3    2.6
      3    17.8    2.1    1.6   41.5
      4    29.6    2.9    3.8   73.2

The accompanying graph compares the time for locking/unlocking a System V semaphore when competing with other processors.

The following table compares the performance of simple C barrier program using a single lock and spinning on a shared variable along with pthread_yield. Revised 10/20/03

C barrier times (us)
threads   alpha    sp3    sp4  altix
      1    0.25    0.6    0.3   0.08
      2    1.36    4.4    1.9    0.5
      3     9.9   20.5    3.1   18.1
      4      65   34.6    3.7   53.2

The following table illustrates linear speedup for an embarrassingly parallel integration. A C code with explicit thread management is compared with FORTRAN OpenMP. Both just used -O optimization. Revised 10/20/03

rectangle rule (Mflops) -O optimization
               fortran OpenMP                  C threads
threads   alpha    sp3    sp4  altix     alpha    sp3    sp4  altix
      1     252    102    251    891       166     52    216    558
      2     502    204    501   1775       331    104    432   1114
      3     748    306    752   2312       496    157    648   1668
      4     990    408   1002   3519       657    206    864   2221
      8       -      -   1999   6815         -      -   1725   4410
     16       -      -   3565  12039         -      -   3429   8580

The following table illustrates an explicit thread implementation of Cholesky factorization of a 1000x1000 double precision matrix in C (-O optimization). Revised 10/20/03

cholp 1k matrix factor (mflops) -O optimization
threads   alpha    sp3    sp4  altix
      1     150    125    350    196
      2     269    238    631    341
      3     369    353   1007    512
      4     435    390   1306    621
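The rectangle-rule integration mentioned above is embarrassingly parallel because each thread sums its own slice of rectangles and only the partial sums are combined. The sketch below assumes the integrand 4/(1+x^2) on [0,1] (whose integral is pi) and midpoint rectangles; the source does not state which integrand the test codes use. Note that CPython threads will not show the speedup seen with C threads or OpenMP; the sketch only shows the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor


def f(x):
    return 4.0 / (1.0 + x * x)  # assumed integrand; integral over [0, 1] is pi


def partial_sum(args):
    """Sum of midpoint rectangle heights for rectangles start..stop-1."""
    start, stop, h = args
    return sum(f((i + 0.5) * h) for i in range(start, stop))


def integrate(n, threads):
    """Rectangle-rule integral of f over [0, 1] with n rectangles."""
    h = 1.0 / n
    chunk = (n + threads - 1) // threads
    ranges = [(s, min(s + chunk, n), h) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return h * sum(pool.map(partial_sum, ranges))


print(integrate(100_000, 4))
```

No data is shared between the slices, which is why the table shows near-linear speedup on every machine.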

The following graph compares the echo (ping-pong) bandwidth of SHMEM (puts) and MPI (ParkBench comms1) between two Altix CPUs. MPI performance is improved with setenv MPI_BUFFER_MAX 2048. The graph shows performance within a node and between nodes. Revised 10/20/03

The following graph illustrates the aggregate SHMEM performance for a 1 MB put and get to processor 0, which is doing a sleep(). The graph also illustrates the memory contention when one processor is running STREAM (triad data plotted) at the same time as 0 or more processors are doing continuous 1 MB SHMEM puts or gets with the STREAM processor. With previous software, behavior was quite strange, but now little interference is exhibited. Contrast this with our Cray X1 results. Revised 10/21/03

The following plot illustrates a SHMEM GET hotspot, where one or more processors are all trying to fetch the same 64-bit word from processor 0. The Y-axis is the average time (microseconds) for the SHMEM GET. Updated 7/9/04

The following illustrates the time to pass a 64-bit "hot potato" from one processor to the next using SHMEM PUT with each processor spinning on the "volatile shared" variable. The Y-axis is the average time for a single revolution. Updated 7/9/04

The following graph compares the HALO exchange times on 4 and 16 processors. The IBM is using OpenMP and the X1 co-arrays. Revised 10/8/03

HALO performance on 16 processors is illustrated in the next plot.

The following table compares FORTRAN OpenMP for the Altix, Alpha, and SP with co-arrays on the X1 when doing a simple, double-precision Jacobi iteration (1000x1000, tolerance = 10^-6). Revised 10/20/03

iterations per second
CPUs   alpha    sp3    sp4     X1  Altix
   1      27     17     62    217     52
   2      42     27    117    421    104
   3      50     41    160      -    164
   4      61     41    198    724    285

We've also done some testing with the OpenMP microbenchmarks. The following compares OpenMP performance between the X1 (SSPs), sp4, and the SGI Altix. The biggest OpenMP configuration for the X1 is 16 SSPs (or 4 MSPs) and 32 CPUs for the IBM p690. OpenMP on the Altix can use all of the processors. Revised 12/23/03
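The Jacobi iteration timed in the table above is a nearest-neighbour averaging sweep repeated until the grid stops changing. The sketch below is a serial illustration on a small grid; the boundary values and the use of the largest pointwise change as the convergence test are assumptions, since the source gives only the grid size and tolerance.

```python
def jacobi(n, tol=1e-6, max_iters=10000):
    """Jacobi sweeps for Laplace's equation on an n x n grid.

    Assumed test problem: top edge held at 1.0, other edges at 0.0.
    Returns (grid, iterations performed).
    """
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n
    for iteration in range(1, max_iters + 1):
        new = [row[:] for row in u]
        diff = 0.0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1])
                diff = max(diff, abs(new[i][j] - u[i][j]))
        u = new
        if diff < tol:  # assumed criterion: largest pointwise change
            return u, iteration
    return u, max_iters


grid, iterations = jacobi(16)
print(f"converged after {iterations} sweeps")
```

Each sweep reads only the previous grid, so the outer loop over rows can be split across threads with a single barrier (and a reduction on the change) per sweep, which is the structure an OpenMP or co-array version exploits.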

PARALLEL KERNEL BENCHMARKS

Both ParkBench and EuroBen (euroben-dm) had MPI-based parallel kernels. However, the euroben-dm communication model was to have the processes do all of their send's before issuing receive's. On the SP, this model resulted in deadlock for the larger problem sizes. The EAGER_LIMIT can be adjusted to make some progress on the SP3 but the deadlocks could not be completely eliminated, so we report only ParkBench MPI results.

The following table shows MPI parallel performance of the ParkBench LU (64x64x64) and FT benchmarks (256x256x128) for the Altix, X1, Alpha and SP. This is a small problem size, and doesn't permit the X1 to fill the vector pipes. These tests used standard FORTRAN (no vendor libraries). Revised 10/20/03

LU aggregate Mflops
cpus   alpha    sp3    sp4     X1  Altix
   2     762    588   1377    639   2526
   4    1604   1188   2660   1339   4679
   8    3265   2473   5310   2394   8810
  16    5556   4771   9531   3884  17654

FT.A
cpus   alpha    sp3    sp4     X1  Altix
   4     580    307   1314   1069   2208
   8     849    553   2351   1986   3910
  16     019   1056   3603   3732   7263

The following plots the aggregate Mflop performance for ParkBench QR factorization (MPI) of 1000x1000 double precision matrix using the vendor scientific libraries (scs/essl/cxml/sci). This benchmark uses BLACS (SCALAPACK). The small problem size results in small vectors and poor X1 performance. These results are using the ParkBench version of BLACS. (Revised 10/20/03)

As a further test of SCALAPACK performance, we compare the vendor libraries for matrix multiply (pdgemm) and LU factorization (pdgetrf) of 8000x8000 double precision matrices using a blocksize of 32. The Cray X1 does well on the distributed matrix multiply, but not on the LU factorization. The Altix uses -lscs and netlib.org's scalapack library with BLACS/MPI. Revised 10/20/03

The following plot shows the performance of high-performance Linpack (HPL) on 16 processors for the Cray X1 and IBM p690 with MPI and the vendor BLAS. HPL solves a (random) dense linear system in double precision (64 bits) using:

two-dimensional block-cyclic data distribution
right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths
recursive panel factorization with pivot search and column broadcast combined
various virtual panel broadcast topologies
bandwidth reducing swap-broadcast algorithm
backward substitution with look-ahead of depth 1

Cray has reported 90% of peak using SHMEM instead of MPI. Revised 10/20/03
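The two-dimensional block-cyclic distribution used by HPL (and by SCALAPACK) deals fixed-size blocks of the matrix round-robin over a process grid in both dimensions. The sketch below shows the standard index mapping; the 4x4 grid and block size of 32 in the example are assumptions chosen to match 16 processors, not the settings used for the plotted runs.

```python
def block_cyclic_owner(i, j, nb, p, q):
    """Process-grid coordinates (row, col) owning global entry (i, j).

    nb is the block size; the process grid is p rows by q columns.
    """
    return (i // nb) % p, (j // nb) % q


def local_index(g, nb, nprocs):
    """Local index of global index g along one dimension of the grid."""
    block = g // nb
    return (block // nprocs) * nb + g % nb


# Hypothetical example: 4 x 4 process grid, block size 32.
print(block_cyclic_owner(32, 64, nb=32, p=4, q=4))  # block row 1, block column 2
print(local_index(128, nb=32, nprocs=4))            # second local block, offset 0
```

Because consecutive blocks land on different processes, the shrinking trailing matrix of the LU factorization stays spread over the whole grid instead of concentrating on a few processes.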

The following graph shows the aggregate Mflops for a multi-grid (MG) kernel from ParkBench/NAS Parallel Benchmark. This is for a 256x256x256 doubleword grid with MPI and Wallcraft's co-array version and also OpenMP on the IBM. Revised 10/20/03

The following graph shows the aggregate Mflops for a conjugate gradient (CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP. Revised 10/20/03

We also ran the OpenMP version of the NAS Parallel Benchmarks (PBN-O-3.0b4). The following table compares the performance of three of those benchmarks on the power4 to the NERSC Power3 (seaborg, 16-way shared memory, 375 MHz). Revised 10/20/03

aggregate Mflops
               lu.A                   ft.A                   sp.A
CPUs    sp3    sp4  altix      sp3    sp4  altix      sp3    sp4  altix
   2    675   1466   2598      356   1274   1940      427   1300   1651
   4   1356   2974   3477      695   2259   3164      868   2379   2435
   8   2231   6370   5019     1339   4166   5405     1724   4264   3635
  16   2386  12148   6210     2343   6860   8779     2667   7476   5122

links

altix
Altix numa architecture
Altix linux
ORNL Altix news release
itanium 2
Intel mkl math kernel library
Altix shared-memory architecture NUMALink and here
3000 ccNUMA architecture router, cables, CRC
SGI shared-memory architecture
altix performance or here
Altix vs IBM p690 paper
altix os linux
altix nwchem performance
older SGI spider routing chip for Origin 2000
SGI scsl library
scalapack
NAS parallel benchmarks
mac vector benchmarks
Wallcraft's HALO results and info
NAS MG Benchmark OpenMP, MPI, SHMEM, co - array comparison
Rice co-array fortran caf
new hpcc benchmarks
FFT info and blacs fft and C fftw.org
DGEMM on sp3, NERSC
Intel fortran manual efc

See Cray X1 and Cray XD1 and Power 4 results. Also see AMD opteron results.

Research Sponsors

Mathematical, Information, and Computational Sciences Division, within the Office of Advanced Scientific Computing Research of the Office of Science, Department of Energy. The application-specific evaluations are also supported by the sponsors of the individual applications research areas.