ORNL Cray XD1 Evaluation
This is a work in progress; see the revision history below for the most recent updates.
ORNL is evaluating the
Cray XD1
(formerly OctigaBay), an Opteron-based cluster interconnected
with Infiniband.
Revision history:
2/24/05 faster streams numbers with AMD tips for pgf77
12/3/04 libgoto vs -lacml for dgemm/lu, NPB CG,FT with pg compile
12/2/04 recompiles and re-test with pgi/5.2-4, parkbench QR, MAPS
11/24/04 gpshmem
11/23/04 exchange on 64, halo update
11/16/04 64 processor tests
11/8/04 mpi clock synch tests -- not synchronized
11/4/04 6 chassis tests
11/2/04 MPI tests
11/1/04 ParkBench, euroben-dm, hpl
10/29/04 initial XD1 tests (3 chassis, 34 CPUs)
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
An IBM SP4, an IBM Winterhawk II (noted as SP3 in the following tables
and graphs), a Cray X1, an SGI Altix (Itanium 2), an Opteron cluster using
Quadrics, and a Compaq Alpha ES40 at ORNL were used for comparison with the
XD1.
The present ORNL XD1 test cluster consists of 18 two-way SMP
Opteron nodes (2.2 GHz) with 4 GB of memory each.
One of the SMPs is partitioned as the compile node, leaving 34 processors
for MPI testing.
The X1 at ORNL has 8 nodes. Each node has 4 MSPs, each MSP has 4 SSPs,
and each SSP has two vector units.
The Cray "processor/CPU" in the results below is one MSP.
All 4 MSPs on a node share memory.
The Power4 consists of one node with 32 processors
sharing memory.
Both the Alpha and SP3 consist of four processors sharing memory
on a single node.
All memory is accessible on the Altix.
The following table summarizes the main characteristics of
the machines:
Specs:          Alpha SC   SP3        SP4        X1          XD1        Altix
MHz             667        375        1300       800         2200       1500
memory/node     2 GB       2 GB       32 GB      16 GB       4 GB       512 GB
L1 cache        64 KB      64 KB      32 KB      16 KB       64 KB      32 KB
L2 cache        8 MB       8 MB       1.5 MB     2 MB        1 MB       256 KB
L3 cache        --         --         128 MB     --          --         6 MB
peak Mflops     2*MHz      4*MHz      4*MHz      12.8 Gflops 2*MHz      4*MHz
peak mem BW     5.2 GB/s   1.6 GB/s   200+ GB/s  200+ GB/s   6.4 GB/s   6.4 GB/s

alpha: 2 memory buses @ 2.6 GB/s each
X1 memory bandwidth is 34 GB/s/CPU.
For the Alpha, nodes are interconnected with a
Quadrics switch
organized as a fat tree.
The SP nodes are interconnected with cross-bar switches
in an Omega-like network.
The X1 uses a modified 2-D torus.
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of the XD1.
Some of the older benchmarks may need to be modified for these
newer, faster machines -- increasing repetitions to avoid zero elapsed times
and increasing problem sizes to test out-of-cache performance.
Unless otherwise noted, the following compiler switches
were used:
XD1: -O3 (pgf90 v5.2-2)
X1: -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries: sci (X1),
cxml (Alpha), acml (Opteron),
and essl (SP).
We used the following benchmarks in our tests:
- ParkBench 2.1 --
provides low-level sequential and communication benchmarks,
parallel linear algebra benchmarks,
NAS parallel benchmarks,
and compact application codes.
Here is a summary of the benchmark modules.
Codes are in FORTRAN.
Results are often reported as a least-squares fit of the data;
we report actual performance numbers.
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalue, FFT, QR).
Here is a summary of the benchmark modules.
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
Also we use the
MAPS memory benchmark.
The web sites include results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For both the Alpha and the SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
Both have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits wide, so it rolls over in less than 7 seconds.
For distributed benchmarks (MPI), the IBM and Alpha systems provide a hardware
synchronized MPI_Wtime() with microsecond resolution.
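For reference, the wall-clock timing in the custom benchmarks is of the
following general form; a minimal C sketch using gettimeofday() (the helper
name and repetition count here are ours):

#include <stdio.h>
#include <sys/time.h>

/* wall-clock seconds from gettimeofday() -- microsecond resolution
   where the OS supports it (e.g., MICROTIME on the Alpha kernel) */
static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    double t0, t1, sum = 0.0;
    int i, reps = 1000000;

    t0 = walltime();
    for (i = 0; i < reps; i++)     /* enough repetitions to avoid */
        sum += (double)i * 0.5;    /* a zero elapsed time         */
    t1 = walltime();

    printf("elapsed %g s (sum %g)\n", t1 - t0, sum);
    return 0;
}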
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
The aggregate data rate for multiple threads is reported
in the following table.
Recall that the "peak" memory data rate for the
X1 is 200 GB/s, for the Alpha 5.2 GB/s,
and for the SP3 1.6 GB/s.
Data for the 16-way SP3 (375 MHz, Nighthawk II)
is included too.
Data for the Alpha ES45 (1 GHz) is obtained from the
streams data base.
Data for p690/sp4 is with affinity enabled (6/1/02).
The X1 uses (aprun -A).
The Opteron is supposed to have more than 6.4 GB/s of memory bandwidth per
CPU, but we do not yet see that with stream_d.c.
MBs
copy scale add triad
xd1 2427 2369 2577 2598 pgcc -O3 pgi/5.2-4
xd1 2577 2491 2662 2641 pgf77 -O3 stream_d.f
xd1 2808 2793 3192 3186 pgf77 -O3 -mp -Mnontemporal -fastsse -Munsafe_par_align -o stream_d
xd1-2 7494 7473 8644 8553 2 threads/cpu's
xd1 3134 3169 3213 3217 PathScale compiler (from STREAMS site)
xd1 4098 4091 4221 4049 PathScale -O3 -LNO:prefetch=2 -LNO:prefetch_ahead=9 -LNO:fusion=2 -CG:use_prefetchnta=on
xt3 2639 2527 2652 2788 pgf77 just O3 stream_d.f (2.4 GHz)
xt3 3930 3791 4211 4308 pgf77 -O3 -mp -Mnontemporal -fastsse -Munsafe_par_align
xt3 3459 3747 3968 3968 reported by Sandia 2/23/05
xt3 4924 4928 4868 4830 reported by AMD 2.4 GHz 2/8/05 pathscale
xt3 4935 4934 5053 5069 pathscale -O3 -LNO:prefetch=2 -LNO:prefetch_ahead=9 -LNO:fusion=2 -CG:use_prefetchnta=on
altix 3214 3169 3800 3809
X1 22111 21634 23658 23752
alpha1 1339 1265 1273 1383
es45-1 1946 1941 1978 1978
SP3 1 523 561 581 583
SP3/16-1 486 494 601 601
SP4-1 1774 1860 2098 2119 p690
p655 3649 3767 2899 2913 ibm p655 11/3/04
From AMD's published SPEC benchmark results and McCalpin's suggested
conversion of 171.swim results to triad memory bandwidth, we get 2.7 GB/s
memory bandwidth for one Opteron processor.
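For reference, the four kernels measured by stream have the following form
(a minimal C sketch of copy/scale/add/triad, not the instrumented stream_d
source; the array size here is illustrative):

#include <stdio.h>
#define N 2000000            /* large enough to be well out of cache */

static double a[N], b[N], c[N];

int main(void)
{
    long j;
    double scalar = 3.0;

    for (j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    /* copy and scale move 2 words per iteration; add and triad move 3 */
    for (j = 0; j < N; j++) c[j] = a[j];                 /* copy  */
    for (j = 0; j < N; j++) b[j] = scalar * c[j];        /* scale */
    for (j = 0; j < N; j++) c[j] = a[j] + b[j];          /* add   */
    for (j = 0; j < N; j++) a[j] = b[j] + scalar * c[j]; /* triad */

    printf("%g %g %g\n", a[0], b[0], c[0]);  /* keep results live */
    return 0;
}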
The
MAPS benchmark also characterizes memory access performance.
Plotted are load/store bandwidth for sequential (stride 1) and random
access.
Load is calculated from s = s + x(i)*y(i) and store from
x(i) = s.
Revised 12/2/04
The tabletoy benchmark (C) makes random writes of 64-bit integers in
a shared-memory table;
parallelization is permitted, with possibly non-coherent updates.
The X1 number is for vectorizing the inner loop (multistreaming
was an order of magnitude slower, 88 MB/s).
Data rate in the following table is for a 268 MB table.
We include multi-threaded Altix, Opteron, SP3 (NERSC), and SP4 data as well.
Revised 11/22/04
MB/s (using wall-clock time)
 CPUs    sp4   altix   X1-MSP   opteron   sp3
   1      26      42     1190        44     8
   2      47      45        -        67    26
   4      98      62        -         -    53
   8     174      86        -         -    90
  16     266      69        -         -   139
  32     322      77        -         -     -
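A minimal serial C sketch of the tabletoy access pattern -- random
read-modify-write of 64-bit words in a large table (the random-number
generator and update count here are illustrative; the real benchmark
parallelizes the update loop):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define TABLE_WORDS (1UL << 25)       /* 2^25 * 8 bytes = 268 MB table */
#define NUPDATES    (4 * TABLE_WORDS)

int main(void)
{
    uint64_t *table = malloc(TABLE_WORDS * sizeof(uint64_t));
    uint64_t ran = 1, i;

    if (table == NULL) return 1;
    for (i = 0; i < TABLE_WORDS; i++) table[i] = i;

    for (i = 0; i < NUPDATES; i++) {
        /* simple 64-bit LCG drives the random index */
        ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
        table[ran % TABLE_WORDS] ^= ran;   /* random read-modify-write */
    }

    printf("table[0] = %llu\n", (unsigned long long)table[0]);
    free(table);
    return 0;
}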
The hint
benchmark measures computation and memory efficiency as
the problem size increases.
(This is C hint version 1, 1994.)
The following graph shows the performance of a single processor
for the XD1 (164 MQUIPS), X1 (12.2 MQUIPS),
Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS).
The L1 and L2 cache boundaries are visible, as well as the Altix and
SP4's L3.
Revised 10/29/04
Here are results from LMbench, a test of various OS
services.
LOW LEVEL BENCHMARKS (single processor)
The following table compares the performance of the Alpha, SP3, SP4, X1,
XD1, and Altix for basic CPU operations.
These numbers are the average Mflop/s from the first few kernels of EuroBen's mod1ac.
The 14th kernel (9th degree poly)
is a rough estimate of peak FORTRAN performance since it
has a high re-use of operands.
(Revised 10/29/04)
alpha sp3 sp4 X1 xd1 Altix
broadcast 516 368 1946 2483 1100 2553
copy 324 295 991 2101 733 1758
addition 285 186 942 1957 725 1271
subtraction 288 166 968 1946 720 1307
multiply 287 166 935 2041 726 1310
division 55 64 90 608 275 213
dotproduct 609 655 2059 3459 730 724
X=X+aY 526 497 1622 4134 1088 2707
Z=X+aY 477 331 1938 3833 1092 2632
y=x1x2+x3x4 433 371 2215 3713 1629 2407
1st ord rec. 110 107 215 48 181 142
2nd ord rec. 136 61 268 46 257 206
2nd diff 633 743 1780 4960 1636 2963
9th deg. poly 701 709 2729 10411 1576 5967
basic operations (Mflops) euroben mod1ac
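The 9th-degree polynomial kernel is essentially Horner's rule applied
element by element, which keeps the coefficients in registers; a minimal
C sketch (coefficients and problem size here are illustrative):

#include <stdio.h>

/* 9th-degree polynomial by Horner's rule: 9 multiplies and 9 adds per
   element, with the coefficients re-used from registers on every pass */
int main(void)
{
    static double x[10000], y[10000];
    double c[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};  /* illustrative */
    int i, n = 10000;

    for (i = 0; i < n; i++) x[i] = 0.001 * i;

    for (i = 0; i < n; i++) {
        double xi = x[i];
        y[i] = ((((((((c[9]*xi + c[8])*xi + c[7])*xi + c[6])*xi + c[5])*xi
                 + c[4])*xi + c[3])*xi + c[2])*xi + c[1])*xi + c[0];
    }

    printf("y[n-1] = %g\n", y[n-1]);
    return 0;
}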
The following table compares the performance of various intrinsics
(EuroBen mod1f).
For the SP, it also shows the effect of -O4 optimization versus -O3.
(Revised 10/29/04)
alpha sp3 -O4 sp3 -O3 sp4 -O4 X1 xd1 altix
x**y 8.3 1.8 1.6 7.1 49 7.4 13.2
sin 13 34.8 8.9 64.1 97.9 33.7 22.9
cos 12.8 21.4 7.1 39.6 71.4 21.8 22.9
sqrt 45.7 52.1 34.1 93.9 711 91.8 107
exp 15.8 30.7 5.7 64.3 355 26.8 137
log 15.1 30.8 5.2 59.8 185 39.2 88.5
tan 9.9 18.9 5.5 35.7 85.4 27.0 21.1
asin 13.3 10.4 10.2 26.6 107 23.4 29.2
sinh 10.7 2.3 2.3 19.5 82.6 18.2 19.1
intrinsics (Mcalls/s) euroben mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix multiply (REAL*8, 400x400) with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lsci for the X1,
-lessl for the SP).
Note, the SP4 -lessl (3.3) is tuned for the Power4.
Also reported are the Mflops for the 1000x1000 Linpack benchmark,
taken from netlib,
except that the SP4 number is from
IBM.
(Revised 10/29/04)
alpha sp3 sp4 X1 xd1 altix
ftn 72 45 220 7562 147 228
lib 1182 1321 3174 9482 3773 5222
linpack 1031 1236 2894 3955
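For reference, the "ftn" and "lib" rows differ only in whether the triple
loop is compiled directly or replaced by a library call. A minimal C sketch
of the two cases, assuming the conventional Fortran BLAS dgemm_ binding
(link against -lacml, -lessl, -lcxml, or -lsci as appropriate):

#include <stdio.h>

/* Fortran BLAS DGEMM: C := alpha*op(A)*op(B) + beta*C (column major) */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

#define N 400

int main(void)
{
    static double a[N*N], b[N*N], c[N*N], d[N*N];
    double alpha = 1.0, beta = 0.0;
    int i, j, k, n = N;

    for (i = 0; i < N*N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = d[i] = 0.0; }

    /* simple triple loop (the "ftn" case, in spirit), column-major order */
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            for (i = 0; i < N; i++)
                c[i + j*N] += a[i + k*N] * b[k + j*N];

    /* vendor library version (the "lib" case) */
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, d, &n);

    printf("c[0]=%g d[0]=%g\n", c[0], d[0]);
    return 0;
}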
The following plot compares the performance of
the scientific library DGEMM.
(Revised 10/29/04).
The following graph compares the vendor library implementation of
an LU factorization (DGETRF) using
partial pivoting with row interchanges.
Revised 11/1/04
The following graph shows there is little performance difference between
libgoto
and AMD's -lacml library.
Revised 12/3/04
The following plot compares the DAXPY performance of the Opteron and Itanium
(Altix) using vendor math libraries.
Revised 10/29/04
The following table compares the single
processor performance (Mflops) of the
Alpha, Altix, SP4, X1, and XD1 for the EuroBen mod2g,
a 2-D Haar wavelet transform test.
(Revised 10/29/04)
|--------------------------------------------------------------------------
| Order | alpha | altix | SP4 | X1 | xd1 |
| n1 | n2 | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s)|(Mflop/s)|
|--------------------------------------------------------------------------
| 16 | 16 | 142.56 | 150.4 | 126.42 | 10.5 | 364 |
| 32 | 16 | 166.61 | 192.1 | 251.93 | 13.8 | 466 |
| 32 | 32 | 208.06 | 262.3 | 301.15 | 20.0 | 468 |
| 64 | 32 | 146.16 | 252.7 | 297.26 | 22.7 | 498 |
| 64 | 64 | 111.46 | 242.5 | 278.45 | 25.9 | 351 |
| 128 | 64 | 114.93 | 295.6 | 251.90 | 33.3 | 285 |
| 128 | 128 | 104.46 | 350.2 | 244.45 | 48.5 | 198 |
| 256 | 128 | 86.869 | 211.2 | 179.43 | 45.8 | 130 |
| 256 | 256 | 71.033 | 133.3 | 103.52 | 46.7 | 138 |
| 512 | 256 | 65.295 | 168.7 | 78.435 | 52.1 | 118 |
|--------------------------------------------------------------------------
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library.
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
using the BLAS from the vendor library.
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
(Revised 10/29/04)
The following figure shows the FORTRAN Mflops for one processor for various
problem sizes
for the EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache effects are still apparent.
(Revised 10/29/04).
The following compares a 1-D FFT using the
FFTW benchmark.
The following graph plots 1-D FFT performance using the vendor
library (-lacml, -lscs, -lsci, or -lessl); initialization time is not included.
Revised 10/29/04
MESSAGE-PASSING BENCHMARKS
Internode communication can be accomplished with IP, PVM, or MPI.
We report MPI performance over the Alpha Quadrics network and the
IBM SP.
Each SP node (4 CPUs) shares a single network interface.
However, each CPU is a unique MPI endpoint, so one can measure
both inter-node and intra-node communication.
The following table summarizes the measured communication characteristics
between nodes
of the X1, Alpha, SP3, and the SP4.
The SP4 interconnect is currently the Colony switch attached via PCI.
The XD1 uses Infiniband hardware (2 GB/s links) with an interface to the Opteron HyperTransport.
Latency is for an 8-byte message.
(Revised 11/26/03)
alpha sp3 sp4 X1 XD1
latency (1 way, us) 5.4 16.3 17 7.3 1.5 (X1 3.8 SHMEM, 3.9 coarray)
bandwidth (echo, MBs) 199 139 174 12125 1335
MPI within a node 622 512 2186 966
latency (min, 1 way, us) and bandwidth (MBs)
                latency (min 1-way, us)   bandwidth (MBs)
XD1 1.5 1335
XD1 in SMP 1.5 962
X1 node 7.3 11776
X1 MSP 7.3 12125
altix cpu 1.1 1968
altix node 1.1 1955
alpha node 5.5 198
alpha cpu 5.8 623
alpha IP-sw 123 77
alpha IP-gigE/1500 76 44
alpha IP-100E 70 11
sp3 node 16.3 139
sp3 cpu 8.1 512
sp4 node 7 975 (Federation)
sp4 node 6 1702 (Federation dual rail)
sp4 node 17 174 (PCI/Colony)
sp4 cpu 3 2186
sp3 IP-sw 82 46
sp3 IP-gigE/1500 91 47
sp3 IP-gigE/9000 136 84
sp3 IP-100E 93 12
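The latency and echo-bandwidth numbers are measured with simple echo
(ping-pong) tests of the following general form; a minimal MPI sketch
(message size and repetition count are illustrative; run with at least
two ranks):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define REPS   1000
#define NBYTES 8              /* 8 bytes for latency; 1024*1024 for bandwidth */

int main(int argc, char **argv)
{
    int rank, i;
    static char buf[NBYTES];
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, NBYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {          /* rank 0 sends, rank 1 echoes it back */
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / REPS;   /* average round-trip time */
        printf("%d bytes: one-way %g us, echo bandwidth %g MB/s\n",
               NBYTES, 0.5 * rtt * 1.0e6, 2.0 * NBYTES / rtt / 1.0e6);
    }
    MPI_Finalize();
    return 0;
}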
The following graph shows bandwidth for communication between two
processors on the same node using MPI
from both EuroBen's mod1h and ParkBench comms1.
Within a node, shared memory can be used by MPI.
(Revised 11/1/04).
Interestingly, on the XD1, MPI is slower within an SMP than across the Infiniband.
The following graph shows the minimum MPI latency (one-way, i.e., half of the RTT) for an 8-byte message from CPU 0 to the other CPUs for the Altix, Cray X1, and XD1.
Revised 3/31/05 (new altix data).
The
HALO benchmark is a synthetic benchmark that simulates the nearest neighbour
exchange of a 1-2 row/column "halo" from a 2-D array. This is a common
operation when using domain decomposition to parallelize (say) a finite
difference ocean model. There are no actual 2-D arrays used; instead
the copying of data from an array to a local buffer is simulated, and this
buffer is transferred between nodes.
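A minimal MPI sketch of the kind of exchange HALO simulates -- here
simplified to a 1-D ring in which each rank swaps a fixed-size buffer with
both of its neighbours (the buffer size is illustrative; cf. the N=2 and
N=128 columns in the table below):

#include <stdio.h>
#include <mpi.h>

#define N 128                 /* halo words per neighbour */

int main(int argc, char **argv)
{
    int rank, np, up, down, i;
    double sendbuf[2][N], recvbuf[2][N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    up   = (rank + 1) % np;          /* periodic neighbours */
    down = (rank - 1 + np) % np;

    for (i = 0; i < N; i++) sendbuf[0][i] = sendbuf[1][i] = (double)rank;

    /* exchange "halo" buffers with both neighbours */
    MPI_Sendrecv(sendbuf[0], N, MPI_DOUBLE, up,   0,
                 recvbuf[0], N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &status);
    MPI_Sendrecv(sendbuf[1], N, MPI_DOUBLE, down, 1,
                 recvbuf[1], N, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &status);

    if (rank == 0)
        printf("got %g from down, %g from up\n", recvbuf[0][0], recvbuf[1][0]);
    MPI_Finalize();
    return 0;
}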
For comparison, we have included the Halo results for the X1
and ORNL's SP4
in the following table from Wallcraft ('98).
(Revised 10/29/04)
LATENCY (us)
MACHINE CPUs METHOD N=2 N=128
Cray XD1 16 MPI 13 37
Cray X1 16 co-array 36 31
IBM SP4 16 MPI 27 32
SGI Altix 16 SHMEM 14 40
Cray X1 16 OpenMP 13 13 (SSP)
Cray X1 16 SHMEM 35 47
SGI Altix 16 OpenMP 15 48
Cray T3E-900 16 SHMEM 20 68
SGI Altix 16 MPI 19 72
SUN E10000 16 OpenMP 24 102
Cray X1 16 MPI 91 116
SGI O2K 16 SHMEM 36 113
SGI O2K 16 OpenMP 33 119
IBM SP4 16 OpenMP 58 126
HP SPP2000 16 MPI 88 209
IBM SP 16 MPI 137 222
SGI O2K 16 MPI 145 247
The Halo benchmark also compares various algorithms within
a given paradigm.
The following compares the performance using various MPI
methods on 16 MSPs for different problem sizes.
Two of the methods failed with MPI errors on the XD1.
Revised 11/24/04
The following graph compares MPI
for the HALO exchange on 4 and 16 processors.
For smaller message sizes, the XD1 is the best performer.
It is interesting that the X1 times are much higher than its 8-byte
message latency.
Revised 10/29/04
The following table shows the performance of aggregate communication
operations (barrier, broadcast, sum-reduction) using one processor
per node (N) and all processors on each node (n).
Recall that the SP4 has 32 processors per node (SP3 and Alpha, 4 per node).
Communication is between MSPs on the X1, except for the UPC data.
Times are in microseconds.
(Revised 11/16/04)
mpibarrier (average us) X1
cpus alpha-N alpha-n sp3-N sp3-n sp4-n xd1 mpi shmem coarray upc
2 7 11 22 10 3 2 3 3.0 3.2 6.1
4 7 16 45 20 5 4 3 3.2 3.4 7.1
8 8 18 69 157 7 6 5 4.8 4.9 8.5
16 9 21 93 230 9 10 6 5.6 5.8 7.3
32 11 28 118 329 10 16 5 6.3 6.6 11.0
64 37 145 419 68 5 7.1 7.2 12.1
128 6 10.0 9.9
300 9 19.9 24.3
504 10 19.0 17.7
mpibcast (8 bytes) X1
cpus alpha-N alpha-n sp3-N sp3-n sp4-n xd1 mpi shmem coarray upc
2 9.6 12.5 5.4 6.7 3.2 1.2 5.9 1.4 .3 0
4 10.4 20.3 9.4 9.4 6.2 0.7 7.2 4.1 .8 0.5
8 11.4 28.5 13.4 17.5 8.4 1.2 10.5 10.0 1.2 1.0
16 12.5 32.9 17.0 20.9 9.8 17.9 16.3 20.4 1.9 1.2
32 13.8 41.4 19.3 24.1 11.3 20.7 27.5 41.6 4.0 1.5
64 48.7 23.6 30.8 89.0 48.1 83 7.9 2.7
mpireduce (SUM, doubleword)
cpus alpha-N alpha-n sp3-N sp3-n sp4-n X1 XD1
2 9 11 8 9 6 8 0.7
4 190 207 29 133 9 11 30.5
8 623 350 271 484 13 15 68.8
16 1117 604 683 1132 18 19 108
32 3176 1991 1613 2193 29 23 215
64 5921 2841 3449 31 389
mpiallreduce (SUM, doubleword)
cpus altix X1 XD1
2 5.7 11.4 2.4
4 14.5 16.3 4.3
8 22 24.9 6.7
16 30.3 31.6 10.8
32 39.3 45.5 16.6
48 47.4 53.2 49.1
64 58.9 66.1 68.4
96 80.0
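A minimal MPI sketch of how such aggregate operations can be timed
(the repetition count is illustrative; the actual test harness differs
in detail):

#include <stdio.h>
#include <mpi.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, i;
    double t0, t1, x = 1.0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0) printf("barrier   %g us\n", (t1 - t0) / REPS * 1e6);

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0) printf("bcast     %g us\n", (t1 - t0) / REPS * 1e6);

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0) printf("allreduce %g us\n", (t1 - t0) / REPS * 1e6);

    MPI_Finalize();
    return 0;
}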
A simple bisection bandwidth test has N/2 processors sending 1 MB messages
to the other N/2.
(Revised 11/16/04).
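A minimal MPI sketch of the bisection pattern, where each rank in the lower
half sends 1 MB messages to its partner in the upper half and the aggregate
rate is the total bytes moved divided by the elapsed time (the repetition
count is illustrative; assumes an even number of ranks):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define NBYTES (1024*1024)      /* 1 MB messages */
#define REPS   100

int main(int argc, char **argv)
{
    int rank, np, partner, i;
    static char buf[NBYTES];
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    memset(buf, 0, NBYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank < np / 2) {                 /* lower half sends ...    */
            partner = rank + np / 2;
            MPI_Send(buf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD);
        } else {                             /* ... upper half receives */
            partner = rank - np / 2;
            MPI_Recv(buf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD, &status);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("aggregate %g MB/s\n",
               (double)(np / 2) * REPS * NBYTES / (t1 - t0) / 1.0e6);
    MPI_Finalize();
    return 0;
}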
Aggregate datarate (MBs)
cpus sp4 alpha X1 Altix XD1
2 138 195 12412 1074 963 X1 half populated cabinets 11/26/03
4 276 388 16245 1150 1341
8 552 752 15872 1304 2663
16 1040 1400 32626 2608 2319
32 3510 29516 2608 4606
48 35505 5064 3458
64 55553 5120 6954
96 44222 7632
128 59292 10170
200 139536
252 168107
256 49664 X1 full cabinets 6/2/04
300 68595
400 120060
500 158350
504 167832
The following compares the aggregate MPI bandwidth for processor pairs doing an exchange, where node i exchanges with node i+n/2.
Revised 11/23/04
Preliminary testing of TCP/IP performance over the local LAN
showed that the XD1 GigE interfaces could run TCP at 817 Mb/s.
Wide-area performance will be limited by the default window size of 256 KB,
but the system manager can alter this.
Since there are only two Opteron processors per SMP, we will not bother
with the various threading benchmarks.
SHMEM support for the XD1 is provided via
GPShmem.
A symmetric heap will not be supported until release 1.3.
GPShmem uses MPI for collective operations, but not for puts and
gets.
Asynchronous operations are implemented on top of
ARMCI, which implements
them directly using the native I/O.
One-way latency for gpshmem for an 8-byte message is about 4.8 us, slower
than MPI at this time.
The following graph compares gpshmem bandwidth with MPI between two XD1
nodes.
Revised 11/24/04
UPC and Co-Array Fortran are not yet supported.
PARALLEL KERNEL BENCHMARKS
The following graph shows the aggregate Mflops for a conjugate gradient
(CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP.
Revised 12/3/04
The following plots the MPI performance of NPB 2.3 FT with a blocksize of
64 words.
The following plots the aggregate Mflop performance for ParkBench QR factorization (MPI) of a 1000x1000 double-precision matrix using the vendor scientific libraries (essl/cxml/sci).
This benchmark uses BLACS (SCALAPACK).
The small problem size results in small vectors and poor X1 performance.
The XD1 used mpicc and mpif77 for building the QR kernel.
Revised 12/2/04
The following plot shows the performance of high-performance
Linpack (HPL) on 16
processors for the Cray X1 and IBM p690 with MPI and the vendor BLAS.
HPL solves a (random) dense linear system in double precision (64 bits)
using:
- two-dimensional block-cyclic data distribution
- a right-looking variant of the LU factorization with row partial pivoting
  featuring multiple look-ahead depths
- recursive panel factorization with pivot search and column broadcast combined
- various virtual panel broadcast topologies
- a bandwidth-reducing swap-broadcast algorithm
- backward substitution with look-ahead of depth 1
Cray has reported 90% of peak on the X1 using SHMEM instead of MPI.
Revised 11/1/04
Also see our low-level results for the Cray X1, SGI Altix, IBM p690, and
Opteron cluster.
Related links:
- Cray XD1 and more XD1 configuration info
- GPShmem and ARMCI
- Opteron architecture
- NWCHEM DFT performance
- AMD Opteron benchmarks
- AMD's ACML library and the Opteron libgoto library
- Opteron BIOS and kernel developer's guide
- HPCC XD1 results
- PAPI, HPET high-precision timers, rdtsc timers, and dclock
- the Opteron processor for Sandia's Red Storm
- OSC's RDMA performance on the XD1
Research Sponsors
Mathematical, Information, and Computational Sciences Division,
within the
Office of Advanced Scientific Computing Research of the Office of Science,
Department of Energy.
The application-specific evaluations are also supported by the sponsors of
the individual applications research areas.
thd@ornl.gov