ORNL Cray XT3 Evaluation
.... this is work in progress .... last revised
April 1, 2005.
ORNL is evaluating the
Cray XT3 (Red Storm),
an Opteron-based cluster interconnected
with HyperTransport and Cray's SeaStar interconnect chip.
Revision history:
4/1/05 HALO shmem mpi updates
3/17/05 hpl on 16 processors, some hpcc tests
3/8/05 HALO mpi plots (latency still 28 us) barrier, broadcast, allreduce, parbench comms1 works now
2/24/05 faster streams numbers with AMD tips for pgf77 pathscale
1/24/05 NPB mpi CG FT
1/22/05 SHMEM
1/21/05 exchange
1/20/05 testing with alpha software, buggy, latency 32 us, bw 1.1 GBs
alpha software may be using IP for message passing!
single cpu numbers: dgemm, euroben, streams
1/4/05 1 cabinet XT3 to ORNL
Test environment
Package/Version Build Date
------------------ ----------
xt-boot-1.0-77 20041204
xt-libc-1.0-58 20041204
xt-mpt-1.0-76 20041204
xt-service-1.0-65 20041204
xt-prgenv-1.0-61 20041208
xt-pe-1.0-64 20041204
xt-os-1.0-72 20041204
xt-tests-0.0-57 20041204
xt-libsci-1.0-56 20041204
xt-catamount-1.15-68 20041204
pgi-5.2.4-0 20041005
acml-2.1-1 20041007
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
An IBM SP4, an IBM Winterhawk II (noted as SP3 in the following tables
and graphs), a Cray X1, a Cray XD1,
an SGI Altix (Itanium 2), an Opteron cluster using a Quadrics interconnect,
and a Compaq Alpha ES40 at ORNL were used for comparison with the
XT3.
The initial XT3 configuration at ORNL consists of a single cabinet with
80 processors.
The ORNL XD1 test cluster consists of 34 2-way SMP
Opteron nodes (2.2 GHz) with 4 GB of memory each.
The X1 at ORNL has 8 nodes. Each node has 4 MSPs, each MSP has 4 SSPs,
and each SSP has two vector units.
The Cray "processor/CPU" in the results below is one MSP.
All 4 MSP's on a node share memory.
The Power4 consists of one node with 32 processors
sharing memory.
Both the Alpha and SP3 consist of four processors sharing memory
on a single node.
All memory is accessible on the Altix.
The following table summarizes the main characteristics of
the machines:
Specs:         Alpha SC  SP3      SP4       X1         XD1      XT3      Altix
MHz            667       375      1300      800        2200     2400     1500
memory/node    2GB       2GB      32GB      16GB       4GB      1GB      512GB
L1             64K       64K      32K       16K        64K      64K      32K
L2             8MB       8MB      1.5MB     2MB        1MB      1MB      256K
L3                                128MB                                  6MB
peak Mflops    2*MHz     4*MHz    4*MHz     12.8       2*MHz    2*MHz    4*MHz
peak mem BW    5.2GBs    1.6GBs   200+ GBs  200+ GBs?  6.4GBs   6.4GBs   6.4GBs
alpha 2 buses @ 2.6 GBs each
X1 memory bandwidth is 34 GB/s/CPU.
For the Alpha, nodes are interconnected with a
Quadrics switch
organized as a fat tree.
The SP nodes are interconnected with cross-bar switches
in an Omega-like network.
The X1 uses a modified 2-D torus and the XT3 uses a 3-D torus.
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of the XT3.
Some of the older benchmarks may need to be modified for these
newer, faster machines -- increasing repetitions to avoid zero elapsed times
and increasing problem sizes to test out-of-cache performance.
Unless otherwise noted, the following compiler switches were used:
XD1: -O3 (pgf90 v5.2-2)
X1: -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries: sci (X1),
cxml (Alpha), acml (AMD Opteron),
and essl (SP).
We used the following benchmarks in our tests:
- ParkBench 2.1 --
provides low-level sequential and communication benchmarks,
parallel linear algebra benchmarks,
NAS parallel benchmarks,
and compact application codes.
Codes are in FORTRAN.
ParkBench results are often reported as a least-squares fit of the data;
we report the actual measured performance.
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalue, FFT, QR).
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
We also use the
MAPS memory benchmark.
The web sites include results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For both the Alpha and the SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
Both have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits, so it rolls over in less than 7 seconds.
For distributed benchmarks (MPI), the IBM and Alpha systems provide a hardware
synchronized MPI_Wtime() with microsecond resolution.
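For the other custom tests we use a simple gettimeofday()-based wall-clock
timer; a minimal sketch in C (our own illustration, not the exact benchmark
harness) follows.

#include <stdio.h>
#include <sys/time.h>

/* wall-clock time in seconds, built on gettimeofday()
   (microsecond resolution where the OS supports it) */
static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

int main(void)
{
    double t0, t1;

    t0 = walltime();
    /* ... code being timed ... */
    t1 = walltime();
    printf("elapsed %g seconds\n", t1 - t0);
    return 0;
}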
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
The aggregate data rate for multiple threads is reported
in the following table.
Recall that the "peak" memory data rate is 200 GBs for the
X1, 5.2 GBs for the Alpha,
and 1.6 GBs for the SP3.
Data for the 16-way SP3 (375 Mhz, Nighthawk II)
is included too.
Data for the Alpha ES45 (1 GHz) is obtained from the
streams data base.
Data for p690/sp4 is with affinity enabled (6/1/02).
The X1 uses (aprun -A).
The Opteron is supposed to have more than 6.4 GB/s of memory bandwidth per CPU,
but we do not yet see that with stream_d.c.
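For reference, the heart of the stream benchmark is a set of simple vector
loops; the sketch below is a minimal triad-only version (not the official
stream_d.c, and the array size here is only illustrative).

#include <stdio.h>
#include <sys/time.h>

#define N 2000000               /* large enough to be out of cache */

static double a[N], b[N], c[N];

static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

int main(void)
{
    long i;
    double t;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t = walltime();
    for (i = 0; i < N; i++)     /* triad: a = b + q*c */
        a[i] = b[i] + 3.0 * c[i];
    t = walltime() - t;

    /* 3 doubles (24 bytes) moved per iteration */
    printf("triad %.1f MBs (check %g)\n", 24.0 * N / t / 1.0e6, a[N / 2]);
    return 0;
}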
Revised 2/24/05
MBs
copy scale add triad
xt3 2639 2527 2652 2788 pgf77 just O3 stream_d.f
xt3 3930 3791 4211 4308 pgf77 -O3 -mp -Mnontemporal -fastsse -Munsafe_par_align
xt3 3459 3747 3968 3968 reported by Sandia 2/23/05
xt3 4924 4928 4868 4830 reported by AMD 2.4 GHz 2/8/05 pathscale
xt3 4935 4934 5053 5069 pathscale -O3 -LNO:prefetch=2 -LNO:prefetch_ahead=9 -LNO:fusion=2 -CG:use_prefetchnta=on
xd1 2427 2369 2577 2598 pgcc -O3 pgi/5.2-4
xd1 3134 3169 3213 3217 PathScale compiler (from STREAMS site)
xd1 4098 4091 4221 4049 pathscale (see above switches)
altix 3214 3169 3800 3809
X1 22111 21634 23658 23752
alpha1 1339 1265 1273 1383
es45-1 1946 1941 1978 1978
SP3 1 523 561 581 583
SP3/16-1 486 494 601 601
SP4-1 1774 1860 2098 2119 p690
p655 3649 3767 2899 2913 ibm p655 11/3/04
power5 5356 5138 4000 4039 ibm power5 1.656 GHz 12/15/04
From AMD's published spec benchmark and McCalpin's suggested
conversion of 171.swim results to triad memory bandwidth, we get 2.7 GB/s
memory bandwidth for one Opteron processor.
The
MAPS benchmark also characterizes memory access performance.
Plotted are load/store bandwidth for sequential (stride 1) and random
access.
Load bandwidth is measured with s = s + x(i)*y(i) and store bandwidth with
x(i) = s.
Revised 12/2/04
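A rough sketch of this kind of measurement (our simplified version, not the
MAPS source) is below: load bandwidth from the dot-product loop, store
bandwidth from writing a scalar, with an index array supplying the
random-access pattern.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N 1000000

static double x[N], y[N];
static int idx[N];

static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

int main(void)
{
    int i;
    double s = 0.0, t;

    for (i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
        idx[i] = rand() % N;            /* random access pattern */
    }

    t = walltime();                     /* load: s = s + x(i)*y(i), stride 1 */
    for (i = 0; i < N; i++) s += x[i] * y[i];
    t = walltime() - t;
    printf("sequential load  %.1f MBs\n", 16.0 * N / t / 1.0e6);

    t = walltime();                     /* load, random access (x and y bytes only) */
    for (i = 0; i < N; i++) s += x[idx[i]] * y[idx[i]];
    t = walltime() - t;
    printf("random load      %.1f MBs\n", 16.0 * N / t / 1.0e6);

    t = walltime();                     /* store: x(i) = s, stride 1 */
    for (i = 0; i < N; i++) x[i] = s;
    t = walltime() - t;
    printf("sequential store %.1f MBs (check %g)\n", 8.0 * N / t / 1.0e6, s);
    return 0;
}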
The hint
benchmark measures computation and memory efficiency as
the problem size increases.
(This is C hint version 1, 1994.)
The following graph shows the performance of a single processor
for the XD1 (164 MQUIPS), X1 (12.2 MQUIPS),
Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS).
The L1 and L2 cache boundaries are visible, as well as the Altix and
SP4's L3.
Revised 1/20/05
Here are results from LMbench, a test of various OS
services.
We do not have data for the XT3 as it runs a micro-kernel on the compute
nodes.
LOW LEVEL BENCHMARKS (single processor)
The following table compares the performance of the XT3, Alpha, SP4, X1,
XD1, and Altix for basic CPU operations.
These numbers are the average Mflop/s from the first few kernels of EuroBen's mod1ac.
The 14th kernel (9th degree poly)
is a rough estimate of peak FORTRAN performance since it
has a high re-use of operands.
(Revised 1/20/05)
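The operand re-use in that kernel is easy to see; a ninth-degree polynomial
evaluated by Horner's rule (a sketch, not the EuroBen source) does 18 flops
per element while touching only one input array and one output array.

/* 9th-degree polynomial by Horner's rule: 9 multiplies and 9 adds
   per element with a single load/store pair, so the kernel is
   compute-bound rather than memory-bound */
void poly9(int n, const double *x, double *y, const double *c)
{
    int i, k;
    double p;

    for (i = 0; i < n; i++) {
        p = c[9];
        for (k = 8; k >= 0; k--)
            p = p * x[i] + c[k];
        y[i] = p;
    }
}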
alpha xt3 sp4 X1 xd1 Altix
broadcast 516 1186 1946 2483 1100 2553
copy 324 795 991 2101 733 1758
addition 285 794 942 1957 725 1271
subtraction 288 794 968 1946 720 1307
multiply 287 794 935 2041 726 1310
division 55 299 90 608 275 213
dotproduct 609 796 2059 3459 730 724
X=X+aY 526 1188 1622 4134 1088 2707
Z=X+aY 477 1192 1938 3833 1092 2632
y=x1x2+x3x4 433 1772 2215 3713 1629 2407
1st ord rec. 110 197 215 48 181 142
2nd ord rec. 136 266 268 46 257 206
2nd diff 633 1791 1780 4960 1636 2963
9th deg. poly 701 1723 2729 10411 1576 5967
basic operations (Mflops) euroben mod1ac
The following table compares the performance of various intrinsics
(EuroBen mod1f).
For the SP, it also shows the effect of -O4 optimization versus -O3.
(Revised 1/20/05)
xt3 sp3 -O4 sp3 -O3 sp4 -O4 X1 xd1 altix
x**y 8.1 1.8 1.6 7.1 49 7.4 13.2
sin 36.6 34.8 8.9 64.1 97.9 33.7 22.9
cos 24.3 21.4 7.1 39.6 71.4 21.8 22.9
sqrt 99.2 52.1 34.1 93.9 711 91.8 107
exp 29.4 30.7 5.7 64.3 355 26.8 137
log 42.5 30.8 5.2 59.8 185 39.2 88.5
tan 29.3 18.9 5.5 35.7 85.4 27.0 21.1
asin 21.6 10.4 10.2 26.6 107 23.4 29.2
sinh 19.8 2.3 2.3 19.5 82.6 18.2 19.1
intrinsics (Mcalls/s) euroben mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix (REAL*8 400x400) multiply compared with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lsci for the X1,
-lessl for the SP).
Note, the SP4 -lessl (3.3) is tuned for the Power4.
The Mflops for 1000x1000 Linpack are also reported
from netlib,
except the SP4 number, which is from
IBM.
(Revised 10/29/04)
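The "ftn" row comes from a plain triple loop and the "lib" row from DGEMM;
the C sketch below shows the two versions side by side (assuming a
CBLAS-style interface to the vendor BLAS -- the actual runs call DGEMM from
FORTRAN, and vendor headers vary).

#include <cblas.h>      /* assumed CBLAS interface; not all vendor libraries ship it */

#define N 400

static double a[N][N], b[N][N], c[N][N];

/* plain triple loop, the "ftn" row in the table */
void matmul_naive(void)
{
    int i, j, k;
    double s;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            s = 0.0;
            for (k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

/* the same operation through the tuned BLAS, the "lib" row */
void matmul_blas(void)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, &a[0][0], N, &b[0][0], N,
                0.0, &c[0][0], N);
}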
alpha sp3 sp4 X1 xd1 altix
ftn 72 45 220 7562 147 228
lib 1182 1321 3174 9482 3773 5222
linpack 1031 1236 2894 3955
The following plot compares the performance of
the scientific library DGEMM.
Sandia reports 4.26 Gflops for DGEMM on their XT3.
(Revised 1/20/05).
The following graph compares the vendor library implementation of
an LU factorization (DGETRF) using
partial pivoting with row interchanges.
Revised 1/20/05
The following graph shows there is little performance difference between
libgoto
and AMD's -lacml library.
Revised 1/20/05
The following plot compares the DAXPY performance of the Opteron and Itanium
(Altix) using vendor math libraries.
Revised 10/29/04
The following table compares the single
processor performance (Mflops) of the
XT3, Altix, SP4, X1, and XD1 for the Euroben mod2g,
a 2-D Haar wavelet transform test.
(Revised 1/20/05)
|--------------------------------------------------------------------------
| Order | xt3 | altix | SP4 | X1 | xd1 |
| n1 | n2 | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s)|(Mflop/s)|
|--------------------------------------------------------------------------
| 16 | 16 | 471.05 | 150.4 | 126.42 | 10.5 | 364 |
| 32 | 16 | 535.68 | 192.1 | 251.93 | 13.8 | 466 |
| 32 | 32 | 611.52 | 262.3 | 301.15 | 20.0 | 468 |
| 64 | 32 | 600.22 | 252.7 | 297.26 | 22.7 | 498 |
| 64 | 64 | 529.22 | 242.5 | 278.45 | 25.9 | 351 |
| 128 | 64 | 461.31 | 295.6 | 251.90 | 33.3 | 285 |
| 128 | 128 | 362.02 | 350.2 | 244.45 | 48.5 | 198 |
| 256 | 128 | 268.67 | 211.2 | 179.43 | 45.8 | 130 |
| 256 | 256 | 172.92 | 133.3 | 103.52 | 46.7 | 138 |
| 512 | 256 | 130.86 | 168.7 | 78.435 | 52.1 | 118 |
|--------------------------------------------------------------------------
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library.
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
using the BLAS from the vendor library.
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
(Revised 10/29/04)
The following figure shows the FORTRAN Mflops for one processor for various
problem sizes
for the EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache effects are still apparent.
(Revised 10/29/04).
The following compares a 1-D FFT using the
FFTW benchmark.
The following graph plots 1-D FFT performance using the vendor
library (-lacml, -lscs, -lsci, or -lessl); initialization time is not included.
Revised 10/29/04
MESSAGE-PASSING BENCHMARKS
The XT3 processors are interconnected in a 3-D torus using HyperTransport
and Cray's SeaStar interconnect chip.
The peak bidirectional bandwidth of an XT3 link is 7.6 GBs, with 4 GBs sustained.
Internode communication can be accomplished with IP, PVM, or MPI.
We report MPI performance over the Alpha Quadrics network and the
IBM SP.
Each SP node (4 CPUs) shares a single network interface.
However, each CPU is a unique MPI end point, so one can measure
both inter-node and intra-node communication.
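The latency and bandwidth numbers come from a simple echo (ping-pong) test;
a minimal MPI sketch of the measurement is below (8-byte messages for
latency; the bandwidth number repeats the same loop with 1 MB messages, and
our actual harness sweeps message sizes).

#include <stdio.h>
#include <mpi.h>

#define REPS 1000

int main(int argc, char **argv)
{
    static char buf[1 << 20];           /* big enough for the 1 MB bandwidth case */
    int rank, i;
    double t;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* echo test between ranks 0 and 1: one-way latency = round trip / 2 */
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, 8, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t = MPI_Wtime() - t;
    if (rank == 0)
        printf("latency %.1f us (1-way)\n", 0.5e6 * t / REPS);

    /* for bandwidth, rerun the loop with 1 MB messages and report
       2 * messagesize * REPS / elapsed for the echo */
    MPI_Finalize();
    return 0;
}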
The following table summarizes the measured communication characteristics
between nodes
of the X1, Alpha, SP3, and the SP4.
The SP4 is currently based on the Colony switch attached via PCI.
The XD1 uses Infiniband hardware with an interface to the Opteron HyperTransport.
Latency is for an 8-byte message.
(Revised 11/26/03)
alpha altix sp4 X1 XD1
latency (1 way, us) 5.4 1.1 17 7.3 1.5 (X1 3.8 SHMEM, 3.9 coarray)
bandwidth (echo, MBs) 199 1955 174 12125 1335
MPI within a node 622 1968 2186 966
latency (min, 1 way, us) and bandwidth (MBs)
-- latency and bandwidth (min 1-way us, MBs)
XT3 29.4 1136 ? alpha software?
XD1 1.5 1335
XD1 in SMP 1.5 962
X1 node 7.3 11776
X1 MSP 7.3 12125
altix cpu 1.1 1968
alitx node 1.1 1955
alpha node 5.5 198
alpha cpu 5.8 623
alpha IP-sw 123 77
alpha IP-gigE/1500 76 44
alpha IP-100E 70 11
sp3 node 16.3 139
sp3 cpu 8.1 512
sp4 node 7 975 (Federation)
sp4 node 6 1702 (Federation dual rail)
sp4 node 17 174 (PCI/Colony)
sp4 cpu 3 2186
sp3 IP-sw 82 46
sp3 IP-gigE/1500 91 47
sp3 IP-gigE/9000 136 84
sp3 IP-100E 93 12
Early bandwidth results for MPI and SHMEM on the XT3 (alpha software).
The following graph shows bandwidth for communication between two
processors using MPI
from both EuroBen's mod1h and ParkBench comms1.
(ParkBench running on XT3 now, 3/8/05.)
(Revised 1/24/05).
The
HALO benchmark is a synthetic benchmark that simulates the nearest-neighbor
exchange of a 1-2 row/column "halo" from a 2-D array. This is a common
operation when using domain decomposition to parallelize (say) a finite
difference ocean model. There are no actual 2-D arrays used; instead,
the copying of data from an array to a local buffer is simulated, and this
buffer is transferred between nodes.
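A minimal sketch of one such exchange step with plain MPI_Sendrecv is below
(a 1-D ring of ranks with left/right neighbors is assumed; the real benchmark
compares several MPI, SHMEM, and co-array variants).

#include <stdio.h>
#include <mpi.h>

#define N 128                   /* words per halo buffer */

int main(int argc, char **argv)
{
    double sendl[N], sendr[N], recvl[N], recvr[N];
    int rank, np, left, right, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    left  = (rank + np - 1) % np;       /* neighbors on a 1-D ring */
    right = (rank + 1) % np;

    for (i = 0; i < N; i++) sendl[i] = sendr[i] = (double)rank;

    /* exchange halo buffers with both neighbors */
    MPI_Sendrecv(sendr, N, MPI_DOUBLE, right, 0,
                 recvl, N, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);
    MPI_Sendrecv(sendl, N, MPI_DOUBLE, left,  1,
                 recvr, N, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, &status);

    if (rank == 0)
        printf("halo word from left neighbor: %g\n", recvl[0]);
    MPI_Finalize();
    return 0;
}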
For comparison, we have included the HALO results for the X1
and ORNL's SP4
in the following table from Wallcraft ('98).
(Revised 1/22/05)
LATENCY (us)
MACHINE CPUs METHOD N=2 N=128
Cray XT3 16 MPI 150 150 ?alpha software
Cray XD1 16 MPI 13 37
Cray X1 16 co-array 36 31
IBM SP4 16 MPI 27 32
SGI Altix 16 SHMEM 14 40
Cray X1 16 OpenMP 13 13 (SSP)
Cray X1 16 SHMEM 35 47
SGI Altix 16 OpenMP 15 48
Cray T3E-900 16 SHMEM 20 68
SGI Altix 16 MPI 19 72
SUN E10000 16 OpenMP 24 102
Cray X1 16 MPI 91 116
SGI O2K 16 SHMEM 36 113
SGI O2K 16 OpenMP 33 119
IBM SP4 16 OpenMP 58 126
HP SPP2000 16 MPI 88 209
IBM SP 16 MPI 137 222
SGI O2K 16 MPI 145 247
The HALO benchmark also compares various algorithms within
a given paradigm.
The following compares the performance using various MPI
methods on 16 processors for different problem sizes.
Revised 3/8/05
The following graph compares MPI
for the HALO exchange on 4 and 16 processors.
For smaller message sizes, the XD1 is the best performer.
It is interesting that the X1 times are much higher than its 8-byte
message latency.
Revised 4/1/05
Until XT3 MPI latency is reduced,
we will not bother with the following MPI tests. ...
The following table shows the performance of aggregate communication
operations (barrier, broadcast, sum-reduction) using one processor
per node (N) and all processors on each node (n).
Recall that the sp4 has 32 processors per node (sp3 and alpha, 4 per node).
Communication is between MSPs on the X1, except for the UPC data.
Times are in microseconds.
(Revised 3/8/05)
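The measurements are simple repetition loops around the collective call; a
minimal sketch for the barrier and sum-reduction cases (the broadcast timing
is done the same way) follows.

#include <stdio.h>
#include <mpi.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, i;
    double t, val = 1.0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* average time per MPI_Barrier over REPS repetitions */
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime() - t;
    if (rank == 0)
        printf("barrier   %.1f us\n", 1.0e6 * t / REPS);

    /* same pattern for a doubleword sum-reduction */
    t = MPI_Wtime();
    for (i = 0; i < REPS; i++)
        MPI_Allreduce(&val, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t = MPI_Wtime() - t;
    if (rank == 0)
        printf("allreduce %.1f us\n", 1.0e6 * t / REPS);

    MPI_Finalize();
    return 0;
}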
mpibarrier (average us) X1
cpus alpha-N alpha-n sp4-n xd1 mpi shmem coarray upc XT3
2 7 11 3 2 3 3.0 3.2 6.1 39
4 7 16 5 4 3 3.2 3.4 7.1 79
8 8 18 7 6 5 4.8 4.9 8.5 119
16 9 21 9 10 6 5.6 5.8 7.3 159
32 11 28 10 16 5 6.3 6.6 11.0 199
64 37 68 5 7.1 7.2 12.1 239
128 6 10.0 9.9
300 9 19.9 24.3
504 10 19.0 17.7
mpibcast (8 bytes) X1
cpus alpha-N alpha-n sp4-n xd1 mpi shmem coarray upc XT3
2 9.6 12.5 3.2 1.2 5.9 1.4 .3 0 14.4
4 10.4 20.3 6.2 0.7 7.2 4.1 .8 0.5 29.1
8 11.4 28.5 8.4 1.2 10.5 10.0 1.2 1.0 43.5
16 12.5 32.9 9.8 17.9 16.3 20.4 1.9 1.2 58.2
32 13.8 41.4 11.3 20.7 27.5 41.6 4.0 1.5 72.4
64 48.7 89.0 48.1 83 7.9 2.7 86.7
mpireduce (SUM, doubleword)
cpus alpha-N alpha-n sp3-N sp3-n sp4-n X1 XD1 XT3
2 9 11 8 9 6 8 0.7 15.5
4 190 207 29 133 9 11 30.5 crash?
8 623 350 271 484 13 15 68.8
16 1117 604 683 1132 18 19 108
32 3176 1991 1613 2193 29 23 215
64 5921 2841 3449 31 389
mpiallreduce (SUM, doubleword)
cpus altix X1 XD1 XT3
2 5.7 11.4 2.4 35
4 14.5 16.3 4.3 70
8 22 24.9 6.7 104
16 30.3 31.6 10.8 151
32 39.3 45.5 16.6 191
48 47.4 53.2 49.1
64 58.9 66.1 68.4 233
96 80.0
A simple bisection bandwidth test has N/2 processors sending 1 MB messages
to the other N/2.
(Revised 1/20/05).
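A sketch of the bisection test: the lower half of the ranks each send 1 MB
messages to a partner in the upper half, and the aggregate rate is the total
bytes moved divided by the slowest rank's time (a simplified version of our
harness).

#include <stdio.h>
#include <mpi.h>

#define NBYTES (1 << 20)        /* 1 MB messages */
#define REPS 10

int main(int argc, char **argv)
{
    static char buf[NBYTES];
    int rank, np, half, partner, i;
    double t, tmax;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    half    = np / 2;
    partner = (rank < half) ? rank + half : rank - half;

    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank < half)
            MPI_Send(buf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD, &status);
    }
    t = MPI_Wtime() - t;

    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("aggregate %.0f MBs\n",
               (double)half * REPS * NBYTES / tmax / 1.0e6);
    MPI_Finalize();
    return 0;
}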
Aggregate datarate (MBs)
cpus sp4 xt3 X1 Altix XD1
2 138 1099 12412 1074 963 X1 half populated cabinets 11/26/03
4 276 2162 16245 1150 1341
8 552 4392 15872 1304 2663
16 1040 8648 32626 2608 2319
32 3510 16736 29516 2608 4606
48 25944 35505 5064 3458
64 33961 55553 5120 6954
96 44222 7632
128 59292 10170
200 139536
252 168107
256 49664 X1 full cabinets 6/2/04
300 68595
400 120060
500 158350
504 167832
The following compares the aggregate MPI bandwidth for processor pairs doing an exchange, where node i exchanges with node i+n/2.
Revised 1/22/05
Preliminary testing of TCP/IP performance over the local LAN
showed that the XT3 GigE interfaces could run TCP at 898 Mbs.
Wide-area performance will be limited by the default TCP window size of 256 KB,
but the system manager can alter this.
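An application can also request larger socket buffers itself (up to the
system maximum) through the standard sockets API; a hedged sketch is below,
and the 4 MB value in the comment is only an example.

#include <sys/socket.h>

/* request larger TCP send/receive buffers on an open socket;
   the kernel may clamp the request to its configured maximum */
int set_tcp_window(int sock, int bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    return 0;
}

/* example: set_tcp_window(sock, 4 * 1024 * 1024); before connect() or listen() */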
Preliminary SHMEM tests on the XT3 yield an 8-byte latency of 38 us
and 1.1 GBs bandwidth (alpha-software performance).
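The SHMEM latency test follows the same pattern as the MPI echo test; a
minimal sketch using the SHMEM put interface is below (start_pes/shmem_my_pe
and a symmetric static buffer are assumed; this is an illustration, not the
exact test code).

#include <stdio.h>
#include <sys/time.h>
#include <shmem.h>              /* mpp/shmem.h on some Cray systems */

#define REPS 1000

static double src[1], dst[1];   /* symmetric (static) data */

static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

int main(void)
{
    int me, i;
    double t;

    start_pes(0);
    me = shmem_my_pe();

    shmem_barrier_all();
    t = walltime();
    if (me == 0) {
        for (i = 0; i < REPS; i++) {
            shmem_double_put(dst, src, 1, 1);   /* 8-byte put to PE 1 */
            shmem_quiet();                      /* wait for completion */
        }
    }
    t = walltime() - t;
    shmem_barrier_all();

    if (me == 0)
        printf("put latency %.1f us\n", 1.0e6 * t / REPS);
    return 0;
}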
HALO performance on 16 processors is illustrated in the next plot.
Revised 4/1/05
PARALLEL KERNEL BENCHMARKS
The following graph shows the aggregate Mflops for a conjugate gradient
(CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP.
Revised 1/24/05
The following plots the MPI performance of NPB 2.3 FT with a blocksize of
64 words.
The following plots the aggregate Mflop performance for ParkBench QR factorization (MPI) of 1000x1000 double precision matrix using the vendor scientific libraries (essl/cxml/sci).
This benchmark uses BLACS (SCALAPACK).
The small problem size results in small vectors and poor X1 performance.
The XD1 used mpicc and mpif77 for building the QR kernel.
Revised 12/2/04
The following plot shows the performance of high-performance
Linpack (HPL) on 16
processors for the Cray X1 and IBM p690 with MPI and the vendor BLAS.
HPL solves a (random) dense linear system in double precision (64 bits)
using:
- two-dimensional block-cyclic data distribution
- right-looking variant of the LU factorization with row partial pivoting, featuring multiple look-ahead depths
- recursive panel factorization with pivot search and column broadcast combined
- various virtual panel broadcast topologies
- a bandwidth-reducing swap-broadcast algorithm
- backward substitution with look-ahead of depth 1
Cray has reported 90% of peak on the X1 using SHMEM instead of MPI.
Revised 3/17/05
HPCC results (see hpcc) for 64 processors.
XT3 results with N=10000, revised 3/17/05.
             HPL      PTRANS   GUPS    triad   Bndwth   latency
             tflops   GBs      Gup/s   GBs     GBs      us
Cray X1      0.522    12.4     0.01    30      0.99     20.2
Cray XD1     0.224    10.6     0.02    2.7     0.22     1.63
Cray XT3     0.138    12.1     0.04    2.6     1.14     29.6
Sandia (2/23/05) reports HPCC numbers on 552 nodes of 1.4 Tflops for HPL (55% of peak)
and PTRANS at 49.6 GBs.
Also see our low-level results for the
Cray X1,
Cray XD1,
SGI Altix,
IBM p690,
Opteron cluster, and
Cray XT3.
Other links of interest:
- PSC news release
- yod
- Opteron architecture
- NWCHEM DFT performance
- AMD Opteron benchmarks
- AMD's ACML library (or here)
- Opteron library libgoto
- Opteron bios and kernel developer's guide
- papi
- hpet high precision timers
- rdtsc timers
- dclock
- the processor for Sandia's Red Storm
Research Sponsors
Mathematical, Information, and Computational Sciences Division,
within the
Office of Advanced Scientific Computing Research of the Office of Science,
Department of Energy.
The application-specific evaluations are also supported by the sponsors of
the individual applications research areas.
Last Modified
thd@ornl.gov
back to Tom Dunigan's page
or the ORNL Evaluation of Early Systems page