ORNL IBM Power4 (p690) evaluation
Recent results (also see our recent Cray X1 results
and our SGI Altix results)
- 4/21/05 -- two-rail Federation: latency 6 us, bandwidth 1635 MB/s
- 6/15/04 -- Federation tests after microcode upgrade (4 paths/2 rails for multiple processors only)
                          Federation   colony today   colony '02   1-rail '02
latency (us)                       7             19           17
bandwidth (MB/s)                1486            306          345          175
exchange (1x1) (MB/s)           1037            273          367          215
exchange (32x32) (MB/s)         6938            394
The following graph compares Colony and Federation for the EuroBen and ParkBench
communication benchmarks. Revised 6/17/04.
The following plot compares the variations in round-trip time for an
8-byte message between nodes (1000 tests).
- 10/14/03 --
Federation switch testing on two 32-CPU nodes.
Federation specs: 2 GB/s per port (2 ports); latency: 5 to 9 us.
- updated message-passing results, Colony/PCI, dual
rail, 6/1/02
- updated stream results with affinity on p690 5/15/02
- LPAR testing 2/15/02
We compared the performance of an 8-CPU/8 GB LPAR to 8-thread tests
on a 16-CPU/16 GB node and a 32-CPU/32 GB node.
Arguably, when running just an 8-CPU test, the other CPUs on the non-LPAR
nodes are wasted (unavailable).
QR factorization, MPI, 8 CPUs
Machine   Gflops
LPAR         7.1
16x16        7.9
32x32        8.2
The following reveals slight performance loss for competing LPARs on
the same node.
NAS OpenMP SP.C 8 threads
32x32 3.7
16x16 3.5
1 LPAR 2.83
2 LPARs 2.79 (fastest of the 2)
3 LPARs 2.72
4 LPARs 2.68
In the following we look at the STREAM memory benchmark aggregate triad
memory bandwidth for various numbers of threads on an LPAR and on full nodes.
The LPAR threads are restricted to one L3 and 8 GB of memory.
Notice that for one thread, the LPAR does better (probably due
to lower memory latency, not having to access other L3's).
This is consistent with Worley's results (next bullet).
- Worley's
lpar tests 2/15/02
- Results from a few tests on 32 processors (1/14/2002)
- also view Pat Worley's
power4 test results
The following data were collected in October 2001.
NOTE: The tests on the IBM p690 use early versions
of the compilers and libraries, and we expect performance to continue
to improve with new releases.
Large page support and page affinity will be provided soon and
should also improve results.
Oak Ridge National Laboratory (ORNL) is currently performing an in-depth
evaluation of the IBM Power 4 (p690) system
as part of its Evaluation of Early Systems project.
The primary tasks of the evaluation are to
- determine the most effective approaches to using the IBM p690;
- evaluate benchmark and application performance, and compare with similar
systems from other vendors;
- evaluate system and system administration software reliability and
performance;
- predict scalability, both in terms of problem size and in number of
processors.
The emphasis of the evaluation is on application-relevant studies
for applications of importance to DOE.
However, standard benchmarks are still important for comparisons
with other systems.
The results presented here are from standard benchmarks and some
custom benchmarks and, as such, represent only one part of the
evaluation.
A large IBM Winterhawk II (noted as SP3 in the following tables
and graphs) and Compaq Alpha ES40 at ORNL were used for comparison with the
IBM p690 (noted as SP4 in the tables and graphs).
The results below are in the following categories:
ARCHITECTURE
The present Power4 consists of one node with 16 processors (2 MCM's)
sharing memory.
Both the Alpha and SP3 consist of four processors sharing memory
on a single node.
The following table summarizes the main characteristics of
the machines:
Specs:          Alpha SC   SP3        SP4
MHz             667        375        1300
memory/node     2 GB       2 GB       32 GB
L1              64 KB      64 KB      32 KB
L2              8 MB       8 MB       1.5 MB
L3              --         --         128 MB
peak Mflops     2*MHz      4*MHz      4*MHz
peak mem BW     5.2 GB/s   1.6 GB/s   200+ GB/s ?
(Alpha: 2 buses @ 2.6 GB/s each)
For the Alpha, nodes are interconnected with a
Quadrics switch
organized as a fat tree.
The SP3 nodes are interconnected with cross-bar switches
in an Omega-like network.
BENCHMARKS
We have used widely available benchmarks in combination with
our own custom benchmarks to characterize the performance
of the SP4.
Some of the older benchmarks may need to be modified for these
newer, faster machines -- increasing repetitions to avoid zero elapsed times,
and increasing problem sizes to test out-of-cache performance.
Unless otherwise noted, the following compiler switches
were used on the Alpha and SP.
Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000
Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP.
We also compared performance with the vendor runtime libraries, cxml (Alpha)
and essl (SP).
We used the following benchmarks in our tests:
- ParkBench 2.1 --
provides low-level sequential and communication benchmarks,
parallel linear algebra benchmarks,
NAS parallel benchmarks,
and compact application codes.
Here is a summary of the benchmark modules.
Codes are in FORTRAN.
Results are often reported as a least-squares fit of the data.
We report actual performance numbers.
- EuroBen 3.9 -- provides
serial benchmarks for low-level performance and application
kernels (linear algebra, eigenvalue, FFT, QR).
Here is a summary of the benchmark modules.
euroben-dm provides some communication and parallel (MPI)
benchmarks.
The web site includes results from other systems.
- lmbench --
provides insight into OS (UNIX) performance and memory latencies.
The web site includes results from other systems.
- stream --
measures memory bandwidth for both serial and parallel configurations.
The web site includes results from other systems.
- Custom low-level benchmarks that we have used over the years
in evaluating memory and communication performance.
For both the Alpha and the SP, gettimeofday() provides
microsecond wall-clock time (though one has to be sure the MICROTIME
option is set in the Alpha OS kernel).
Both have high-resolution cycle counters as well, but the Alpha
cycle counter is only 32 bits, so it rolls over in less than 7 seconds.
For distributed benchmarks (MPI), both systems provide a hardware-synchronized
MPI_Wtime() with microsecond resolution.
On the Alpha, MPI_Wtime is frequency synchronized, but initial
offsets are only approximate. (It appears MPI_Init tries
to provide an initial zero offset to the Elan counters
on each node when an MPI job starts.
On the SP3, we discovered several nodes that were not synchronized; a patch
was eventually provided.)
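As an illustration (a minimal sketch, not our actual harness), timing a code
section with both clocks might look like this in C:

    #include <stdio.h>
    #include <sys/time.h>
    #include <mpi.h>

    /* wall-clock seconds via gettimeofday(); microsecond resolution
       (the MICROTIME option must be set in the Alpha OS kernel) */
    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double t0 = walltime();
        double w0 = MPI_Wtime();     /* hardware-synchronized across nodes */
        /* ... section being timed ... */
        printf("gettimeofday: %f s   MPI_Wtime: %f s\n",
               walltime() - t0, MPI_Wtime() - w0);
        MPI_Finalize();
        return 0;
    }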
We recently received an essl tuned for the power4. The following
graph plots the power3 essl vs the new power4 essl (3.3) for
the Euroben mod2b benchmark.
The newer essl provides significant improvement.
Also plotted is the performance of mod2b without essl,
comparing arch=pwr3 versus arch=pwr4 for xlf (both run on the power4).
The large difference between the
"compiled from source" version and the ESSL version is typical -- even
the best compilers are quite sensitive to the details of the source
code organization for this class of algorithms.
Memory performance
Both the SP3 and the Alpha have 64 KB L1 caches and 8 MB L2 caches.
The SP4 has a 32 KB L1 (FIFO), a 1.4 MB L2 (shared between two processors),
and a 128 MB L3.
The following figure shows the data rates for a simple FORTRAN
loop to load ( y = y+x(i)), store (y(i)=1), and
copy (y(i)=x(i)), for different vector sizes.
Data is also included for four threads.
(Beware of the linear interpolation between data points, and
note we need to extend the test beyond 128 MB to get out of the SP4
L3 cache.
It has been suggested that the SP4's "dcbz" instruction, which
allocates the target cache line in the L2 without loading it from
memory first, could further improve SP4 performance.
Also see McCalpin's
stream2 benchmark.)
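The loops themselves are trivial; a rough C rendering of the test follows
(the real version is FORTRAN and sweeps the vector size; n here is fixed
for brevity):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static double now(void)            /* gettimeofday() wall clock, seconds */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    int main(void)
    {
        long n = 1L << 24;             /* 128 MB per vector; sweep n in the real test */
        double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
        double s = 0.0, t;
        for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 0.0; }

        t = now();                     /* load:  y = y + x(i) */
        for (long i = 0; i < n; i++) s += x[i];
        t = now() - t;
        printf("load  %6.0f MB/s (s=%g)\n", 8.0 * n / t / 1e6, s);

        t = now();                     /* store: y(i) = 1 */
        for (long i = 0; i < n; i++) y[i] = 1.0;
        t = now() - t;
        printf("store %6.0f MB/s\n", 8.0 * n / t / 1e6);

        t = now();                     /* copy:  y(i) = x(i), 16 bytes moved each */
        for (long i = 0; i < n; i++) y[i] = x[i];
        t = now() - t;
        printf("copy  %6.0f MB/s\n", 16.0 * n / t / 1e6);

        free(x); free(y);
        return 0;
    }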
At the tail end of the graph above, the program starts fetching
data from main memory.
For load, a single Alpha thread reads data at 1.7 GB/s,
the SP3 at 787 MB/s.
For four threads, the per-CPU load rate drops to 811 MB/s for
the Alpha and 322 MB/s for the SP.
The aggregate rate for 4 CPUs from the test is then
3.2 GB/s for the Alpha compared to 1.3 GB/s for the SP.
The stream benchmark
is a program that measures main memory throughput for several
simple operations.
The aggregate data rate for multiple threads is reported
in the following table.
Recall that the "peak" data rate for the Alpha is 5.2 GB/s
and for the SP3 is 1.6 GB/s.
Data for the 16-way SP3 (375 MHz, Nighthawk II)
is included too.
Data for the Alpha ES45 (1 GHz) is obtained from the
streams database.
Data for the p690/sp4 is with affinity enabled (6/1/02).
copy scale add triad
alpha1 1339 1265 1273 1383
alpha2 1768 1711 1839 1886
alpha3 2279 2280 2257 2308
alpha4 2375 2323 2370 2427
es45-1 1946 1941 1978 1978
es45-2 2615 2592 2825 2850
es45-4 3487 3487 3527 3584
SP3 1 523 561 581 583
SP3 2 686 797 813 909
SP3 3 833 805 897 914
SP3 4 824 799 889 914
SP3/16-1 486 494 601 601
SP3/16-2 953 969 1161 1153
SP3/16-3 1422 1408 1775 1757
SP3/16-4 1703 1724 1955 1982
SP3/16-8 4850 4601 5060 5211
SP3/16-16 5475 5325 5976 5924
SP4-1 1774 1860 2098 2119
SP4-2 3513 3684 4166 4225
SP4-4 7170 7463 8238 8075
SP4-8 13101 13300 14986 14825
SP4-16 21598 21156 24106 23609
SP4-32 27271 26072 29539 28750
stream (f90/omp) multiple threads (aggregate MB/sec)
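For reference, the triad kernel behind these numbers is a single line; a
hedged C/OpenMP rendering follows (the runs above used the stock f90/OpenMP
STREAM source):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 20000000L               /* 160 MB per array: past the 128 MB L3 */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        double scalar = 3.0, t;
        long j;

        #pragma omp parallel for
        for (j = 0; j < N; j++) { b[j] = 2.0; c[j] = 1.0; }

        t = omp_get_wtime();
        #pragma omp parallel for      /* triad: a(j) = b(j) + q*c(j) */
        for (j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];
        t = omp_get_wtime() - t;

        /* 24 bytes move per element: read b, read c, write a */
        printf("triad %.0f MB/s with up to %d threads\n",
               24.0 * N / t / 1e6, omp_get_max_threads());
        free(a); free(b); free(c);
        return 0;
    }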
The following graphs the triad bandwidth from the previous table.
The p690/sp4 results are from June 2002 with affinity supported by
AIX. McCalpin reports the following improvement of aggregate STREAM bandwidth
with affinity on a 32-processor p690 (4/15/02).
          no affinity   with affinity
Copy            22421           28611   22%
Scale           21411           28994   26%
Add             24830           32222   23%
Triad           25501           32249   21%
aggregate MB/s
It is expected that the power4 will show even higher memory
bandwidth when using 16 MB (large) pages.
The following figure shows the Mflops for one processor for various
problem sizes
for the EuroBen mod2f, a 1-D FFT.
Data access is irregular, but cache boundaries are still apparent.
The hint
benchmark measures computation and memory efficiency as
the problem size increases.
(This is hint Version 1, 1994.)
The following graph shows the performance of a single processor
for the Alpha (66.9 MQUIPS), SP3 (27.3 MQUIPS), and SP4 (74.9 MQUIPS).
The L1 and L2 cache boundaries are visible, as well as the SP4's L3 (128 MB).
The
lmbench benchmark
measures various UNIX and system characteristics.
Here are some preliminary numbers
for runs on a service and compute node of alpha and SP3/4 (version 2).
(Results from previous lmbench version can be found here.)
Open/close times are much slower for the Alpha, though file create/delete
are faster on the Alpha.
The cache/memory latencies reported by lmbench are
alpha sp3
L1 4 5
L2 27 32
memory 210 300
latency in nanoseconds
LOW LEVEL BENCHMARKS
The following table compares the performance of the Alpha and SP
for basic CPU operations.
These numbers are from the first 14 kernels of EuroBen's mod1ac.
The 14th kernel is a rough estimate of peak FORTRAN performance since it
has a high re-use of operands.
alpha sp3 sp4
broadcast 516 368 1946
copy 324 295 991
addition 285 186 942
subtraction 288 166 968
multiply 287 166 935
division 55 64 90
dotproduct 609 655 2059
X=X+aY 526 497 1622
Z=X+aY 477 331 1938
y=x1x2+x3x4 433 371 2215
1st ord rec. 110 107 215
2nd ord rec. 136 61 268
2nd diff 633 743 1780
9th deg. poly 701 709 2729
basic operations (Mflops) euroben mod1ac
The following table compares the performance of various intrinsics
(EuroBen mod1f).
For the SP, it also shows the effect of -O4 optimization versus -O3.
alpha sp3 -O4 sp3 -O3 sp4 -O4
x**y 8.3 1.8 1.6 7.1
sin 13 34.8 8.9 64.1
cos 12.8 21.4 7.1 39.6
sqrt 45.7 52.1 34.1 93.9
exp 15.8 30.7 5.7 64.3
log 15.1 30.8 5.2 59.8
tan 9.9 18.9 5.5 35.7
asin 13.3 10.4 10.2 26.6
sinh 10.7 2.3 2.3 19.5
intrinsics (Mcalls/s) euroben mod1f (N=10000)
The following table compares the performance (Mflops) of a simple
FORTRAN matrix multiply (REAL*8, 400x400) with the performance
of DGEMM from the vendor math library (-lcxml for the Alpha,
-lessl for the SP).
Note that the SP4 -lessl (3.3) is tuned for the Power4.
Also reported are the Mflops for the 1000x1000 Linpack
from netlib,
except the sp4 number, which is from
IBM.
alpha sp3 sp4
ftn 72 45 220
lib 1182 1321 3174
linpack 1031 1236 2894
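For illustration, the "ftn" and "lib" rows correspond roughly to the
following C sketch; the benchmark itself is FORTRAN, and the dgemm symbol
name and calling convention shown here are assumptions that vary by
platform (dgemm with ESSL/xlc, dgemm_ with cxml/g77 conventions):

    /* naive triple loop, roughly the "ftn" row; column-major storage
       to mirror FORTRAN: A(i,k) = a[i + k*n] */
    void matmul(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += a[i + k*n] * b[k + j*n];
                c[i + j*n] = s;
            }
    }

    /* the vendor routine via the Fortran BLAS interface ("lib" row);
       symbol name is platform-dependent, as noted above */
    extern void dgemm_(const char *ta, const char *tb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    void matmul_blas(int n, const double *a, const double *b, double *c)
    {
        double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
    }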
In the following graph, the performance of the
ATLAS DGEMM (xdl3blastst -F ) is compared with the vendor libraries.
The plot includes data from the new Compaq ES45 (1 GHz).
The p690 achieves only 65% of peak because of insufficient rename
registers.
The Alphas and the sp3 achieve a much higher percentage of peak.
The following table compares
optimized FORTRAN performance (no essl/cxml) for Euroben mod2a,
matrix-vector dot product and product.
-------------------------------------------------
alpha sp3 sp4
Problem size| MxV-ddot | MxV-ddot | MxV-ddot |
m | n | (Mflop/s) | (Mflop/s) | (Mflop/s) |
--------------------------------------------------
100 | 100 | 411.7 | 423.9 | 783.4 |
200 | 200 | 442.3 | 416.8 | 808.4 |
500 | 500 | 66.1 | 18.7 | 148.4 |
1000 | 1000 | 31.8 | 17.1 | 91.8 |
2000 | 2000 | 27.5 | 16.1 | 69.9 |
--------------------------------------------------
--------------------------------------------------
alpha sp3 sp4
Problem size| MxV-axpy | MxV-axpy | MxV-axpy |
m | n | (Mflop/s) | (Mflop/s) | (Mflop/s) |
--------------------------------------------------
100 | 100 | 101.9 | 401.9 | 1053. |
200 | 200 | 227.4 | 421.1 | 1092. |
500 | 500 | 205.4 | 411.9 | 857.5 |
1000 | 1000 | 205.6 | 274.5 | 746.8 |
2000 | 2000 | 66.9 | 207.9 | 730.2 |
-------------------------------------------------
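The two kernels differ only in loop order. In column-major FORTRAN storage,
the ddot form strides through the matrix by m in its inner loop, while the
axpy form is unit stride, which is consistent with the ddot numbers
collapsing once the matrix no longer fits in cache. A hedged C sketch of
both (column-major indexing to mirror the FORTRAN):

    /* y = A*x, A is m x n, column-major as in FORTRAN: A(i,j) = a[i + j*m] */

    /* ddot form: one dot product per row; inner loop strides a[] by m */
    void mxv_ddot(int m, int n, const double *a, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++)
                s += a[i + j*m] * x[j];
            y[i] = s;
        }
    }

    /* axpy form: one column update per x(j); inner loop is unit stride */
    void mxv_axpy(int m, int n, const double *a, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) y[i] = 0.0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++)
                y[i] += a[i + j*m] * x[j];
    }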
The following table compares the single
processor performance (Mflops) of the
Alpha and IBMs for the Euroben mod2g,
a 2-D Haar wavelet transform test.
|------------------------------------------------------
| Order | alpha | SP3 | SP4 |
| n1 | n2 | (Mflop/s) | (Mflop/s) | (Mflop/s) |
|------------------------------------------------------
| 16 | 16 | 142.56 | 79.629 | 126.42 |
| 32 | 16 | 166.61 | 96.690 | 251.93 |
| 32 | 32 | 208.06 | 115.43 | 301.15 |
| 64 | 32 | 146.16 | 108.74 | 297.26 |
| 64 | 64 | 111.46 | 111.46 | 278.45 |
| 128 | 64 | 114.93 | 101.49 | 251.90 |
| 128 | 128 | 104.46 | 97.785 | 244.45 |
| 256 | 128 | 86.869 | 64.246 | 179.43 |
| 256 | 256 | 71.033 | 44.159 | 103.52 |
| 512 | 256 | 65.295 | 41.964 | 78.435 |
|------------------------------------------------------
The following plots the performance (Mflops) of
Euroben mod2b, a dense linear system test,
for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl).
The following plots the performance (Mflops) of
Euroben mod2d, a dense eigenvalue test,
for both optimized FORTRAN and using the BLAS from the vendor library.
For the Alpha, -O4 optimization failed, so this data uses -O3.
The following plots the performance (iterations/second) of
Euroben mod2e, a sparse eigenvalue test.
We ran a number of tests of the NCAR CCM
Column Radiation Model (CRM), using different compiler options, libraries,
and problem sizes (300+ test cases).
We used the SP3 executables on the SP4; the SP4
was 1.1 to 2.6 times faster than the SP3.
The FORTRAN code is dominated by exponentials and square roots.
Also see Worley PSTSW climate code
test results.
EuroBen's mod3a tests matrix computation with file I/O (out
of core).
The following tables compare the Alpha with the IBM.
The run was made using /tmp, and
no attempt was made to optimize I/O performance.
Mod3a: Out-of-core Matrix-vector multiplication
Alpha
--------------------------------------------------------------------------
Row | Column | Exec. time | Mflop rate | Read rate | Write rate |
(n) | (m) | (sec) | (Mflop/s) | (MB/s) | (MB/s) |
--------------------------------------------------------------------------
25000 | 20000 | 0.40751E-01| 24.539 | 226.92 | 62.082 |
50000 | 20000 | 0.80691E-01| 24.786 | 225.43 | 68.371 |
100000 | 100000 | 0.43051 | 46.455 | 250.62 | 73.322 |
250000 | 100000 | 1.4878 | 33.607 | 253.08 | 78.265 |
--------------------------------------------------------------------------
SP3
--------------------------------------------------------------------------
25000 | 20000 | .34146 | 2.9286 | 253.80 | .72046 |
50000 | 20000 | .74303 | 2.6917 | 255.00 | .62662 |
100000 | 100000 | 1.4190 | 14.094 | 248.45 | .88511 |
250000 | 100000 | 3.5659 | 14.021 | 248.31 | 1.1102 |
--------------------------------------------------------------------------
p690
--------------------------------------------------------------------------
25000 | 20000 | .16075 | 6.2207 | 39.468 | 109.57 |
50000 | 20000 | .25080 | 7.9744 | 497.94 | 1.8376 |
100000 | 100000 | .72657 | 27.526 | 160.30 | 3.9978 |
250000 | 100000 | 1.0282 | 48.626 | 400.84 | 13.699 |
--------------------------------------------------------------------------
This should not be considered a rigorous test of the I/O subsystem.
SHARED-MEMORY BENCHMARKS
Both the Alpha and IBMs consist of a cluster of shared-memory nodes,
each node with four processors sharing a common memory (16 for sp4).
We tested the performance of a shared-memory node with various
C programs using explicit thread calls and with FORTRAN OpenMP codes.
The following table shows the performance of thread create/join in C as the
master thread creates two, three, and four threads.
The test repeatedly creates and joins threads (a sketch follows the table).
threads alpha sp3 sp4
2 47.7 96 44
3 165 152 68
4 251 222 97
thread create/join time in microseconds (C)
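A minimal sketch of such a create/join test (not the exact benchmark
source):

    #include <stdio.h>
    #include <pthread.h>
    #include <sys/time.h>

    #define NTHREADS 4
    #define REPS 1000

    static void *work(void *arg) { return arg; }  /* empty body: overhead only */

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int r = 0; r < REPS; r++) {
            for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, work, NULL);
            for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        }
        gettimeofday(&t1, NULL);

        /* time to create and join all NTHREADS threads, per iteration */
        printf("create/join: %.1f us per iteration\n",
               ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec)) / REPS);
        return 0;
    }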
Often it is more efficient to create the threads once and then
provide them work as needed.
I suspect this is what FORTRAN OpenMP does for "parallel do".
The following table shows the performance of parallel do.
Revised 9/8/03
threads alpha sp3 sp4
2 2.1 12.7 6.3
3 3.4 15.3 8.4
4 5.2 19.5 9.4
OpenMP parallel DO (us)
Notice that the performance is much better than the explicit thread calls.
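A minimal sketch of measuring that reuse (C/OpenMP here rather than the
FORTRAN used above; most runtimes park the team between regions rather than
recreating threads):

    #include <stdio.h>
    #include <omp.h>

    #define REPS 100000
    static volatile double x[4];

    int main(void)
    {
        double t = omp_get_wtime();
        for (int r = 0; r < REPS; r++) {
            #pragma omp parallel for   /* team is reused, not recreated */
            for (int i = 0; i < 4; i++)
                x[i] += 1.0;
        }
        printf("parallel for overhead: %.2f us\n",
               (omp_get_wtime() - t) * 1e6 / REPS);
        return 0;
    }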
We've also done some testing with the
OpenMP microbenchmarks.
The following compares OpenMP performance between the sp4 and the SGI Altix.
The following table shows the time required to lock and unlock
with pthread_mutex_lock for various numbers of threads (a sketch follows
the table).
For the IBMs we use setenv SPINLOOPTIME 5000.
threads alpha sp3 sp4
1 0.26 0.6 0.3
2 1.5 1.4 1.3
3 17.8 2.1 1.6
4 29.6 2.9 3.8
time for lock/unlock (us)
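A minimal sketch of the contention test (not the exact source; REPS and the
trivial critical section are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <sys/time.h>

    #define REPS 1000000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile long counter = 0;

    static void *spin(void *arg)
    {
        for (int i = 0; i < REPS; i++) {
            pthread_mutex_lock(&lock);
            counter++;                      /* trivial critical section */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 2;   /* at most 32 */
        pthread_t tid[32];
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_create(&tid[i], NULL, spin, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);
        gettimeofday(&t1, NULL);

        printf("%d threads: %.2f us per lock/unlock\n", nthreads,
               ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec)) / REPS);
        return 0;
    }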
The following table compares the performance of a simple C barrier
program that uses a single lock and spins on a shared variable,
calling pthread_yield while waiting (a sketch follows the table).
A version based on condition variables was an order of magnitude slower.
threads alpha sp3 sp4
1 0.25 0.6 0.3
2 1.36 4.4 1.9
3 9.9 20.5 3.1
4 65 34.6 3.7
C barrier times (us)
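A minimal sketch along those lines (sched_yield() standing in for the
non-portable pthread_yield):

    #include <stdio.h>
    #include <pthread.h>
    #include <sched.h>
    #include <sys/time.h>

    #define NTHREADS 4
    #define REPS 10000

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static volatile int count = 0, generation = 0;

    /* one lock plus a spin on a shared "generation" counter; sched_yield()
       surrenders the cpu while spinning */
    static void barrier(void)
    {
        pthread_mutex_lock(&m);
        int gen = generation;
        if (++count == NTHREADS) {         /* last thread in releases the rest */
            count = 0;
            generation++;
            pthread_mutex_unlock(&m);
            return;
        }
        pthread_mutex_unlock(&m);
        while (generation == gen)
            sched_yield();
    }

    static void *worker(void *arg)
    {
        for (int i = 0; i < REPS; i++)
            barrier();
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        gettimeofday(&t1, NULL);

        printf("%d threads: %.2f us per barrier\n", NTHREADS,
               ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec)) / REPS);
        return 0;
    }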
The following table illustrates linear speedup for an embarrassingly
parallel integration.
A C code with explicit thread management is compared with FORTRAN
OpenMP (an OpenMP sketch follows the table).
Both used just -O optimization.
                FORTRAN                    C
threads    alpha    sp3    sp4     alpha    sp3    sp4
1 252 102 251 166 52 216
2 502 204 501 331 104 432
3 748 306 752 496 157 648
4 990 408 1002 657 206 864
8 1999 1725
16 3565 3429
rectangle rule (Mflops) -O optimization
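A C/OpenMP sketch of such an integration (computing pi by the midpoint rule;
the flop count per strip is a rough estimate):

    #include <stdio.h>
    #include <omp.h>

    /* rectangle (midpoint) rule for pi = integral of 4/(1+x^2) on [0,1];
       embarrassingly parallel: each thread sums its own strips */
    int main(void)
    {
        const int n = 100000000;
        const double h = 1.0 / n;
        double sum = 0.0, t = omp_get_wtime();

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        t = omp_get_wtime() - t;
        printf("pi ~= %.12f  %.0f Mflops\n", h * sum,
               6.0 * n / t / 1e6);   /* roughly 6 flops per strip */
        return 0;
    }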
The following table illustrates an explicit thread implementation
of Cholesky factorization of a 1000x1000 double precision
matrix in C (-O optimization).
threads alpha sp3 sp4
1 150 125 350
2 269 238 631
3 369 353 1007
4 435 390 1306
cholp 1k matrix factor (mflops) -O optimization
The following table compares FORTRAN OpenMP on the Alpha and SP
for a simple double-precision Jacobi iteration (a C/OpenMP sketch of the
kernel follows the table).
Note that the SP3 slows for 4 threads.
problem size 500x500 1000x1000
threads alpha sp3 sp4 alpha sp3 sp4
1 175 114 247 27 17 62
2 342 284 466 42 27 117
3 503 421 650 50 41 160
4 655 324 850 61 41 198
iterations per second
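A hedged C/OpenMP sketch of the kernel (the benchmark itself is FORTRAN;
boundary setup and convergence testing omitted):

    #include <stdio.h>
    #include <string.h>
    #include <omp.h>

    #define N 1000                    /* the 1000x1000 case from the table */
    static double u[N][N], unew[N][N];

    int main(void)
    {
        int iters = 100;
        double t = omp_get_wtime();
        for (int it = 0; it < iters; it++) {
            #pragma omp parallel for  /* 4-point stencil sweep */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                         u[i][j-1] + u[i][j+1]);
            #pragma omp parallel for  /* copy back for the next sweep */
            for (int i = 1; i < N - 1; i++)
                memcpy(&u[i][1], &unew[i][1], (N - 2) * sizeof(double));
        }
        printf("%.1f iterations/s\n", iters / (omp_get_wtime() - t));
        return 0;
    }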
MESSAGE-PASSING BENCHMARKS
Internode communication can be accomplished with IP, PVM, or MPI.
We report MPI performance over the Alpha Quadrics network and the
IBM SP.
Each node (4 CPUs) shares a single network interface.
However, each CPU is a unique MPI end point, so one can measure
both inter-node and intra-node communication.
The following table summarizes the measured communication characteristics
of the Alpha, SP3, and the SP4.
The SP4 is currently based on the Colony switch via PCI.
                               alpha    sp3    sp4
latency (1 way, us)              5.4   16.3     17
bandwidth (echo, MB/s)           199    139    174  (345 MB/s for dual plane)
bandwidth (exchange, MB/s)       167    180    215  (367 MB/s for dual plane)
MPI within a node (MB/s)         622    512   2186
latency (min, 1 way, us) and bandwidth (MB/s)
                     latency   bandwidth  (min 1-way us; MB/s)
alpha node 5.5 198
alpha cpu 5.8 623
alpha IP-sw 123 77
alpha IP-gigE/1500 76 44
alpha IP-100E 70 11
sp3 node 16.3 139
sp3 cpu 8.1 512
sp4 node 17 174 (PCI/Colony)
sp4 cpu 3 2186
sp3 IP-sw 82 46
sp3 IP-gigE/1500 91 47
sp3 IP-gigE/9000 136 84
sp3 IP-100E 93 12
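The echo numbers above come from a simple ping-pong pattern; a minimal MPI
sketch follows (message size and repetition count are illustrative; the
"exchange" test instead has both ranks send simultaneously, e.g. with
MPI_Sendrecv):

    #include <stdio.h>
    #include <mpi.h>

    #define REPS 1000

    int main(int argc, char **argv)
    {
        int rank, other;
        static char buf[1 << 20];        /* 1 MB message; use 8 bytes for latency */
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                /* run with exactly 2 ranks */

        MPI_Barrier(MPI_COMM_WORLD);
        double t = MPI_Wtime();
        for (int i = 0; i < REPS; i++) { /* echo: rank 0 -> 1 -> 0 round trips */
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, other, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, other, 0, MPI_COMM_WORLD, &st);
            } else {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, other, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, sizeof buf, MPI_CHAR, other, 0, MPI_COMM_WORLD);
            }
        }
        t = MPI_Wtime() - t;
        if (rank == 0)                   /* 2 messages per round trip */
            printf("echo bandwidth %.1f MB/s\n",
                   2.0 * REPS * sizeof buf / t / 1e6);
        MPI_Finalize();
        return 0;
    }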
The following graph shows bandwidth for communication between two
processors on the same node using MPI
from both EuroBen's mod1h and ParkBench comms1.
The SP3 performs better for smaller messages.
The sp4 is presently equipped with a Colony switch for inter-node
communication but is limited by the PCI interface at this time (May, 2002).
The following graph shows bandwidth for communication between two nodes.
The p690 also supports dual rail Colony connections that roughly
doubles the bandwidth for large messages as illustrated in the following
graph.
The following table shows the performance of aggregate communication
operations (barrier, broadcast, sum-reduction) using one processor
per node (N) and all processors on each node (n).
Recall that the sp4 has 16 processors per node (the other systems have
4 per node).
Times are in microseconds. (A sketch of how such timings are gathered
follows the tables.)
mpibarrier (average us)
cpus alpha-N alpha-n sp3-N sp3-n sp4-n
2 7 11 22 10 4
4 7 16 45 20 13
8 8 18 69 157 18
16 9 21 93 230 27
32 11 28 118 329
64 37 145 419
mpibcast (8 bytes)
cpus alpha-N alpha-n sp3-N sp3-n sp4-n
2 9.6 12.5 5.4 6.7 3.2
4 10.4 20.3 9.4 9.4 6.2
8 11.4 28.5 13.4 17.5 8.4
16 12.5 32.9 17.0 20.9 9.8
32 13.8 41.4 19.3 24.1
64 48.7 23.6 30.8
mpireduce (SUM, doubleword)
cpus alpha-N alpha-n sp3-N sp3-n sp4-n
2 9 11 8 9 6
4 190 207 29 133 9
8 623 350 271 484 13
16 1117 604 683 1132 18
32 3176 1991 1613 2193
64 5921 2841 3449
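A minimal sketch of how such collective timings are gathered (averaging over
repetitions; not the exact benchmark source):

    #include <stdio.h>
    #include <mpi.h>

    #define REPS 1000

    int main(int argc, char **argv)
    {
        int rank;
        double x = 1.0, sum, t;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            printf("barrier %.1f us\n", (MPI_Wtime() - t) * 1e6 / REPS);

        t = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("bcast   %.1f us\n", (MPI_Wtime() - t) * 1e6 / REPS);

        t = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            MPI_Reduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("reduce  %.1f us\n", (MPI_Wtime() - t) * 1e6 / REPS);

        MPI_Finalize();
        return 0;
    }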
PARALLEL KERNEL BENCHMARKS
Both ParkBench and EuroBen (euroben-dm) include MPI-based parallel
kernels.
However, the euroben-dm communication model has each process
do all of its sends before issuing any receives (a sketch of the
pattern follows).
On the SP, this model resulted in deadlock for the larger problem
sizes.
The EAGER_LIMIT can be adjusted to make some progress on the SP3, but
the deadlocks could not be completely eliminated.
MPI buffering on the Alpha was adequate.
The maximum MPI buffering on an SP3 node was 64 MB; on the Alpha, 191 MB.
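In outline, the pattern looks like the following hedged sketch (sendbuf,
recvbuf, n, and MAXP are hypothetical names, not the euroben-dm source); a
safe rewrite posts nonblocking receives first:

    /* every rank sends all of its pieces before posting any receive.
       MPI_Send may block once a message exceeds the eager limit, so for
       large problem sizes all ranks sit in MPI_Send and none reaches
       MPI_Recv: deadlock. */
    for (int p = 0; p < np; p++)
        if (p != rank)
            MPI_Send(sendbuf[p], n, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
    for (int p = 0; p < np; p++)
        if (p != rank)
            MPI_Recv(recvbuf[p], n, MPI_DOUBLE, p, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

    /* the safe version: post nonblocking receives before the sends */
    MPI_Request req[MAXP];
    int nr = 0;
    for (int p = 0; p < np; p++)
        if (p != rank)
            MPI_Irecv(recvbuf[p], n, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, &req[nr++]);
    for (int p = 0; p < np; p++)
        if (p != rank)
            MPI_Send(sendbuf[p], n, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
    MPI_Waitall(nr, req, MPI_STATUSES_IGNORE);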
The following table shows MPI parallel performance of the LU benchmark
(64x64x64) for the Alpha and SP.
The first column pair is one processor per node, the second pair
is using all processors per node.
These tests used standard FORTRAN (no vendor libraries).
Nodes CPUs
alpha sp3 alpha sp3 sp4
2 786.08 617.98 762.92 588.16 1377.38
4 1708.1 1387.03 1604.05 1188.02 2660.97
8 3384.03 2561.97 3265.83 2473.80 5310.63
16 6190.89 5593.18 5556.02 4771.66 9531.77
aggregate Mflops
Results for the FT benchmark (CLASS=A) follow.
Nodes CPUs
alpha sp3 alpha sp3 sp4
4 633 465 580 307 1314
8 1198 925 849 553 2351
16 2221 1890 1019 1056 3603
aggregate Mflops
Results for the NAS SP benchmark follow.
Nodes CPUs
alpha sp3 alpha sp3 sp4
4 877 632 734 416 1219
9 2310 1623 1837 1225 2568
16 4344 2920 3143 2252 3939
aggregate Mflops
The following plots the aggregate Mflop performance for ParkBench
QR factorization (MPI) of 1000x1000 double precision matrix.
One can compare the performance of optimized FORTRAN versus the
vendor libraries (cxml/essl), and the difference in performance
when using all processors on a node.
Recall that our SP4 has 16 CPUs sharing memory, so we have included
data (sp3-16 and sp3-16-essl) from the NERSC 16-way SP3 (375 MHz).
The following graph shows the aggregate Mflops for a multi-grid (MG)
kernel from ParkBench/NAS Parallel Benchmark.
This is for a 256x256x256 doubleword grid, comparing MPI, Wallcraft's
co-array version, and OpenMP on the IBM.
Revised 9/3/03.
The following graph shows the aggregate Mflops for a conjugate gradient
(CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP.
Revised 9/22/03
We also ran the OpenMP version of the NAS Parallel Benchmarks (PBN-O-3.0b4).
The following table compares the performance of three of those benchmarks
on the power4 to the NERSC Power3 (seaborg, 16-way shared memory, 375 MHz).
lu.A ft.A sp.A
CPUs sp3 sp4 sp3 sp4 sp3 sp4
2 675 1466 356 1274 427 1300
4 1356 2974 695 2259 868 2379
8 2231 6370 1339 4166 1724 4264
16 2386 12148 2343 6860 2667 7476
aggregate Mflops
compiled with -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=omp -qfixed
One should be cautious when
comparing these sp4 results to the MPI NAS results presented earlier.
Research Sponsors
Mathematical, Information, and Computational Sciences Division,
within the
Office of Advanced Scientific Computing Research of the Office of Science,
Department of Energy.
The application-specific evaluations are also supported by the sponsors of
the individual applications research areas.
thd@ornl.gov