Dunigan's p690 16p vs 32p

ORNL IBM Power4 (p690) 16 vs 32 processor evaluation

This is a supplement to our earlier p690 evaluation, which was based on a 2-MCM (16-processor) node. This page describes the differences between the 32-processor node (4 MCMs) and the 16-processor node. See also Pat Worley's power4 test results.

NOTE: The tests on the IBM p690 use early versions of the compilers and libraries, and we expect performance to continue to improve with new releases. Large page support and page affinity will be provided soon and will also improve results. These tests were conducted in January 2002.

The stream benchmark is a program that measures main memory throughput for several simple operations. The following shows the output from stream on the 16p and 32p p690.

16p results

Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:      1518.4852     0.3542     0.3536     0.3633
Scale:     1510.7429     0.3557     0.3554     0.3575
Add:       1712.1203     0.4709     0.4704     0.4741
Triad:     1743.8818     0.4623     0.4618     0.4676

32p results

Function   Rate (MB/s)   RMS time   Min time   Max time
Copy:      1478.6464     0.3644     0.3631     0.3812
Scale:     1451.9774     0.3709     0.3698     0.3852
Add:       1642.8126     0.4907     0.4902     0.4947
Triad:     1687.9547     0.4775     0.4771     0.4806

As can be seen, a single-processor run achieves slightly higher memory bandwidth on the 16p node than on the 32p node (roughly 3 to 4 percent across the four operations). This results from the slightly higher memory latency in the 4-MCM (32p) node. Also, as noted above, AIX does not yet support page affinity.
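To put a number on "slightly higher", the relative gap can be computed directly from the reported rates. The following is a minimal Python sketch, not part of the original test suite; the rates are copied from the stream output above, and the helper names (`percent_slower`, `slowdown_table`) are illustrative.

```python
# Single-processor stream rates (MB/s), copied from the 16p and 32p output above.
RATES_16P = {"Copy": 1518.4852, "Scale": 1510.7429, "Add": 1712.1203, "Triad": 1743.8818}
RATES_32P = {"Copy": 1478.6464, "Scale": 1451.9774, "Add": 1642.8126, "Triad": 1687.9547}


def percent_slower(rate_16p, rate_32p):
    """Percentage by which the 32p rate falls below the 16p rate."""
    return 100.0 * (rate_16p - rate_32p) / rate_16p


def slowdown_table(r16=RATES_16P, r32=RATES_32P):
    """Map each stream operation to its 32p-vs-16p slowdown in percent."""
    return {name: percent_slower(r16[name], r32[name]) for name in r16}


if __name__ == "__main__":
    for name, pct in slowdown_table().items():
        print(f"{name:6s} {pct:5.2f}% slower on 32p")
```

With these inputs the gap ranges from about 2.6% (Copy) to about 4% (Add).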

The hint benchmark measures computation and memory efficiency as the problem size increases. The following graph also shows that a single processor test performs more slowly on the 32p p690 than on the 16p. The L1, L2 and L3 cache boundaries are visible.
The lmbench benchmark also shows that the 32p box performs slightly slower on the memory tests than the 16p box.

If we run concurrent stream benchmarks on multiple processors, the 32p box gives higher aggregate throughput because the 4 MCMs provide more paths to memory. The following graph shows the stream triad bandwidth.

It is expected that the power4 will show even higher memory bandwidth when using 16 MB pages.
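For reference, the triad operation and the bookkeeping behind an aggregate figure can be sketched as follows. This is an illustrative Python sketch, not the benchmark itself (stream is compiled code; pure Python would measure interpreter overhead rather than memory bandwidth). It assumes 8-byte doubles and the usual stream convention of counting two loads and one store per triad element. Taking the aggregate as the sum of per-process rates is an assumption about how concurrent runs are combined, and all function names and input values are hypothetical.

```python
BYTES_PER_DOUBLE = 8  # assumption: arrays hold 8-byte double-precision values


def triad(a, b, c, q):
    """Triad kernel: a[i] = b[i] + q * c[i], updating a in place."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_rate_mb_s(n, seconds):
    """MB/s for one triad pass over n elements (two loads + one store each)."""
    bytes_moved = 3 * BYTES_PER_DOUBLE * n
    return bytes_moved / seconds / 1.0e6


def aggregate_rate_mb_s(per_process_rates):
    """Aggregate throughput, taken here as the sum of per-process rates."""
    return sum(per_process_rates)


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the p690.
    rate = triad_rate_mb_s(n=1_000_000, seconds=0.024)
    print(f"one process: {rate:.1f} MB/s")
    print(f"four processes: {aggregate_rate_mb_s([rate] * 4):.1f} MB/s")
```

Per-process rates generally drop as more processes compete for memory, so the aggregate grows more slowly than the process count; additional paths to memory reduce that contention.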

Most of the computational benchmarks show little difference between the 16p and 32p boxes because most execute out of the L2 and L3 caches.

MESSAGE-PASSING BENCHMARKS

The MPI intra-node performance of the 16p and 32p nodes was about the same, but the throughput for large messages was somewhat worse than what we measured a few months ago. The following graph shows the bandwidth for communication between two processors on the same node using MPI, as measured by both EuroBen's mod1h and ParkBench's comms1.

We re-ran some of the MPI parallel kernel benchmarks, but they use smaller message sizes and did not exhibit any degradation from the earlier tests.
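As a reminder of how a two-processor bandwidth figure is typically derived, the sketch below converts a round-trip time into a one-way bandwidth. It assumes a ping-pong style test, in which a message is sent to the partner and echoed back, so the one-way time is half the round trip. The function name and the sample values are hypothetical, and the exact timing conventions of mod1h and comms1 are not reproduced here.

```python
def pingpong_bandwidth_mb_s(message_bytes, round_trip_seconds):
    """One-way bandwidth in MB/s from a ping-pong round-trip time.

    Assumes the message crosses the link twice per round trip, so the
    one-way transfer time is half of the measured round trip.
    """
    if round_trip_seconds <= 0:
        raise ValueError("round-trip time must be positive")
    one_way_seconds = round_trip_seconds / 2.0
    return message_bytes / one_way_seconds / 1.0e6


if __name__ == "__main__":
    # Hypothetical sizes and timings, not measurements from the p690.
    for size, rtt in [(1_000, 0.00004), (1_000_000, 0.002)]:
        print(f"{size:>9d} bytes: {pingpong_bandwidth_mb_s(size, rtt):8.1f} MB/s")
```

For small messages the fixed per-message latency dominates the round trip, so a degradation that appears only at large message sizes would not be visible in kernels that exchange small messages.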