SGI recently introduced the Altix 3700. In contrast to previous SGI systems, the Altix uses a modified version of the open-source Linux operating system and the latest Intel IA-64 processors, the Intel Itanium 2. The Altix also uses the next-generation SGI interconnect, NUMAlink3, and the NUMAflex architecture, which together provide a cache-coherent, NUMA, shared-memory multiprocessor system. In this paper, we present a performance evaluation of the SGI Altix using microbenchmarks, kernels, and mission applications. We find that the Altix provides many advantages over other non-vector machines and that it is competitive with the Cray X1 on a number of kernels and applications. The Altix also shows good scaling, and its globally shared memory allows users convenient parallelization with OpenMP or pthreads.
The Cray X1 supercomputer is a distributed shared memory vector multiprocessor, scalable to 4096 processors and up to 65 terabytes of memory. The X1's hierarchical design uses the basic building block of the multi-streaming processor (MSP), which is capable of 12.8 GF/s for 64-bit operations. The distributed shared memory (DSM) of the X1 presents a 64-bit global address space that is directly addressable from every MSP, with an interconnect bandwidth per computation rate of one byte per floating point operation. Our results show that this high bandwidth and low latency for remote memory accesses translate into improved performance on important applications. Furthermore, this architecture naturally supports programming models like the Cray SHMEM API, Unified Parallel C, and Co-Array Fortran. As our benchmarks demonstrate, selecting the appropriate model is important for exploiting these features.
The Cray X1 supercomputer is a distributed shared memory vector multiprocessor, scalable to 4096 processors and up to 65 terabytes of memory. The shared memory provides high bandwidth and low latency in a globally-shared address space. The memory model supports SHMEM, co-arrays, and UPC as well as MPI. This paper describes the architecture and performance of the X1's distributed shared memory. Each layer of the memory hierarchy is described as well as the interconnecting fabric. Performance data are provided for the various programming models, using micro-benchmarks and results from porting and tuning a large application code.
Oak Ridge National Laboratory installed a 32 processor Cray X1 in March 2003, and will have a 256 processor system installed in October 2003. In this paper, we describe our initial evaluation of the X1 architecture, focusing on microbenchmarks, kernels, and application codes that highlight the performance characteristics of the X1 architecture and indicate how to use the system most efficiently.
This interim status report documents progress in the evaluation of the Cray X1 supercomputer at the Center for Computational Sciences at Oak Ridge National Laboratory.
ORNL recently installed a 32 processor Cray X1. In this paper, we describe our initial performance evaluation of the system, including microbenchmark data quantifying processor, memory, and network performance, and kernel data quantifying the impact of different approaches to using the system.
The report highlights the Probe project accomplishments in data mining, network research, and data storage technologies.
An overview of the problems in achieving high end-to-end performance on the Internet.
This document outlines the plan for evaluating the Cray X1 supercomputer at Oak Ridge National Lab.
End-to-end bandwidth estimation tools like Iperf, though fairly accurate, are intrusive. In this paper, we describe how, with an instrumented TCP stack (Web100), we can estimate the end-to-end bandwidth accurately, while consuming significantly less network bandwidth and time. We modified Iperf to use Web100 to detect the end of slow-start and estimate the end-to-end bandwidth by measuring the amount of data sent for a short period (1 second) after the slow-start, when the TCP throughput is relatively stable. We obtained bandwidth estimates differing by less than 10% when compared to running Iperf for 20 seconds, and savings in bandwidth estimation time of up to 94% and savings in network traffic of up to 92%.
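As a rough illustration of the measurement idea, the sketch below sends data on an already connected TCP socket (as Iperf does) and counts what can be pushed during a one-second window after the connection has left slow start; in this sketch a fixed warm-up delay stands in for the Web100-based slow-start detection described above.

/* Minimal sketch of the measurement idea behind the modified Iperf:
 * skip the slow-start phase, then count the bytes the sender can push
 * for about one second of steady-state transfer.  In the real tool the
 * end of slow start is detected from Web100 kernel instrumentation;
 * here a fixed warm-up delay stands in for that detection. */
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* sock is a connected TCP socket to the receiver (as in Iperf). */
double estimate_bandwidth(int sock)
{
    char buf[8192];
    memset(buf, 'x', sizeof(buf));

    double warmup_end = now() + 2.0;          /* stand-in for Web100 slow-start detection */
    while (now() < warmup_end)
        if (write(sock, buf, sizeof(buf)) < 0)
            return -1.0;

    long long bytes = 0;
    double start = now(), window = 1.0;       /* the short post-slow-start window */
    while (now() - start < window) {
        ssize_t n = write(sock, buf, sizeof(buf));
        if (n < 0)
            return -1.0;
        bytes += n;
    }
    return 8.0 * bytes / (now() - start);     /* estimated bits per second */
}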
Many high performance distributed applications require high network throughput but are able to achieve only a small fraction of the available bandwidth. A common cause of this problem is improperly tuned network settings. Tuning techniques, such as setting the correct TCP buffers and using parallel streams, are well known in the networking community, but outside the networking community they are infrequently applied. In this paper, we describe a tuning daemon that uses TCP instrumentation data from the Unix kernel to transparently tune TCP parameters for specified individual flows over designated paths. No modifications are required to the application, and the user does not need to understand network or TCP characteristics.
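The heart of such tuning is simply sizing the socket buffers to the bandwidth-delay product of the path; the daemon described above does this transparently for designated flows, but the sketch below shows the equivalent per-socket calls an application would otherwise have to make.

/* Sketch of the buffer tuning: set the TCP send and receive buffers to
 * roughly the bandwidth-delay product of the path.  The daemon described
 * above applies this without touching the application; this shows the
 * equivalent per-socket setsockopt calls. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

int tune_socket(int sock, double bandwidth_bps, double rtt_sec)
{
    int bdp = (int)(bandwidth_bps / 8.0 * rtt_sec);   /* bandwidth-delay product, bytes */

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp)) < 0 ||
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp)) < 0) {
        perror("setsockopt");
        return -1;
    }
    return 0;
}

For example, a 100 Mb/s path with a 70 ms round-trip time needs roughly 875 KB of buffering per direction, far more than typical operating-system defaults.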
Oak Ridge National Laboratory recently received 24 32-way IBM pSeries 690 SMP nodes. In this paper, we describe our initial evaluation of the p690 architecture, focusing on the performance of benchmarks and applications that are representative of the expected production workload.
This report describes an implementation of a TCP-like protocol that runs over UDP. This TCP-like implementation, which does not require kernel modifications, provides a test harness for evaluating variations of the TCP transport protocol over the Internet. The test harness provides a tunable and instrumented version of TCP that supports Reno, NewReno, and SACK/FACK. The test harness can also be used to characterize the TCP performance of a network path. Several case studies illustrate how one can tune the transport protocol to improve performance.
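The sketch below illustrates the basic mechanism, assuming an illustrative segment header (the field layout of the actual test harness may differ): a TCP-like header is carried inside each UDP datagram so that sequencing, acknowledgment, and congestion control can be implemented entirely in user space.

/* Sketch of the idea: carry a TCP-like segment header inside a UDP
 * datagram so congestion control and loss recovery can be implemented
 * entirely in user space.  Field names here are illustrative, not the
 * actual layout used by the test harness. */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

struct seg_hdr {
    uint32_t seq;        /* sequence number of first data byte */
    uint32_t ack;        /* cumulative acknowledgment */
    uint16_t flags;      /* SYN/FIN/ACK-style control bits */
    uint16_t window;     /* advertised receive window */
};

/* Send one segment (header + payload) as a single UDP datagram. */
ssize_t send_segment(int udp_sock, const struct seg_hdr *hdr,
                     const void *data, size_t len)
{
    char pkt[sizeof(*hdr) + 1460];
    if (len > sizeof(pkt) - sizeof(*hdr))
        return -1;
    memcpy(pkt, hdr, sizeof(*hdr));
    memcpy(pkt + sizeof(*hdr), data, len);
    return send(udp_sock, pkt, sizeof(*hdr) + len, 0);  /* socket already connect()ed */
}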
This report summarizes the second full year of work on the Probe project. The report highlights accomplishments in data mining, network research, and data storage technologies.
Network intruders often use non-standard ports or standard ports in non-standard ways to bypass detection. This report describes techniques and software for collecting, analyzing, and classifying Internet packet flows to assist in intrusion detection. Flows are characterized by packet size, interarrival times, direction, and inter-packet correlations without looking at packet contents. Statistical signatures for known flows are used to classify unknown flows.
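A minimal sketch of such a content-blind signature, with illustrative fields and an illustrative distance metric: per-flow statistics of packet size and interarrival time are accumulated incrementally and then compared against stored signatures of known services.

/* Sketch of a flow signature built only from packet sizes and
 * interarrival times (no payload inspection).  An unknown flow is then
 * matched against stored signatures of known services by a simple
 * distance; fields and metric here are illustrative. */
#include <math.h>

struct flow_sig {
    double mean_size, m2_size;       /* packet-size mean and sum of squared deviations (bytes) */
    double mean_gap,  m2_gap;        /* interarrival-time mean and sum of squared deviations (s) */
};

/* Update running statistics with one observed packet (Welford's method);
 * n is the number of packets seen so far, including this one. */
void observe(struct flow_sig *s, double size, double gap, long n)
{
    double d = size - s->mean_size;
    s->mean_size += d / n;
    s->m2_size   += d * (size - s->mean_size);

    d = gap - s->mean_gap;
    s->mean_gap += d / n;
    s->m2_gap   += d * (gap - s->mean_gap);
}

/* Smaller distance => the unknown flow looks more like the known signature. */
double signature_distance(const struct flow_sig *a, const struct flow_sig *b)
{
    return fabs(a->mean_size - b->mean_size) / 1500.0 +   /* normalize by a typical MTU */
           fabs(a->mean_gap  - b->mean_gap);
}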
This report describes techniques for backtracking Internet packets with spoofed source addresses to their point of origin. Spoofed packets are often used in mounting denial of service attacks on the Internet. Algorithms for determining if routers pass spoofed addresses are also described.
This paper describes the general requirements for an intrusion prevention and detection system and the methods used to prevent and detect intrusions into Oak Ridge National Laboratory's network. It describes actual attacks, how they were detected, and how they were handled, and presents a description of our monitoring tools.
This report describes techniques for transmitting, receiving, and detecting covert Internet transmissions. Various fields in the IP, TCP, UDP, and ICMP headers can be used to carry hidden information. The capacity and detectability of these channels are discussed, and the experience with various software implementations is analyzed.
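As one illustrative channel of the kind discussed, the sketch below hides two bytes of covert data per packet in the IP identification field, which ordinarily carries a nearly arbitrary fragmentation ID; building and sending the raw packet (checksums, SOCK_RAW setup) is omitted, and the actual encodings studied in the report may differ.

/* Sketch of a covert channel in the IP identification field: two bytes
 * of hidden data ride in each packet's fragmentation ID.  Constructing
 * and transmitting the full raw packet is omitted. */
#include <stdint.h>
#include <stddef.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <arpa/inet.h>

/* Sender side: copy the next two covert bytes into the header's ID field. */
void embed_covert(struct ip *hdr, const uint8_t *msg, size_t i)
{
    uint16_t chunk = (uint16_t)((msg[i] << 8) | msg[i + 1]);
    hdr->ip_id = htons(chunk);
}

/* Receiver side: recover the two bytes from a captured header. */
void extract_covert(const struct ip *hdr, uint8_t *out)
{
    uint16_t chunk = ntohs(hdr->ip_id);
    out[0] = chunk >> 8;
    out[1] = chunk & 0xff;
}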
Oak Ridge National Laboratory (ORNL), Sandia National Laboratories (SNL), and Pittsburgh Supercomputing Center (PSC) are in the midst of a project through which their supercomputers are linked via high speed networks. The overall goal of this project is to solve national security and scientific problems too large to run on any single available machine. This paper describes the infrastructure used in the linked computing environment and discusses issues related to porting and running the Locally Self-consistent Multiple Scattering (LSMS) code in the linked environment. In developing a geographically distributed heterogeneous environment of high performance massively parallel processors (MPP) and porting code to it, a variety of problems were encountered and solved. Comparative performance measurements for the LSMS on a single machine and across linked machines are given along with an interpretation of the results.
ORNL and University of Tennessee at Knoxville (UTK) researchers are at the forefront in developing tools and techniques for using high-performance computers efficiently. Parallel Virtual Machine software enables computers connected around the world to work together to solve complex scientific, industrial, and medical problems. The Distributed Object Library makes it easier to program the Intel Paragon supercomputer, recently enabling the ORNL Paragon to break a computational record in molecular dynamics simulations. The Distributed Object Network I/O Library increases the performance in data input and results output on the Intel Paragon, often by as much as an order of magnitude. ORNL and UTK researchers are also leaders in ongoing efforts to standardize the message-passing parallel programming interface.
This report compares the performance of different computer systems for basic message passing. Latency and bandwidth are measured on Convex, Cray, IBM, Intel, KSR, Meiko, nCUBE, NEC, SGI, and TMC multiprocessors. Communication performance is contrasted with the computational power of each system. The comparison includes both shared and distributed memory computers as well as networked workstation clusters.
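The latency and bandwidth figures in such comparisons typically come from a ping-pong (echo) test: one node sends a message of a given size, the partner echoes it back, and half the round-trip time is the one-way time. The sketch below uses MPI for concreteness, whereas the report used each machine's native message-passing library, so this is illustrative rather than the code actually run.

/* Ping-pong sketch: rank 0 sends nbytes to rank 1, which echoes it back;
 * half the averaged round-trip time gives the one-way latency, and
 * nbytes divided by that time gives the bandwidth. */
#include <stdio.h>
#include <mpi.h>

#define REPS 1000

void pingpong(int nbytes, char *buf)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * REPS);
    if (rank == 0)
        printf("%d bytes: %.1f us, %.2f MB/s\n",
               nbytes, one_way * 1e6, nbytes / one_way / 1e6);
}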
This report describes an architecture and implementation for doing group key management over a data communications network. The architecture describes a protocol for establishing a shared encryption key among an authenticated and authorized collection of network entities. Group access requires one or more authorization certificates. The implementation includes a simple public key and certificate infrastructure. Multicast is used for some of the key management messages. An application programming interface multiplexes key management and user application messages. An implementation using the new IP security protocols is postulated. The architecture is compared with other group key management proposals, and the performance and the limitations of the implementation are described.
Internet packet flows are analyzed within the framework of the traditional Poisson process paradigm. Applying the concepts of inhomogeneity, compoundness, and double stochasticity, a simple and intuitively transparent approach is proposed for explaining the expected shape of the observed histograms of counts and the expected behavior of the sample covariance functions.
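The building blocks are the standard Poisson relations; for example, with a time-varying rate the interval count is still Poisson with the integrated rate, and compounding lets each arrival carry a random batch of packets (a sketch of the relations the analysis builds on, not the paper's own notation):

$$P\{N(t)=k\} = \frac{\Lambda(t)^k}{k!}\,e^{-\Lambda(t)}, \qquad \Lambda(t)=\int_0^t \lambda(s)\,ds,$$

and for a compound process $S(t)=\sum_{i=1}^{N(t)} X_i$ with i.i.d. batch sizes $X_i$,

$$E[S(t)] = \Lambda(t)\,E[X], \qquad \mathrm{Var}[S(t)] = \Lambda(t)\,E[X^2].$$

Making $\lambda$ itself random (double stochasticity) further inflates the variance of the counts, which is one route to the burstier-than-Poisson histograms seen in practice.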
This report summarizes communication performance of GigaNet's OC12 ATM interface for the Intel Paragon. One-way latency of 41 microseconds and bandwidth of 68 MB/s (full OC12) are measured using GigaNet's AAL5 API between two Paragons. Performance is compared with Ethernet, HiPPI, and the Paragon's native message-passing facility.
This research investigates techniques for providing privacy, authentication, and data integrity to PVM (Parallel Virtual Machine). PVM is extended to provide secure message passing with no changes to the user's PVM application; optionally, security can be applied on a message-by-message basis. Diffie-Hellman is used for key distribution of a single session key for n-party communication. Keyed MD5 is used for message authentication, and the user may select from various secret-key encryption algorithms for message privacy. The modifications to PVM are described, and the performance of secure PVM is evaluated.
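A sketch of a keyed-MD5 authentication tag of the kind attached to each message, using OpenSSL's MD5 routines; the exact keyed construction used by secure PVM may differ (a modern equivalent would be HMAC-MD5).

/* Sketch of a keyed-MD5 message authentication tag: hash the shared
 * session key together with the message body and append the 16-byte
 * tag to the message.  The precise construction used by secure PVM
 * may differ; this shows the general technique. */
#include <stddef.h>
#include <openssl/md5.h>

#define KEY_LEN 16

void md5_tag(const unsigned char key[KEY_LEN],
             const unsigned char *msg, size_t msglen,
             unsigned char tag[MD5_DIGEST_LENGTH])
{
    MD5_CTX ctx;
    MD5_Init(&ctx);
    MD5_Update(&ctx, key, KEY_LEN);   /* prepend the shared session key */
    MD5_Update(&ctx, msg, msglen);    /* then the message body          */
    MD5_Final(tag, &ctx);             /* 16-byte authentication tag     */
}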
This report describes a 1994 demonstration implementation of PVM that uses IP multicast. PVM's one-to-many unicast implementation of its pvm_mcast() function is replaced with reliable IP multicast. Performance of PVM using IP multicast over local and wide-area networks is measured and compared with the original unicast implementation. Current limitations of IP multicast are noted.
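From the application's point of view nothing changes: the same pvm_mcast() call is used, only its transport differs. A minimal usage sketch (the task IDs and message tag are illustrative):

/* Broadcast an integer array to a group of PVM tasks with pvm_mcast();
 * in the modified PVM this maps onto a single reliable IP multicast
 * instead of a sequence of one-to-one sends. */
#include <pvm3.h>

#define MSGTAG 7

void broadcast_work(int *tids, int ntask, int *work, int nwork)
{
    pvm_initsend(PvmDataDefault);        /* start a new send buffer           */
    pvm_pkint(work, nwork, 1);           /* pack the data, stride 1           */
    pvm_mcast(tids, ntask, MSGTAG);      /* one call reaches all ntask tasks  */
}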
This report summarizes the beta testing of Chen System's eight processor CHEN-1000 server at Oak Ridge National Laboratory in the fall of 1995. The performance of the shared-memory multiprocessor is measured and compared with other shared-memory multiprocessors.
This report summarizes the third phase of a Cooperative Research and Development Agreement between Oak Ridge National Laboratory and Intel in evaluating a 28-node Intel Paragon MP system. An MP node consists of three 50-MHz i860XPs sharing a common bus to memory and to the mesh communications interface. The performance of the shared-memory MP node is measured and compared with other shared-memory multiprocessors. Bus contention is measured between processors and with message passing. Recent improvements in message passing and I/O are also reported.
Experiences and performance figures are reported from early tests of the 512-node Intel Paragon XPS35 at Oak Ridge National Laboratory. Computation performance of the 50 MHz i860XP processor as well as communication performance of the 200 megabyte/second mesh are reported and compared with other multiprocessors. Single and multiple hop communication bandwidths and latencies are measured. Concurrent communication speeds and speed under network load are also measured. File I/O performance of the mesh-attached Parallel File System is measured. Early experiences with OSF/Mach and SUNMOS operating systems are reported, as well as results from porting various distributed-memory applications. This report also summarizes the second phase of a Cooperative Research and Development Agreement between Oak Ridge National Laboratory and Intel in evaluating a 66-node Intel Paragon XPS5.
Performance of the hierarchical shared-memory system of the Kendall Square Research multiprocessor is measured and characterized. The performance of prefetch is measured. Latency, bandwidth, and contention are analyzed on a 4-ring, 128 processor system. Scalability comparisons are made with other shared-memory and distributed-memory multiprocessors.
Algorithms for synchronizing the times and frequencies of the clocks of Intel and Ncube hypercube multiprocessors are presented. Bounds for the error in estimating clock offsets and frequencies are formulated in terms of the clock read error and message transmission time. Clock and communication performance of the Ncube and Intel hypercubes are analyzed, and performance of the synchronization algorithms is presented.
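For reference, the standard echo-based offset estimate shows how the message transmission time and the clock read error enter such bounds (this is the generic formulation, not necessarily the exact one in the report): if a node reads its clock at local times $t_1$ (send) and $t_2$ (receive) around an echo exchange, and the remote node returns its clock reading $t_r$, then

$$\hat{\theta} = t_r - \frac{t_1 + t_2}{2}, \qquad |\hat{\theta} - \theta| \le \frac{t_2 - t_1}{2} + \epsilon,$$

where $\theta$ is the true clock offset and $\epsilon$ is the clock read error. Repeating the exchange over time and fitting a line to the estimated offsets yields the relative clock frequency.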
Initial performance results and early experiences are reported for the Kendall Square Research multiprocessor. The basic architecture of the shared-memory multiprocessor is described, and computational and I/O performance is measured for both serial and parallel programs. Experiences in porting various applications are described.
The communication performance of the i860-based Intel DELTA mesh supercomputer is compared with the Intel iPSC/860 hypercube and the Ncube 6400 hypercube. Single and multiple hop communication bandwidth and latencies are measured. Concurrent communication speeds and speed under network load are also measured. File I/O performance of the mesh-attached Concurrent File System is measured.
The performance of the Intel iPSC/860 hypercube and the Ncube 6400 hypercube is compared with that of earlier hypercubes from Intel and Ncube. Computation and communication performance for a number of low-level benchmarks are presented for the Intel iPSC/1, iPSC/2, and iPSC/860 and for the Ncube 3200 and 6400. File I/O performance of the iPSC/860 and Ncube 6400 is compared.
The structure and use of a remote host facility for controlling application programs on an Intel hypercube are described. The facility permits an alternate UNIX host, such as a graphics workstation or supercomputer, connected by a TCP/IP network to the Intel cube manager processor to communicate with application programs running on the hypercube nodes.
We describe the structure and use of a distributed computing environment based on a hypercube programming model that runs on most UNIX systems with TCP/IP network support. The package uses a library of message-passing routines and multiple user processes, written in either C or FORTRAN, to provide an environment for the development and testing of algorithms for hypercube parallel processors simulated by a collection of computers on a local area network.
The performance of four commercially available hypercube parallel processors is analyzed. Computation and communication performance for a number of low-level benchmarks are presented for the Ametek S14 hypercube, the Intel iPSC/1 hypercube, the Ncube hypercube, and the second generation Intel iPSC/2 hypercube.
The performance of three commercially available hypercube parallel processors is analyzed. Computation and communication performance for a number of low-level benchmarks are presented for the Ametek S14 hypercube, the Intel iPSC hypercube, and the Ncube hypercube.
The structure and use of a hypercube simulator that runs on most UNIX systems are described. The simulator uses a library of message-passing routines and multiple user processes, written in either C or FORTRAN, to provide an environment for the development and testing of algorithms for hypercube parallel processors.
The structure and use of a simulator for the Denelcor HEP multiprocessor are described. The simulator provides a multitasking environment for the development of parallel programs in C or FORTRAN using a library of subroutines that simulate the parallel programming constructs available on the HEP, a shared-memory multiprocessor. The simulator also provides a trace file that can be used for debugging, performance analysis, or graphical display.