Application and System Benchmarks on PARAM 10000

The PARAM 10000 high performance computing system, commissioned by C-DAC in 1998, is part of the National PARAM Supercomputing Facility (NPSF) at Pune. To assess its expected performance across diverse scientific and business computing applications, the system has been benchmarked on key applications and on internationally accepted benchmarks. Both macro and micro benchmarks have been used to measure the performance of PARAM 10000.

A Macro benchmark measures the performance of a computer system as a whole. It compares different systems with respect to an application class, and is a useful input for the user.

Micro benchmarks tend to be synthetic kernels that measure a specific aspect of a computing system, such as CPU speed, memory speed, I/O speed, operating system performance, or networking.

Benchmark data is usually accurate enough to classify systems into groups of comparable performance on certain standard computing problems, which helps in estimating how a system will perform in a particular problem area.

The objective of the C-DAC benchmarks is to provide a common ground for testing the system, to investigate how effectively real application programs use it, and to evaluate sustained performance with short kernel codes. To this end, several benchmarks and test suites were executed on PARAM 10000. The important benchmarks and their results on PARAM 10000 are presented here.

The following benchmarks were performed: P-COMS, NAS, and LINPACK.

PARAM 10000 has a peak computing power of 100 Giga floating-point operations per second (GFLOPS). Its processors are based on Sun Microsystems' UltraSPARC (RISC) architecture, operate at a clock speed of 300 MHz, and run Solaris 2.6. The system has 36 compute nodes and 4 server nodes. Each node is a quad-processor SMP (shared-memory symmetric multiprocessing) system. Each file server has 4 GB of main memory and each compute node has 2 GB. Each node has 16 KB of L1 cache and 2 MB of L2 cache.
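The quoted peak can be roughly cross-checked from the node count and clock speed. The sketch below assumes that each processor completes 2 floating-point operations per clock cycle; this figure is not stated here, but it is consistent with the 19.2 GFlops peak quoted for 32 processors in the LINPACK section. The function name is illustrative only.

```python
# Rough peak-performance estimate for PARAM 10000.
# FLOPS_PER_CYCLE is an assumption (not stated in the text); it matches the
# 19.2 GFlops peak quoted for 32 processors in the LINPACK section.
CLOCK_HZ = 300e6          # 300 MHz, from the text
FLOPS_PER_CYCLE = 2       # assumed
PROCS_PER_NODE = 4        # quad-processor SMP nodes
COMPUTE_NODES = 36
SERVER_NODES = 4


def peak_gflops(processors):
    """Theoretical peak in GFlops for the given number of processors."""
    return processors * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


total_processors = (COMPUTE_NODES + SERVER_NODES) * PROCS_PER_NODE
print("processors:", total_processors)
print("peak on 32 processors (GFlops):", peak_gflops(32))
print("peak on full system (GFlops):", peak_gflops(total_processors))
```

Under this assumption the full 160-processor system comes out slightly below the rounded figure of 100 GFlops.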

The summary results are shown graphically in Figure 1; details are given in the paragraphs that follow.

Different experiments were performed to extract maximum performance from PARAM 10000 using C-DAC's HPCC (High Performance Computing and Communication) software suite. PARAMNet, an interconnect switch used as a system area network (SAN), was developed by C-DAC for PARAM 10000. It is conceived as a high-speed switched network for cluster computing, built around a high-speed, low-latency switch. The switch uses a cut-through or wormhole approach for routing packets. It has 8 ports, each supporting 400 + 400 Mbits/sec bi-directional bandwidth. The host interfaces are intelligent adapters that use C-DAC's Communication Co-processor (CCP) ASIC as the main block.

In the PARAMNet cluster computing environment, message passing library support is provided over TCP/IP as well as over a lightweight protocol. The system software consists of an optimized MPI (C-DAC MPI) on shared memory and Active Messages (AM) over PARAMNet, provided by a lightweight protocol known as KSHIPRA, developed by C-DAC. The implementation of MPI over AM, using PARAMNet with the CCP-based intelligent Network Interface Card (NIC), therefore plays a major role in boosting performance on clusters. The benchmark performance on PARAM 10000 reflects the combined effects of the workstation architecture, the network protocols, and this MPI implementation.

Figure 1. Performance of LINPACK (hplbench) on PARAM 10000.

Performance is expected to improve further with the next-generation PARAMNet-II designed at C-DAC, which is based on a 16-port switch, with each port supporting 2.5 + 2.5 Gbits/sec bi-directional bandwidth, and on a Network Interface Card based on CCP-III. PARAMNet-II is expected to improve network performance substantially in terms of both bandwidth and latency.
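To put the two link speeds side by side, the following minimal sketch converts the per-port figures to MB/s and compares the two generations. It assumes 8 bits per byte, 1 MB = 10^6 bytes, and no encoding or protocol overhead, none of which is specified here; the helper names are illustrative.

```python
# Raw link-rate comparison of PARAMNet and PARAMNet-II.
# Assumptions: 8 bits per byte, 1 MB = 10**6 bytes, no protocol overhead.
PARAMNET_PORT_MBIT = 400.0     # per direction, from the text
PARAMNET2_PORT_MBIT = 2500.0   # 2.5 Gbits/sec per direction, from the text
PARAMNET_PORTS = 8
PARAMNET2_PORTS = 16


def mbit_to_mbyte(mbit_per_s):
    """Convert a rate in Mbits/sec to MB/s."""
    return mbit_per_s / 8.0


def aggregate_mbyte(ports, port_mbit):
    """One-direction rate in MB/s if every port transmits at full speed."""
    return ports * mbit_to_mbyte(port_mbit)


print("PARAMNet per port (MB/s):", mbit_to_mbyte(PARAMNET_PORT_MBIT))
print("PARAMNet-II per port (MB/s):", mbit_to_mbyte(PARAMNET2_PORT_MBIT))
print("per-port ratio:", PARAMNET2_PORT_MBIT / PARAMNET_PORT_MBIT)
print("PARAMNet switch total (MB/s):",
      aggregate_mbyte(PARAMNET_PORTS, PARAMNET_PORT_MBIT))
print("PARAMNet-II switch total (MB/s):",
      aggregate_mbyte(PARAMNET2_PORTS, PARAMNET2_PORT_MBIT))
```

These are raw link rates only; the bandwidth seen by an application is lower, as the P-COMS measurements show.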

 

P-COMS Benchmarks

Several approaches have been used to measure the performance of MPI point-to-point communications on clusters. The MPI standard provides a flexible environment for developing high performance parallel applications, with several mechanisms for point-to-point and collective communications. The PARAM Communication Overhead Measurement Suites (P-COMS) are a set of test programs, developed by C-DAC, that measure the performance of MPI point-to-point and collective communications on PARAM 10000. The communication model is simple, but it has the considerable advantage of being platform independent, so it can be used to compare the performance of different MPI implementations.

The cost of MPI communication primitives on PARAM 10000 is very low. This results from the combined effects of the workstation architecture, the network protocols, and the implementation of MPI on AM over PARAMNet with an intelligent Network Interface Card (NIC) built on C-DAC's Communication Co-processor (CCP) ASIC. The cost of communication is reduced further by using the optimised MPI (C-MPI) on shared memory. Consequently, the system area network with the C-MPI implementation on PARAM 10000 generally outperforms Fast Ethernet.

The start-up time in microseconds (µs) and the bandwidth in MB/s, which characterize the point-to-point communication overhead, have been measured for small and large message sizes. In a message passing system, latency refers to the cost of setting up a message transmission, or the time taken for an operation. Start-up time is also called latency: the time in microseconds to communicate a zero-byte or short message. A popular method for measuring point-to-point communication (e.g., between processor 0 and processor 1) is the ping-pong scheme. On PARAM 10000, the achieved latency is 25 µs and the bandwidth is 35 MB/s using HPCC software at the application layer. Interestingly, a bandwidth of 40 MB/s has been achieved when all 16 processors (4 nodes) are used with a single PARAMNet switch for a cluster of 8 nodes.
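The relationship between start-up time, bandwidth and message size can be sketched with a simple linear cost model, in which the transfer time is the latency plus the message size divided by the bandwidth. This model is an assumption made for illustration, not the P-COMS measurement code; it uses the 25 microsecond latency and 35 MB/s bandwidth quoted above, takes 1 MB as 10^6 bytes, and all function names are hypothetical.

```python
# Linear point-to-point cost model: time = latency + size / bandwidth.
# The model and all names are illustrative assumptions.
LATENCY_S = 25e-6       # 25 microseconds, from the text
BANDWIDTH_BPS = 35e6    # 35 MB/s, taking 1 MB = 10**6 bytes (assumption)


def one_way_time(round_trip_times):
    """Ping-pong estimate: half of the mean measured round-trip time."""
    return sum(round_trip_times) / len(round_trip_times) / 2.0


def transfer_time(nbytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Modelled time in seconds to deliver a message of nbytes."""
    return latency + nbytes / bandwidth


def effective_bandwidth(nbytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Bytes per second actually achieved for a message of nbytes."""
    return nbytes / transfer_time(nbytes, latency, bandwidth)


def half_bandwidth_size(latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Message size at which half of the asymptotic bandwidth is reached."""
    return latency * bandwidth


# Hypothetical round-trip samples in seconds, not measured values.
print("one-way time (s):", one_way_time([50e-6, 52e-6, 48e-6]))
for size in (0, 1024, 65536, 1048576):
    print(size, "bytes:", transfer_time(size), "s,",
          effective_bandwidth(size) / 1e6, "MB/s")
print("half-bandwidth message size (bytes):", half_bandwidth_size())
```

In this model, short messages are dominated by the start-up time, which is why latency and bandwidth are reported separately for small and large message sizes.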

Figure 2: Performance of Scatterv communication primitive.

The results for the Scatterv communication primitive using HPCC software and Argonne National Laboratory (ANL) mpich are shown in Figure 2. The overhead measurement times for the Allgatherv communication primitive are shown in Figure 3.

Figure 3: Performance of AllGatherv communication primitive.

Figure 4 illustrates the overhead measurement time for the Allreduce communication primitive with HPCC software and ANL mpich. It can be concluded that the measured communication overhead with HPCC software is lower than with ANL mpich over Fast Ethernet.

Figure 4: Performance of Allreduce communication primitive.

The execution time of the Barrier communication primitive using HPCC software is approximately 90 µs on 4 nodes of PARAM 10000 and 890 µs on 16 nodes.

Thus, a customized AM and C-MPI implementation, based on the specific hardware architecture and optimising MPI for shared memory, is a good way to reduce message passing overheads on PARAM 10000. C-MPI also exploits the SMP features of the nodes by using direct memory copy instead of going through an intermediate shared space or the network, which is critical to improving MPI communication performance on PARAM 10000.

 

NAS Benchmarks

The Numerical Aerodynamic Simulation (NAS) Program, based at NASA Ames Research Center, is a large-scale effort in computational aerodynamics, and its benchmarks (NAS version 2.x, Revision 2.3) are generally used for system benchmarking. The NAS Parallel Benchmarks (NPB) suite consists of two major components: five parallel kernel benchmarks (EP, MG, CG, FT, IS) and three simulated application benchmarks (LU, SP, BT). The simulated application benchmarks combine several computations in a manner that resembles the actual order of execution in certain important Computational Fluid Dynamics (CFD) application codes. The performance in MFlops of selected NAS benchmarks using HPCC software is given in Table 1.

Routine   Problem Size   HPCC with PARAMNet (MFlops), by No. of Processors
                               8        16        32
MG        A                  584       869      1311
MG        B                  629       939      1416
LU        A                  772      1502      2619
LU        B                  753      1481      1502

Table 1: Performance of NAS benchmarks on PARAM 10000.
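One way to read Table 1 is in terms of scaling relative to the 8-processor run. The sketch below computes, for each routine and problem class, the speedup over 8 processors and the corresponding relative efficiency (speedup divided by the increase in processor count). The class A and B rows are taken in the order in which they appear in the table, and the function name is illustrative.

```python
# Relative scaling of the Table 1 results, with 8 processors as the baseline.
PROCESSORS = (8, 16, 32)
MFLOPS = {
    ("MG", "A"): (584, 869, 1311),
    ("MG", "B"): (629, 939, 1416),
    ("LU", "A"): (772, 1502, 2619),
    ("LU", "B"): (753, 1481, 1502),
}


def relative_scaling(rates, processors=PROCESSORS):
    """Return (processors, speedup, efficiency) relative to the first entry."""
    base_rate, base_procs = rates[0], processors[0]
    result = []
    for rate, procs in zip(rates, processors):
        speedup = rate / base_rate
        efficiency = speedup / (procs / base_procs)
        result.append((procs, speedup, efficiency))
    return result


for (routine, size), rates in MFLOPS.items():
    for procs, speedup, efficiency in relative_scaling(rates):
        print(routine, size, procs,
              "speedup %.2f" % speedup, "efficiency %.2f" % efficiency)
```

An efficiency of 1.0 would mean that doubling the processors doubles the MFlops rate; values below 1.0 indicate the share lost to communication and other overheads.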

The NAS codes were further optimized using professional performance libraries and code restructuring. This further level of optimization, i.e., tuning to the PARAM 10000 architecture, improves performance by roughly 10% to 15% with HPCC software on 32 processors of PARAM 10000.

 

LINPACK Benchmark

As a yardstick of performance, the 'best' performance as measured by the LINPACK Benchmark is presented. Jack Dongarra introduced the LINPACK Benchmark; a detailed description, together with a list of performance results on a wide variety of computing machines, is available in PostScript form from netlib at http://www.netlib.org. LINPACK was chosen because it is widely used and performance numbers are available for almost all relevant systems. The LINPACK subroutines analyze and solve systems of linear equations by LU factorization. The benchmark is simple and easy to use, yet it is a good indicator of the numerical computing capability of a parallel system. The latest version of the parallel implementation of LINPACK was downloaded and run on PARAM 10000. Details of this benchmark can be found at http://www.netlib.org/benchmark/hpl/.
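For readers unfamiliar with the method, the following is a minimal serial sketch of solving a dense linear system by LU factorization with partial pivoting. It is written in plain Python for illustration only; it is not the LINPACK or hpl code, and the function names are hypothetical.

```python
# Serial LU factorization with partial pivoting, for illustration only.
def lu_factor(a):
    """Factor a square matrix so that P*A = L*U; L and U share one array."""
    n = len(a)
    lu = [list(map(float, row)) for row in a]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k.
        pivot = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        if pivot != k:
            lu[k], lu[pivot] = lu[pivot], lu[k]
            perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]              # multiplier, stored as L[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b using the factors returned by lu_factor."""
    n = len(lu)
    y = [0.0] * n
    for i in range(n):                        # forward substitution, L y = P b
        y[i] = float(b[perm[i]]) - sum(lu[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):            # back substitution, U x = y
        s = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / lu[i][i]
    return x


A = [[2, 1, 1], [4, -6, 0], [-2, 7, 2]]
b = [5, -2, 9]
factors, permutation = lu_factor(A)
print(lu_solve(factors, permutation, b))
```

The parallel benchmark distributes the same factorization across processors; the arithmetic per matrix entry is unchanged.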

The task in the LINPACK Benchmark is to solve a dense system of linear equations. The TOP500 list uses the version of the benchmark that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine. This performance reflects that of a dedicated system solving a dense system of linear equations. Since the problem is very regular, the performance achieved is quite high, and the performance numbers correlate well with peak performance. By measuring the actual performance for different problem sizes n, a user can obtain not only the maximal achieved performance Rmax for the problem size Nmax, but also the problem size N1/2 at which half of Rmax is achieved. These numbers, together with the theoretical peak performance Rpeak, are the figures given in the TOP500.
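The way Rmax, Nmax and N1/2 are derived from timed runs can be sketched as follows. The sketch assumes the conventional LINPACK operation count of 2/3 n^3 + 2 n^2 for a problem of size n (individual implementations may use a slightly different lower-order term), and it takes N1/2 as the smallest measured size that reaches half of Rmax, without interpolation. The timings in the example are hypothetical and are not PARAM 10000 measurements.

```python
# Deriving Rmax, Nmax and N1/2 from (problem size, run time) pairs.
# Assumes the conventional LINPACK operation count 2/3 n^3 + 2 n^2.
def linpack_flops(n):
    """Nominal floating-point operation count for a problem of size n."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2


def gflops(n, seconds):
    """Achieved rate in GFlops for size n solved in the given time."""
    return linpack_flops(n) / seconds / 1e9


def rmax_and_nhalf(measurements):
    """measurements: list of (n, seconds). Returns (Rmax, Nmax, N_half)."""
    rates = sorted((n, gflops(n, t)) for n, t in measurements)
    nmax, rmax = max(rates, key=lambda r: r[1])
    # Smallest measured n whose rate reaches half of Rmax (no interpolation).
    nhalf = next(n for n, r in rates if r >= rmax / 2.0)
    return rmax, nmax, nhalf


# Hypothetical timings for illustration, not PARAM 10000 measurements.
sample_runs = [(1000, 0.5), (2000, 2.0), (4000, 9.0), (8000, 55.0)]
rmax, nmax, nhalf = rmax_and_nhalf(sample_runs)
print("Rmax (GFlops): %.2f at Nmax = %d, N1/2 = %d" % (rmax, nmax, nhalf))
```

With more closely spaced problem sizes, N1/2 could instead be interpolated between the two measurements that bracket half of Rmax.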

A performance of 400 MFlops is achieved on one processor of PARAM 10000. A sustained performance of 9.6 GFlops is achieved on 32 processors, against a peak performance of 19.2 GFlops. Several optimization techniques bring a further improvement, giving a sustained performance of 10.5 GFlops on 32 processors (8 nodes), which is roughly 55% of the peak performance.

For larger configurations, i.e., beyond 32 processors (8 nodes), multiple PARAMNet switches were used to build clusters of 16 nodes (64 processors) and 32 nodes (128 processors). The sustained performance of hpl on 128 processors is 31.5 GFlops, against a peak performance of 76.8 GFlops. For the complete configuration of PARAM 10000, i.e., 100 GFlops peak, the sustained performance is 39.8 GFlops. The results are summarized in Table 2.

No. of Nodes/Processors   Approx. Sustained Performance (GFlops)   Approx. Peak Performance (GFlops)
8/32                      10                                       19
16/64                     19                                       38
32/128                    32                                       76
40/160                    40                                       100

Table 2: Performance of LINPACK benchmark on PARAM 10000.
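The ratio of sustained to peak performance for each configuration can be computed directly from these figures. The sketch below uses the more precise values quoted in the text where they are available (10.5/19.2, 31.5/76.8 and 39.8/100 GFlops) and the rounded Table 2 values for the 64-processor configuration; the function name is illustrative.

```python
# Sustained-to-peak ratio for each LINPACK configuration.
# Tuples are (processors, sustained GFlops, peak GFlops). The 64-processor
# entry uses the rounded Table 2 values; the others use the figures quoted
# in the text.
CONFIGS = [
    (32, 10.5, 19.2),
    (64, 19.0, 38.0),
    (128, 31.5, 76.8),
    (160, 39.8, 100.0),
]


def efficiency(sustained, peak):
    """Fraction of theoretical peak that is actually sustained."""
    return sustained / peak


for procs, sustained, peak in CONFIGS:
    print("%3d processors: %.1f%% of peak"
          % (procs, 100.0 * efficiency(sustained, peak)))
```

The fraction of peak that is sustained falls as the processor count grows, which is the usual effect of communication overhead across multiple switches.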

We expect sustained performance to improve by 25% with PARAMNet-II technology, and we expect these figures to be higher for other optimized applications.
