The PARAM 10000 High Performance Computing System, commissioned
by C-DAC in 1998, is a part of the National PARAM Supercomputing
Facility (NPSF) at Pune.
To assess the expected performance of the system across
diverse scientific and business computing applications,
the system has been benchmarked on certain key applications
and on internationally accepted benchmarks. Both macro and
micro benchmarks have been used to measure the performance of
PARAM 10000.
A Macro benchmark
measures the performance of a computer system as a whole.
It compares different systems with respect to an application
class, and is a useful input for the user.
Micro benchmarks
tend to be synthetic kernels that measure a
specific aspect of a computing system, such as CPU speed,
memory speed, I/O speed, operating system performance, or
networking.
Usually the benchmark data
is accurate enough to classify systems
into groups of comparable systems according to their performance
on certain standard computing problems, which helps in estimating
the performance of systems for a particular problem area.
The objective of the C-DAC
benchmark is to provide a common ground for testing the system,
to investigate the effective use of real application
programs, and to employ short kernel codes to evaluate
the sustained performance. To this end,
several benchmarks and test suites were executed on
PARAM 10000. The important benchmarks and their performance
on PARAM 10000 are presented here.
The following benchmarks were
performed: P-COMS, NAS and LINPACK.
PARAM 10000 has a peak computing
power of 100 Giga floating-point operations per second (GFLOPS).
The processors are based on Sun Microsystems' UltraSPARC
(RISC) architecture, each operating at a clock speed of
300 MHz, and run Solaris 2.6. The system has 36 compute nodes
and 4 server nodes. Each node is a quad-processor SMP (shared
memory, symmetric multiprocessing) system. Each file server
has 4 GB of main memory, while each compute node has 2 GB. Each
node has 16 KB L1 cache and 2 MB L2 cache.
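The quoted peak figure can be checked with simple arithmetic. The sketch below assumes that each processor completes two floating-point operations per clock cycle (one add and one multiply). That factor is not stated in the text; it is an assumption, chosen because it is consistent with the 19.2 GFlops peak quoted for 32 processors in the LINPACK section.

```python
# Peak-performance estimate for PARAM 10000 from the configuration above.
# FLOPS_PER_CYCLE is an assumption, not a figure given in the text.
NODES = 36 + 4           # compute nodes + server nodes
CPUS_PER_NODE = 4        # quad-processor SMP nodes
CLOCK_HZ = 300e6         # 300 MHz
FLOPS_PER_CYCLE = 2      # assumed: one FP add + one FP multiply per cycle


def peak_gflops(processors):
    """Theoretical peak in GFLOPS for the given number of processors."""
    return processors * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


total_cpus = NODES * CPUS_PER_NODE
print(f"{total_cpus} processors: {peak_gflops(total_cpus):.1f} GFLOPS peak")
print(f"32 processors: {peak_gflops(32):.1f} GFLOPS peak")
```

Under this assumption the full 160-processor system comes to 96 GFLOPS, which rounds to the quoted figure of 100 GFLOPS.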
The summary results are
shown graphically in Figure 1 below,
while details are given in the succeeding paragraphs.
Different experiments were
performed to extract maximum performance on PARAM 10000
using C-DAC's HPCC (High Performance Computing and Communication)
software suite. PARAMNet, an interconnect switch serving as
a system area network (SAN), was developed by C-DAC for
use with PARAM 10000. Its architecture is conceived as a
high-speed switched network built around a high-speed,
low-latency switch for cluster computing. In the PARAMNet cluster
computing environment, message passing library support is
provided over TCP/IP as well as over a lightweight protocol. The
system software consists of optimized MPI (C-DAC MPI) on
shared memory and Active Messages (AM) over PARAMNet, provided
by a lightweight protocol known as KSHIPRA, developed by
C-DAC. The host interfaces are intelligent adapters that
employ C-DAC's Communication Co-processor (CCP) ASIC as
the main block. Thus, the implementation of MPI over AM
using PARAMNet with the CCP plays a major role in boosting
performance on clusters. The benchmark performance on PARAM
10000 reflects the combined effects of the workstation
architecture, the network protocols, and the implementation of MPI
over AM using PARAMNet with an intelligent Network Interface
Card (NIC) based on the CCP. The switch uses a cut-through, or
wormhole, approach for routing packets. It has 8 ports,
with each port supporting 400 + 400 Mbits/sec bi-directional
bandwidth.

Figure 1. Performance
of LINPACK (hplbench) on PARAM 10000.
Performance is expected to improve
further with the next-generation PARAMNet-II designed
at C-DAC, which is based on a 16-port switch, with each port
supporting 2.5 + 2.5 Gbits/sec bi-directional bandwidth, and
a Network Interface Card based on CCP-III. PARAMNet-II
is also expected to boost network performance substantially
in terms of both bandwidth and latency.
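The per-port link rates quoted for PARAMNet and PARAMNet-II can be converted to megabytes per second as follows. This is only a back-of-the-envelope sketch: it assumes 8 bits per byte, decimal megabytes, and no encoding or protocol overhead, none of which is stated in the text.

```python
# Per-direction link rates in bits per second, as quoted in the text.
PARAMNET_BPS = 400e6       # 400 Mbits/sec per direction
PARAMNET2_BPS = 2.5e9      # 2.5 Gbits/sec per direction


def mbytes_per_sec(bits_per_sec):
    """Convert a raw link rate to MB/s (assumes 8 bits/byte, 1 MB = 1e6 bytes)."""
    return bits_per_sec / 8 / 1e6


print(f"PARAMNet:    {mbytes_per_sec(PARAMNET_BPS):.1f} MB/s per direction")
print(f"PARAMNet-II: {mbytes_per_sec(PARAMNET2_BPS):.1f} MB/s per direction")
print(f"Raw link-rate ratio: {PARAMNET2_BPS / PARAMNET_BPS:.2f}x")
```

These are raw link rates only; the bandwidth seen by an application also depends on the NIC, the protocol stack, and the MPI implementation.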

P-COMS Benchmarks
Several approaches have
been used to measure the performance of MPI point-to-point
communications on clusters. The MPI standard provides a flexible
environment for developing high performance parallel applications
with several mechanisms for point-to-point and collective
communications. The PARAM Communication Overhead
Measurement Suites (P-COMS) are a set
of test programs, developed by C-DAC, that measure the performance
of MPI point-to-point and collective communications on PARAM
10000. This communication model is simple, but it has the
considerable advantage of being platform independent, so it can
be used to compare the performance of different MPI implementations.
The cost of MPI communication
primitives on PARAM 10000 is very low. This results from
the combined effects of the workstation architecture, the network
protocols, and the implementation of MPI on AM over PARAMNet
with an intelligent Network Interface Card (NIC) and C-DAC's
Communication Co-processor (CCP) ASIC. The cost
of communication is reduced further by using optimized MPI
(C-MPI) on shared memory. Consequently, the system area
network and C-MPI implementation on PARAM 10000 are generally
superior to Fast Ethernet.
The start-up time
in microseconds (µs) and the bandwidth in MB/s, for small and large message
sizes, have been measured; together they characterize the
point-to-point communication overhead. Latency, in the sense
of a message passing system, refers to the cost of setting up
a message transmission or the time taken for an operation. Start-up
time is also called latency: the time in microseconds
to communicate a zero-byte or short message. A popular method
for measuring point-to-point communication (e.g., between
processor 0 and processor 1) is the ping-pong scheme. On
PARAM 10000, the achieved latency is 25 µs and the bandwidth is 35 MB/s using HPCC software at
the application layer. Interestingly, a bandwidth of 40
MB/s has been achieved when all 16 processors (4 nodes)
are used with a single PARAMNet switch in a cluster of
8 nodes.
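The ping-pong scheme can be illustrated without MPI. The following is a minimal sketch in Python that uses a local socket pair and a thread as stand-ins for processor 0 and processor 1. It is not the P-COMS code, and the function names are hypothetical. One side sends a message, the other echoes it back, and half of the mean round-trip time is taken as the one-way time.

```python
import socket
import threading
import time


def _recv_exact(sock, n):
    """Receive exactly n bytes; a single recv call may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def ping_pong(msg_bytes, reps=200):
    """Return the mean one-way time in seconds for messages of msg_bytes (>= 1)."""
    ping, pong = socket.socketpair()

    def echo():
        # "Processor 1": receive each message and send it straight back.
        for _ in range(reps):
            pong.sendall(_recv_exact(pong, msg_bytes))

    worker = threading.Thread(target=echo)
    worker.start()
    payload = b"x" * msg_bytes
    start = time.perf_counter()
    for _ in range(reps):
        # "Processor 0": send, then wait for the echo.
        ping.sendall(payload)
        _recv_exact(ping, msg_bytes)
    elapsed = time.perf_counter() - start
    worker.join()
    ping.close()
    pong.close()
    return elapsed / (2 * reps)


latency_s = ping_pong(1)                 # short message approximates start-up time
large = 1 << 20                          # 1 MiB message for the bandwidth estimate
bandwidth_mb_s = large / ping_pong(large, reps=20) / 1e6
print(f"latency ~ {latency_s * 1e6:.1f} us, bandwidth ~ {bandwidth_mb_s:.0f} MB/s")
```

The numbers printed by this sketch describe the local machine's socket layer, not PARAMNet; only the measurement pattern carries over. A one-byte message is used for the latency estimate because a zero-byte send on a stream socket transfers nothing.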

Figure 2: Performance
of Scatterv communication primitive.
The results for the Scatterv
communication primitive using HPCC software and Argonne
National Laboratory (ANL) mpich are shown in Figure
2. The overhead measurement time for the Allgatherv communication
primitive is shown in Figure 3.

Figure 3: Performance
of AllGatherv communication primitive.
Figure 4 illustrates the
overhead measurement time for the Allreduce communication
primitive for HPCC software and ANL mpich. It can be concluded
that the measured communication overhead with
HPCC software is lower than that of ANL mpich over Fast
Ethernet.
Figure 4: Performance
of Allreduce communication primitive.
The execution time
of the Barrier communication primitive on 4 nodes
of PARAM 10000 is approximately 90 µs, and on 16 nodes it is 890 µs, using HPCC software.
Thus, the customized AM
and C-MPI implementation, based on the specific hardware architecture
and with MPI optimized for shared memory, is a good way to reduce
message passing overheads on PARAM 10000.
C-MPI also exploits the SMP features by using a
direct memory copy instead of going through an intermediate
shared space or the network, which is critical to improving the
communication performance of MPI on PARAM 10000.

NAS Benchmarks
The Numerical Aerodynamic
Simulation (NAS) Program,
which is based at NASA Ames Research Center, is a large-scale
effort in computational aerodynamics, and its benchmarks
(NAS version 2.x, Revision 2.3) are generally used
for system benchmarking. The NAS benchmark set consists of two
major components: five parallel kernel benchmarks and three
simulated application benchmarks. The simulated application
benchmarks combine several computations in a manner that
resembles the actual order of execution in certain important
Computational Fluid Dynamics (CFD)
application codes. The NPB (NAS Parallel Benchmarks) suite
thus consists of five kernels (EP, MG, CG, FT, IS) and three
simulated applications (LU, SP, BT). The performance
in MFlops of selected NAS benchmarks using
HPCC software is given below in Table 1.
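HPCC with PARAMNet (MFlops), by number of processors:

| Routine/Problem Size | 8 processors | 16 processors | 32 processors |
|----------------------|--------------|---------------|---------------|
| MG A                 | 584          | 869           | 1311          |
| MG B                 | 629          | 939           | 1416          |
| LU A                 | 772          | 1502          | 2619          |
| LU B                 | 753          | 1481          | 1502          |
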
Table 1: Performance
of NAS benchmarks on PARAM 10000.
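The scaling behaviour implied by Table 1 can be summarized as relative speedup and parallel efficiency. The sketch below takes the 8-processor run as the baseline, since no single-processor figures are given; that choice of baseline and the helper name `scaling` are assumptions made for illustration.

```python
# MFlops from Table 1 (HPCC with PARAMNet), keyed by processor count.
TABLE1 = {
    "MG A": {8: 584, 16: 869, 32: 1311},
    "MG B": {8: 629, 16: 939, 32: 1416},
    "LU A": {8: 772, 16: 1502, 32: 2619},
    "LU B": {8: 753, 16: 1481, 32: 1502},
}


def scaling(name, procs, base=8):
    """Return (speedup, efficiency) relative to the run on `base` processors."""
    rates = TABLE1[name]
    speedup = rates[procs] / rates[base]
    efficiency = speedup / (procs / base)   # 1.0 would be ideal scaling
    return speedup, efficiency


for name in TABLE1:
    for procs in (16, 32):
        s, e = scaling(name, procs)
        print(f"{name} on {procs} processors: speedup {s:.2f}x, efficiency {e:.0%}")
```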
The NAS codes were further
optimized using professional performance libraries
and code restructuring. This further level of optimization,
i.e., tuning to the PARAM 10000 architecture, improves the performance
by about 10% to 15% with HPCC software on 32 processors
of PARAM 10000.

LINPACK Benchmark
As a yardstick of performance,
the 'best' performance as measured by the LINPACK Benchmark
is presented. Jack Dongarra introduced the LINPACK Benchmark;
a detailed description, as well as a list of performance
results on a wide variety of computing machines, is available
in PostScript form from netlib at http://www.netlib.org.
LINPACK was chosen because it is widely used and performance
numbers are available for almost all relevant systems. The LINPACK
subroutines analyze and solve systems of linear equations
by LU factorization. The benchmark is simple and easy to use, yet a
good indicator of the numerical computing capability of
a parallel system. The latest version of the parallel implementation
of LINPACK was downloaded and run on PARAM
10000. The details of this benchmark can be found at
http://www.netlib.org/benchmark/hpl/.
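To make the underlying computation concrete, the following is a small, unoptimized sketch of solving a dense linear system by LU factorization with partial pivoting. It is written in plain Python for illustration only and is not the LINPACK or hpl code; the function name `lu_solve` is hypothetical.

```python
def lu_solve(a, b):
    """Solve a x = b by LU factorization with partial pivoting.

    a is a list of n rows of length n and b is a list of length n.
    Neither input is modified.
    """
    n = len(a)
    lu = [list(row) for row in a]
    x = list(b)
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = lu[i][k] / lu[k][k]
            lu[i][k] = m                      # multiplier, i.e. an entry of L
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]      # update the trailing submatrix
            x[i] -= m * x[k]                  # forward substitution on the fly
    # Back substitution with the upper triangular factor U.
    for i in range(n - 1, -1, -1):
        s = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / lu[i][i]
    return x


solution = lu_solve([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3])
print(solution)
```

The triple loop over k, i and j is what gives the factorization its roughly cubic cost in the matrix order n.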
The task in the
LINPACK Benchmark is to solve a dense system of linear equations.
For the TOP500 list of computing machines, the version
of the benchmark used is the one that allows the user to scale the
size of the problem and to optimize the software in order
to achieve the best performance for a given machine. This
performance reflects the performance of a dedicated system
for solving a dense system of linear equations. Since
the problem is very regular, the performance achieved is
quite high, and the performance numbers correlate well
with peak performance. By measuring the actual performance
for different problem sizes n, a user can get not
only the maximal achieved performance Rmax
for the problem size Nmax, but also the
problem size N1/2 at which half of the performance Rmax
is achieved. These numbers, together with the theoretical
peak performance Rpeak, are the numbers
given in the TOP500.
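As an illustration of how Rmax, Nmax and N1/2 can be read off a set of runs, consider the sketch below. The measurement list is made-up example data, not PARAM 10000 results, and the helper names are hypothetical. The operation count 2/3 n^3 + 2 n^2 is the figure conventionally credited for this benchmark; N1/2 is estimated here by linear interpolation between neighbouring runs, which is an assumption of this sketch.

```python
def linpack_flops(n):
    """Floating-point operations conventionally credited for an n x n solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def gflops(n, seconds):
    """Performance in GFlops for a solve of order n that took `seconds`."""
    return linpack_flops(n) / seconds / 1e9


def summarize(runs):
    """runs: list of (n, gflops) sorted by n. Returns (r_max, n_max, n_half)."""
    n_max, r_max = max(runs, key=lambda run: run[1])
    half = r_max / 2.0
    n_half = None
    for (n0, r0), (n1, r1) in zip(runs, runs[1:]):
        if r0 < half <= r1:
            # Linear interpolation between the two bracketing measurements.
            n_half = n0 + (half - r0) * (n1 - n0) / (r1 - r0)
            break
    return r_max, n_max, n_half


# Hypothetical (problem size, GFlops) measurements, for illustration only.
example_runs = [(1000, 2.0), (2000, 4.0), (4000, 6.0), (8000, 8.0), (16000, 7.9)]
r_max, n_max, n_half = summarize(example_runs)
print(f"Rmax = {r_max} GFlops at Nmax = {n_max}, N1/2 ~ {n_half}")
```

If the smallest measured problem already reaches half of Rmax, the sketch returns None for N1/2, since the crossing point lies outside the measured range.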
A performance of 400
MFlops is achieved on one processor of PARAM 10000. A sustained
performance of 9.6 GFlops is achieved on 32 processors
of PARAM 10000, against a peak performance of 19.2
GFlops. Several optimization techniques bring a further
improvement, resulting in a sustained
performance of 10.5 GFlops on 32 processors
(8 nodes) of PARAM 10000, which is about 55%
of the peak performance.
For larger configurations,
i.e., beyond 32 processors (8 nodes), multiple
PARAMNet switches have been used to build clusters of 16
nodes (64 processors) and 32 nodes (128
processors). The sustained performance of hpl on
128 processors is 31.5 GFlops, whereas the
peak performance is 76.8 GFlops. For the complete
configuration of PARAM 10000, i.e., 100 GFlops peak, the
sustained performance is 39.8 GFlops. The summary
is given in Table 2 below:
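| No. of Nodes/Processors | Approx. Sustained Performance in GFlops | Approx. Peak Performance in GFlops |
|-------------------------|-----------------------------------------|------------------------------------|
| 8/32                    | 10                                      | 19                                 |
| 16/64                   | 19                                      | 38                                 |
| 32/128                  | 32                                      | 76                                 |
| 40/160                  | 40                                      | 100                                |
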
Table 2: Performance
of LINPACK benchmark on PARAM 10000.
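The ratio of sustained to peak performance in Table 2 can be computed directly. The sketch below uses the rounded figures from the table, so the percentages are approximate; the helper name `efficiency` is hypothetical.

```python
# (nodes, processors, approx. sustained GFlops, approx. peak GFlops) from Table 2.
TABLE2 = [
    (8, 32, 10, 19),
    (16, 64, 19, 38),
    (32, 128, 32, 76),
    (40, 160, 40, 100),
]


def efficiency(sustained, peak):
    """Fraction of peak performance that is sustained."""
    return sustained / peak


for nodes, procs, sustained, peak in TABLE2:
    print(f"{nodes:2d} nodes / {procs:3d} processors: "
          f"{efficiency(sustained, peak):.0%} of peak")
```

The fraction of peak falls as the configuration grows beyond a single switch, which is the usual pattern when communication cost rises with the number of nodes.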
We expect sustained
performance to improve by 25% using PARAMNet-II technology,
and we also expect these figures to be higher with
other optimized applications.
