ParCo marks a quarter of a century of the international conferences on parallel computing that started in Berlin. The aim of the conference is to give an overview of the state of the art of developments, applications, and future trends in high-performance computing for all platforms. The conference addresses all aspects of parallel computing, including applications, hardware and software technologies, as well as languages and development environments.

Special emphasis was placed on the role of high-performance processing in solving real-life problems in all areas, including scientific, engineering, and multidisciplinary applications, and on strategies, experiences, and conclusions reached with respect to parallel computing. The three organizations that constitute the Joint Institute for Computational Fundamental Science (JICFuS) have made enormous contributions to advancing the computational fundamental sciences.

The Center for Computational Sciences (CCS) of the University of Tsukuba developed the dedicated CP-PACS massively parallel computer system for particle physics research, through the collaborative effort of physicists and computer scientists at its precursor organization, the Center for Computational Physics.

In addition to capturing the number one spot on the TOP500 list, the CP-PACS system delivered breakthrough results such as the calculation of hadron masses. The High Energy Accelerator Research Organization (KEK) has been using supercomputers for decades, and continues to provide research resources for researchers across Japan involved in fundamental research relating to particle physics, nuclear physics, and astrophysics.

JICFuS brings together these three established organizations to achieve the following objectives.

Therefore, the discussion is based on an extrapolation of trends in current microelectronic technologies. This evolution is relatively simple: all that is required is integration; no new hardware or software technology needs to be developed, and old software runs with, at most, minor changes. As different parts of the system scale at different rates, however, new bottlenecks appear.

For example, if processor speed increases but the interconnect is not improved, then global communication may become a bottleneck. At some point, parametric evolution breaks down and qualitative changes to hardware and software are needed. For example, as memory latency measured in processor cycles increases, at some point a latency-hiding mechanism is needed to sustain reasonable performance on nonlocal applications.

At this point, vectors, multithreading, or some other mechanism is added to the architecture. Such a change is complex, requiring changes in software, usually in both systems software (including compilers) and applications software. Similarly, increased latency may necessitate different software mechanisms, such as dynamic load balancing.

Table 5. summarizes these extrapolations, which are for explanatory purposes only and do not represent detailed technology assessments. In particular, physical limits, such as the electromagnetic radiation that a Gflops chip might emit, are not considered. However, the committee expects that processor chips will compensate for slower growth in single-processor performance by putting several processors on each chip to continue to scale the performance per chip at 59 percent annually.

The numbers in Table 5. imply staggering amounts of concurrency: many loads (memory reads) will need to be executed concurrently to keep memory bandwidth busy while covering memory latency, and on the order of a thousand floating-point arithmetic operations could be performed during that time; further out, even more loads must be in flight, and tens of thousands of arithmetic operations could be performed while waiting on memory. These numbers are not sustainable. It is clear that systems derived using simple parametric evolution are already greatly strained and will break down completely. At the high end, the number of processors is increasing approximately 20 percent per year.
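The relationship between latency, bandwidth, and required concurrency follows from Little's law: the number of loads that must be in flight equals the memory latency times the rate at which loads must complete. The sketch below uses hypothetical parameter values, not the report's projections, and the function names are illustrative:

```python
def loads_in_flight(latency_ns, loads_per_ns):
    """Little's law: outstanding loads needed to fully hide memory latency."""
    return latency_ns * loads_per_ns

def stall_flops(latency_ns, flops_per_ns):
    """Arithmetic operations a processor could issue during one memory stall."""
    return latency_ns * flops_per_ns

# Hypothetical machine: 100 ns latency, 4 loads/ns bandwidth, 10 flops/ns peak.
print(loads_in_flight(100, 4))   # 400 loads must be outstanding
print(stall_flops(100, 10))      # 1000 flops fit into a single stall
```

As latency (measured in operations) grows, both numbers grow with it, which is the unsustainable trend the text describes.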

The committee sees no technology limits that would cause this trend to change. Extrapolating this trend forward indicates very large processor counts; since each processor will have significant amounts of concurrency for latency hiding, systems will run tens of millions of concurrent operations. The data plotted in Figures 5. were collected by K. Yelick from sources including L. Oliker et al., Bell et al., and Culler et al.

Dongarra and T. Dunigan. The committee considers MPI measurements because the algorithmic models below are based on message-passing programs. Least-squares fits to the data show an annual improvement of 28 percent in latency and 29 percent in bandwidth, albeit with substantial variation around the fitted trends. Improvement rates for lower-level communication systems follow similar trends. The committee summarized the expected evolution of parallel systems in Table 5. A later section discusses these extrapolations in more detail. For now, the committee simply points out that even if the individual components continue to improve parametrically, the overall system will see radical changes in how its parts are balanced.
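An annual improvement rate of this kind is recovered by a least-squares fit of log-transformed measurements against time. The sketch below is a minimal illustration on synthetic data; the function name and the numbers are not from the report:

```python
import math

def annual_improvement(years, values):
    """Least-squares slope of log(value) vs. year, returned as the implied
    annual improvement factor (e.g., 1.28 means 28 percent per year)."""
    n = len(years)
    xbar = sum(years) / n
    ybar = sum(math.log(v) for v in values) / n
    num = sum((x - xbar) * (math.log(v) - ybar) for x, v in zip(years, values))
    den = sum((x - xbar) ** 2 for x in years)
    return math.exp(num / den)

# Synthetic series improving exactly 30 percent per year:
years = [2000, 2001, 2002, 2003]
vals = [1.0, 1.3, 1.69, 2.197]
print(round(annual_improvement(years, vals), 3))  # -> 1.3
```

Fitting in log space is what makes a steady percentage improvement appear as a straight line, which is why R-squared values are quoted for such fits.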

Parametric evolution of the system as a whole is unsustainable, and current machines arguably have already moved into a problematic region of the design space. Furthermore, light traverses 60 m in 200 nsec, less than the diameter of the largest supercomputer installations; the decrease in general latency will slow down as one approaches this limit. Even if the numbers are grossly inaccurate, they clearly show that a parametric evolution of current communication architectures is not sustainable. Improvements in algorithms can sometimes improve performance as much as or more than improvements in hardware and software do.

While such dramatic breakthroughs are hard to predict, the rewards can be significant. Further research can lead to such breakthroughs in the many complicated domains to which supercomputers are applied. For some fields, the algorithms now in use will not solve the most challenging problems, even if they are run on the most capable systems expected to be available in the foreseeable future. For other fields, satisfactory algorithms of any kind remain to be developed.

While these algorithmic needs arise from quite different application areas, they often have much in common. The committee first describes the nature of the algorithms in common use, including their demands on the underlying hardware, and then summarizes some of their shortcomings and future challenges. Differential equations are the fundamental equations for many problems governed by the basic laws of physics and chemistry. The Poisson equation, for example, arises in models of many physical systems, including heat flow, fluid flow, diffusion, electrostatics, and gravity.

A note on O(·) notation: an algorithm that runs in time O(n²) will be much slower than an algorithm that runs in time O(n) once n is large enough, no matter what their respective constants are, which is why O(·) notation is used (DOE, Office of Science). Alternatively, the solution could be represented by a collection of particles, vortices, or other discrete objects. These equations arise, for example, in fusion, accelerator design, nuclear physics, weapons design, global climate change, reactive chemistry, astrophysics, nanotechnology, contaminant transport, materials science, drug design, and related fields. A more recent variation on this theme is stochastic differential equations, in which one or more of the terms represent a random process of some kind, such as diffusion.
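The point of the O(·) note above can be made concrete by computing the crossover size directly: an O(n) algorithm with a large constant still overtakes an O(n²) algorithm with a small one. The constants here are arbitrary illustrations:

```python
def crossover(c_linear, c_quadratic):
    """Smallest n at which c_quadratic * n^2 exceeds c_linear * n, i.e. the
    size beyond which the O(n) algorithm wins despite its larger constant."""
    n = 1
    while c_quadratic * n * n <= c_linear * n:
        n += 1
    return n

# O(n) algorithm costing 1000*n vs. O(n^2) algorithm costing n^2:
print(crossover(1000, 1))  # -> 1001
```

For any constants, the loop terminates once n exceeds c_linear / c_quadratic, which is exactly the "large enough n" in the text.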

In this case the goal is to compute certain statistics about the set of possible solutions. Included in this category of algorithms is work on new ways to discretize the equations and work on fast solution methods, such as multigrid and other multilevel methods, which use a hierarchy of meshes.

The demands these algorithms place on hardware depend both on the method and on the differential equation. Elliptic partial differential equations (PDEs), of which the aforementioned Poisson equation is the canonical example, have the property that the solution at every mesh point depends on data at every other mesh point, which in turn places demands on memory and network bandwidth. On the other hand, the data at distant mesh points can often be compressed significantly without losing much accuracy, ameliorating bandwidth needs, a property exploited both by multigrid methods and by some of the fast transforms discussed below.
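The local mesh-point coupling that such solvers work with can be seen in a deliberately naive Jacobi relaxation for a one-dimensional Poisson problem; each sweep touches only neighboring points, and multigrid methods accelerate exactly this kind of sweep by combining it with coarser meshes. This is a pedagogical sketch under simple assumptions (unit interval, zero boundary values), not a production solver:

```python
def jacobi_poisson_1d(f, sweeps):
    """Jacobi sweeps for -u'' = f on (0, 1) with u(0) = u(1) = 0,
    discretized on len(f) interior mesh points; each update uses only
    the two neighboring mesh points (a local stencil)."""
    n = len(f)
    h = 1.0 / (n + 1)
    u = [0.0] * n
    for _ in range(sweeps):
        u = [0.5 * ((u[i - 1] if i > 0 else 0.0) +
                    (u[i + 1] if i < n - 1 else 0.0) + h * h * f[i])
             for i in range(n)]
    return u

u = jacobi_poisson_1d([1.0] * 15, 2000)  # -u'' = 1
# exact solution is u(x) = x(1 - x)/2, so u at x = 0.5 approaches 0.125
```

Plain Jacobi needs very many sweeps to converge; the slow decay of smooth error components is precisely what the multigrid hierarchy of meshes is designed to fix.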

In contrast to elliptic equations, time-dependent equations may have only local dependence, with the solution at a mesh point depending only on data at nearby points. Some time-dependent equations nevertheless require global communication at every step, for example when an implicit time-stepping scheme entails a global solve. Even without global dependence, a time-dependent equation with a rapidly changing solution, solved with a mesh that adapts to the solution, may again have high bandwidth demands in order to support load balancing (see below). This variety of behaviors can be found in many of the ASC codes.

Long-standing open problems include overcoming the need for tiny femtosecond time steps in molecular dynamics simulations18 and finding better anisotropic radiation transport algorithms than flux-limited diffusion, discrete ordinates (Sn), or Monte Carlo,19 among many others. The desire to solve larger systems of equations, describing more complicated phenomena (not all of which may be represented or discretized the same way) on more complicated domains, spurs ongoing innovation in this area.

The committee considered both generating the above-mentioned initial mesh and adapting it during the solution phase. As for time to solution, it is often the process of generating the initial mesh that takes the most time. This is because it often requires a great deal of human intervention to create a suitable geometric model of a complicated physical system or object. The shuttle in particular has benefited from recent breakthroughs in mesh generation, 20 but many problems remain in producing three-dimensional meshes with guaranteed geometric and mathematical properties and in doing so efficiently in parallel or when memory is limited.

In addition to generating the initial mesh, hierarchies of meshes are needed for multigrid and multilevel methods, and producing these hierarchies in an automatic fashion, so as to appropriately approximate the solution at each level of resolution, is challenging. When the mesh represents a deforming material, algorithms are needed to deform the mesh as the material deforms (McMillan et al.). Molecular dynamics simulations use time steps of a few femtoseconds; some phenomena, such as protein folding, take many milliseconds.

Meshes are also sometimes adapted during the solution process to have higher resolution (more points) in regions where the solution is complicated and fewer points in simple regions. The complicated region can move during the solution; an example is the intricate flame front between burnt and unburnt gas in an internal combustion engine. Effective use of large numbers of parallel processors in these algorithms is an ongoing challenge, because the workload and load imbalance change unpredictably with the position of the complicated region.

Dense linear algebra, the class of algorithms for solving linear systems of equations, least squares problems, and eigenvalue problems in which all equations involve all or most variables, is epitomized by the Linpack benchmark discussed elsewhere in this report.

These algorithms are among the least sensitive to memory and network bandwidth of any discussed here, provided the problems are large enough. Dense linear algebra still forms a significant fraction (but not the majority) of the workload at some supercomputer centers.

For example, NERSC reports that materials science applications (representing 15 percent of their total cycles) spend 90 percent of their time in dense linear algebra routines today. It is worth noting that even in this relatively mature field, only a relatively small fraction of the algorithms with good sequential software implementations have good parallel software implementations.

The discrete equations on a mesh arising in a discretized differential equation are typically sparse (i.e., most of the matrix coefficients are zero). It is critical to exploit this mathematical structure to reduce memory and arithmetic operations, rather than using dense linear algebra. In other words, an ideal algorithm performs just a constant amount of work per nonzero and communicates very little. Whether in fact a reasonably efficient (let alone ideal) algorithm can be found depends strongly on the structure of the equations (namely, which variables appear and with what coefficients), so there is a large set of existing algorithms corresponding to the large variety of problem structures.
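The "constant work per nonzero" ideal is visible in the standard compressed sparse row (CSR) matrix-vector kernel, sketched here with a toy matrix; work and storage are proportional to the number of nonzeros, not to n²:

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in CSR form: `values` holds the nonzeros
    row by row, `col_idx` their column indices, and `row_ptr[r]:row_ptr[r+1]`
    delimits row r. Each nonzero is touched exactly once."""
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y

# 2x2 example: A = [[2, 0], [1, 3]] has three nonzeros.
print(csr_matvec([2.0, 1.0, 3.0], [0, 0, 1], [0, 1, 3], [1.0, 1.0]))  # [2.0, 4.0]
```

This is also the core operation of the iterative methods mentioned below, which is why their cost per iteration scales with the number of nonzeros.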

General solution techniques (such as sparse direct factorization) can be expensive in time and memory on the largest problems. However, they remain in widespread use because of their reliability and ease of use. Iterative methods, which typically rely on the more scalable operation of matrix-vector multiplication, can be much faster but often require careful problem-dependent design to converge in a reasonable number of iterations. As new exploitable problem structures arise and computer architectures change, algorithmic innovation is ongoing. Discrete algorithms are distinguished from others in this summary by having few, if any, floating-point numbers required to define the inputs or outputs of the problem.

Discrete algorithms can involve a wide array of combinatorial optimization problems arising in computational biology (for instance, looking for nearly matching sequences), the analysis of large data sets (finding clusters or other patterns in high-dimensional data sets), or even other parallel computing algorithms (balancing the workload or partitioning a sparse matrix among different parallel processors).
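A common heuristic for the workload-balancing problem mentioned above is greedy assignment of the heaviest remaining task to the currently least-loaded processor. This small sketch is an illustration of the idea, not a production partitioner:

```python
def greedy_partition(weights, n_procs):
    """Assign each task (by descending weight) to the least-loaded processor;
    returns the resulting per-processor loads. Greedy scheduling is a classic
    heuristic for this NP-hard problem."""
    loads = [0.0] * n_procs
    for w in sorted(weights, reverse=True):
        p = loads.index(min(loads))  # least-loaded processor so far
        loads[p] += w
    return loads

print(greedy_partition([7, 5, 4, 3, 1], 2))  # -> [10.0, 10.0]
```

Real mesh partitioners must additionally keep communicating neighbors together, which is what graph- and hypergraph-partitioning tools address.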

Many of these problems are NP-hard (non-deterministic polynomial-time hard), meaning that an optimal solution would take impractically long to compute on any foreseeable computer, so heuristic approximations are required. Again, the diversity of problems leads to a diversity of algorithms (perhaps involving floating point) and an ongoing potential for innovation. R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst.

Other discrete algorithms involve number theory (arising in cryptanalysis), symbolic algorithms for exact solutions to algebraic equations (arising in the intelligence community and elsewhere), and discrete event simulation and agent-based modeling (arising in traffic, epidemiology, and related simulations).

It appears that relatively little work (at least work that has been made public) has been done to parallelize symbolic algorithms. There are a variety of widely used fast transform methods, such as the fast Fourier transform (FFT), wavelets, the fast multipole method, kernels arising in quantum chemistry, and their numerous variations, where a clever reformulation changes, for example, an O(n²) algorithm into an O(n log n) algorithm.

These reformulations exploit the underlying mathematical or physical structure of the problem to represent intermediate results in compressed forms that are faster to compute and communicate. A recent major advance is O(n) methods in electronic structure calculations. It is an ongoing challenge to adapt these methods to new problem structures and new computer architectures.
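The flavor of such a reformulation is captured by the radix-2 Cooley-Tukey FFT, which reuses two half-size subproblems at each level to turn the O(n²) discrete Fourier transform into O(n log n). This is a textbook sketch, not a tuned implementation:

```python
import cmath

def fft(a):
    """Radix-2 Cooley-Tukey FFT; len(a) must be a power of two. The even- and
    odd-indexed halves are transformed recursively and combined with twiddle
    factors, reusing each half-size result twice."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2])
    odd = fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

y = fft([1, 1, 1, 1])  # transform of a constant signal
# all energy lands in the zeroth coefficient: y[0] == 4, the rest ~ 0
```

Libraries such as FFTW layer many machine-specific optimizations on top of this same recursion.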

Some of these algorithms (e.g., the FFT) can also be adapted automatically to the problem and machine at hand.


The Fastest Fourier Transform in the West (FFTW)24 is a successful example of a system for automatically adapting an FFT algorithm to perform well on a particular problem size and a particular computer. In addition to opportunities to improve algorithms in the categories described above (differential equations, mesh generation, linear algebra, discrete algorithms, and fast transforms), there are new, cross-cutting algorithmic needs driven by supercomputing that are common to many application areas. One reason for needing increased supercomputer performance is that many applications cannot be run using realistic parameter ranges of spatial resolution.

For many such applications, applying more computer power with substantially the same algorithms can significantly increase simulation quality. For example, mesh resolution can be increased. But the need for higher-resolution analyses may also lead to the need for faster algorithms. For example, solving a problem 10 times larger than currently possible would require a machine 10 times as powerful using an algorithm with complexity O(n), but 100 times as powerful using an algorithm with complexity O(n²).
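The arithmetic behind that observation is simply that solving a problem k times larger with an O(n^p) algorithm requires roughly k^p times the machine power; a one-line sketch makes the comparison explicit:

```python
def machine_factor(problem_factor, exponent):
    """Rough factor by which machine power must grow to solve a problem
    `problem_factor` times larger with an O(n ** exponent) algorithm."""
    return problem_factor ** exponent

print(machine_factor(10, 1))  # 10:  O(n) algorithm, 10x problem
print(machine_factor(10, 2))  # 100: O(n^2) algorithm, same 10x problem
```

This is why algorithmic improvements that lower the exponent can be worth more than a generation of hardware.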

It is sometimes possible to use physics-based algorithms like the fast multipole method or physics-based preconditioners that exploit particular properties of the equations being solved. One important area needing research is scalable adaptive methods, where the computational work adapts depending on the complexity of the physical solution, making load balancing difficult as the solution changes over time.

But in other applications, increased mesh resolution may require the development of new physics or algorithms to resolve or approximate phenomena at tiny scales. In some cases, submodels of detailed processes may be required within a coarser mesh. Sometimes completely different physical models may be required. In some problems (such as turbulence), physically unresolved processes at small length or time scales may have large effects on macroscopic phenomena, requiring approximations that differ from those for the resolved processes.

A similar example arises in molecular dynamics, where the molecular motions at the shortest time scales must currently be computed at intervals of 10⁻¹⁵ seconds (femtoseconds) to resolve reactions that may take a second or more; a new algorithm is needed to avoid the current bottleneck of 10¹⁵ sequential steps. Many real-world phenomena involve two or more coupled physical processes for which individual models and algorithms may be known (clouds, winds, ocean currents, heat flow inside and between the atmosphere and the ocean, atmospheric chemistry, and so on) but where the coupled system must be solved.

Vastly differing time and length scales among the different disciplinary models frequently make this coupled model much harder to solve. Emerging application areas also drive the need for new algorithms and applications. Bioinformatics, for example, is driving the need to couple equation-driven numerical computing with probabilistic and constraint-driven computing.

After one has a model that can be used to analyze (predict) the behavior of a physical system (such as an aircraft or weapons system), it is often desirable to use that model to synthesize or optimize a system so that it has certain desired properties, or to discover how sensitive the behavior is to parameter changes.

Such a problem can be much more challenging than analysis alone.


As an example, a typical analysis computes, from the shape of an airplane wing, the lift resulting from airflow over the wing by solving a differential equation. The related optimization problem is to choose the wing shape that maximizes lift, incorporating constraints to ensure that the wing can be manufactured. Solving that problem requires determining the direction of change in wing shape that causes the lift to increase, either by repeating the analysis as changes to shape are tried or by analytically computing the appropriate change in shape. Similar optimization problems can arise in any manufacturing process, as can parameter identification problems.

This transition to synthesis, sensitivity analysis, and optimization requires improved algorithms in nonlinear solvers, mathematical optimization techniques, and methods for quantifying uncertainty. Many fields now generate enormous quantities of observational and simulation data. These data may be represented by a diversity of data structures, including tables of numbers, irregular graphs, adaptive meshes, relational databases, two- or three-dimensional images, text, or various combined representations. Extracting scientific meaning from these data requires coupling numerical, statistical, and logical modeling techniques in ways that are unique to each discipline.

A machine model is the set of operations and their costs presented to the programmer by the underlying hardware and software. Algorithmic research has traditionally sought to minimize the number of arithmetic or logical operations. However, the most expensive operation on a machine is not arithmetic but, rather, fetching data from memory, especially remote memory. Furthermore, the relative costs of arithmetic and fetching data can change dramatically between machines and over time. Sometimes this means that the fastest algorithm must compress data that are needed far away before communicating them; this compression often involves approximations which one must carefully bound that rely on the detailed physics or other mathematical structure of the problem.

The fast multipole method and multigrid algorithms are celebrated and widely used examples of this technique. In these examples, reducing arithmetic and reducing data fetching go hand in hand. But there are other examples in which computation and communication must instead be traded off against each other, with the right trade-off depending on the machine. This optimization process could involve adjusting a few parameters in the algorithm describing data layouts, running a combinatorial optimization scheme to rebalance the load, or using a completely different algorithm that trades off computation and communication in different ways.

Successful tuning by hand is typically a tedious process requiring familiarity with everything from algorithms to compilers to hardware. Some success has been achieved in automating this process, but only for a few important algorithmic kernels, such as ATLAS 26 for matrix-matrix multiplication or FFTW for fast Fourier transforms. Work is needed on these adaptive algorithms to make them more broadly applicable and available to more users. The software used for computing in general and supercomputing in particular has multiple purposes.
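The timing-driven selection that systems like ATLAS and FFTW perform can be caricatured in a few lines: time several functionally equivalent variants on representative input and keep the fastest. The tuner and variants below are hypothetical stand-ins for illustration, not the actual libraries' mechanisms:

```python
import time

def tune(variants, *args):
    """Return the fastest of several equivalent kernels, chosen by timing each
    once on representative input (a toy version of empirical autotuning)."""
    best, best_t = None, float("inf")
    for f in variants:
        t0 = time.perf_counter()
        f(*args)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = f, dt
    return best

# Two equivalent ways to sum squares; the tuner keeps whichever ran faster here.
def v1(xs): return sum(x * x for x in xs)
def v2(xs): return sum(map(lambda x: x * x, xs))

kernel = tune([v1, v2], list(range(10000)))
print(kernel([1, 2, 3]))  # 14, regardless of which variant won
```

Real autotuners search much larger spaces (blockings, unrollings, whole algorithms) and amortize the search cost at install time or at planning time.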

The system software (the operating system, the scheduler, the accounting system, for example) provides the infrastructure for using the machine, independently of the particular applications for which it is used. The programming languages and tools help the user in writing and debugging applications and in understanding their performance. The applications codes directly implement the application. The software system is sometimes described as a stack of abstractions, in the sense that the operating system is the lowest level, with programming languages and tools layered above it and applications on top (Richard Vuduc, James W. Demmel, Katherine A. Yelick, November). Each of the conceptual layers is important in the overall system, and each layer in a supercomputer system has special characteristics that distinguish it from the layers in other kinds of computing systems.


Supercomputing software has many requirements in common with software for other computing systems. Layered abstractions provide higher-level operations for most users, allowing them to reuse complex operations without needing the deep knowledge of the specialists writing the lower levels.

Portability is essential, since many programs outlast their original platforms. In the supercomputing arena, a computer has a typical useful lifetime of 5 years, while applications codes that are decades old are still in daily use. Execution efficiency is important in all areas, but particularly for supercomputers, because of the high cost of the systems and the heavy demands of the applications. Ensuring correct results, a problem on all computers, is especially difficult on a large, complex system like a supercomputer.

Other issues are unique to supercomputer software. Foremost among these is the requirement for excellent scalability at all levels of the software. To benefit from parallel hardware, the software must provide enough concurrent operations to use all the hardware.

For example, a supercomputer with a thousand processors needs many thousands of operations available for execution at all times, or many tens of thousands if custom processors are used. Future systems may push this degree of concurrency to a million processors and beyond, and the concurrency level within each processor will need to increase in order to hide the larger memory latency.

Operating systems provide services, such as memory and process management, that enable multiple executing programs to share the system, and abstractions, such as interfaces and file systems, that both facilitate the programming layers above and reduce hardware dependence. Other key services they provide are security and protection, logging, and fault tolerance. Closely associated with those operating system roles is the management software. Key components include user accounts, queuing systems, system monitors, and configuration management.

A few projects have created supercomputer-class clusters running versions of Microsoft Windows; a prominent example of such a system, at the Cornell Theory Center, appeared on the June TOP500 list. Management software for supercomputing is quite varied. Even among sites that use the same management tools, the configurations (for instance, the number of queues and the policies that control them) differ substantially. Although there are open source versions of some of these tools, most production sites use proprietary management software even if they use open source software such as Linux for other software components.

This is probably due to limitations of the open source tools. Management software for supercomputing typically uses straightforward extensions or improvements to software for smaller systems, together with policies tailored to their user community. It is challenging to scale an operating system to a large number of processors. A modern operating system is a complex multithreaded application with asynchronous, event-driven logic, many sequential bottlenecks, and little data locality.

It is hard to scale such an application, and even harder to do so while maintaining full compatibility with a broadly used commercial operating system such as Linux or Windows. Many of the operating system services and the programming tools need to scale with the number of concurrent threads that are created. Thus, custom systems that achieve a given level of performance with fewer concurrent threads facilitate the scaling of these subsystems.

Large supercomputers are typically managed by multiple operating system images, each controlling one node. A single-system image (common file space, single login, single administrative point of control, and so on) is then provided by a layer above the per-node systems. This approach creates a fundamental mismatch between the virtual machine provided by the operating system, which is loosely coupled, and the tightly coupled parallel applications that run on it. A key manifestation of this mismatch is the lack of concurrent scheduling.

Most existing parallel programming models implicitly assume that the application controls a dedicated set of processors executing at the same speed. Thus, many parallel codes consist of an alternation of compute phases, in which an equal amount of computation is performed by each process, and global communication and synchronization phases. But a computation that frequently uses global synchronization cannot tolerate unsynchronized variations in computation speed due, for example, to asynchronous system activities (daemons, page misses, and so on).

For example, suppose that each node spends 1 percent of its time handling system events and each event requires five times as long as it takes to execute a barrier (a synchronization of all active processes).


If these system events occur simultaneously at all nodes, then the global loss of performance is 1 percent, as one might expect. However, if they occur at random times on each node, then on a large machine almost every barrier is statistically expected to be delayed by a system event somewhere, raising the effective synchronization cost several-fold. The effect is smaller on smaller systems, but still significant. A programmer without detailed knowledge of the underlying operating system would be unable to design an appropriate program to compensate for this variation.
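The amplification described above can be reproduced with a small Monte Carlo sketch; a barrier completes only when the slowest node arrives, so a rare per-node delay is paid on almost every barrier once the machine is large. The node counts, event probability, and event cost below are illustrative choices, not measurements:

```python
import random

def barrier_cost(n_nodes, p_event, event_cost, sweeps, rng):
    """Average cost of a barrier (nominal cost 1.0) when each node is
    independently delayed by `event_cost` with probability `p_event`;
    the barrier waits for the maximum delay across all nodes."""
    total = 0.0
    for _ in range(sweeps):
        delay = max(event_cost if rng.random() < p_event else 0.0
                    for _ in range(n_nodes))
        total += 1.0 + delay
    return total / sweeps

rng = random.Random(0)
small = barrier_cost(16, 0.01, 5.0, 500, rng)     # 16-node machine
large = barrier_cost(4096, 0.01, 5.0, 500, rng)   # 4096-node machine
print(small < large)  # True: the same noise hurts the larger machine far more
```

With 4096 nodes, some node is almost always delayed, so the average barrier cost approaches the full event cost; coordinated (gang-scheduled) system activity avoids this.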

Most supercomputer manufacturers (IBM, Hewlett-Packard, Cray) were surprised to encounter this problem on their systems, and most resolved it by various ad hoc means. Some supercomputers run a microkernel on the compute nodes that reroutes many system functions to a service node running a full-function operating system. This approach reduces asynchronous system events on the compute nodes and also reduces the frequency of software failures.

The implicit assumption in this approach is that page faults can be virtually eliminated. The model is increasingly inappropriate for complex, dynamic, heterogeneous applications. Changes in operating system structures to reduce asynchrony, or in programming models to tolerate asynchrony (or, likely, both), will be required.

Indeed, the more recent programming languages described in the next section tend to allow looser synchronization. However, it remains for applications to take advantage of it.

A programming model is an abstract conceptual view of the structure and operation of a computing system. For example, a uniform shared memory or global addressing model supports the abstraction that there is one uniformly addressable storage even though there may be multiple physical memories in use.

The use of a given programming model requires that the operating system, the programming languages, and the software tools provide the services that support that abstraction. In the context of this discussion, the programming languages at issue are the ones in which applications are written, not the ones in which the systems software is written (although the tools that support the applications programming language must provide appropriate interfacing with systems software). Programming tools provide a means to create and run programs. Key tools include compilers, interpreters, debuggers, and performance monitors.

The programming languages and tools for supercomputing are diverse. Many applications, in fact, use components written in more than one language. A useful taxonomy of languages might be based on the parts of the supercomputer under the control of language operations. Sequential imperative languages, such as C and Fortran, are commonly used to program individual processing elements (which may be single-processor nodes or threads in multithreaded systems).


Nodes consisting of several processors with a shared memory are typically programmed using modest extensions to these languages, such as the OpenMP27 extensions, which have bindings for C and Fortran. Collections of nodes or processors that do not share memory are programmed using calls to run-time libraries for message passing, such as MPI, or to other communication paradigms (for instance, one-sided communication). There has been some progress in the use of better-integrated parallel languages.

There are also research languages based on Java, such as Titanium. These languages support a shared memory model, in the sense that a single partitioned global name space allows all executing threads to access large shared data stored in shared but distributed arrays. At the full-system level, scripting languages are often used to assemble components into complete applications. Of course, each language requires its own compiler and development tools. Some sharing of tools is possible in principle but less common in practice. Parallel programming languages and parallel programming models are necessarily compromises between conflicting requirements.

Although many of the current compromises are deemed to be inadequate, it is not clear what a better solution should be. The use of dialects of Fortran and C stems, in part, from a desire to migrate legacy software and tools and to exploit existing user expertise. The ecosystem complexity described in Chapter 6 makes it difficult to experiment with new approaches.

To improve programmer productivity it would be desirable to have languages with a higher level of abstraction. It is not possible to understand the behavior of 10,000 concurrent threads that may interact in unexpected ways. Although many bugs can be found on smaller runs, some problems manifest themselves only at large scales; therefore, the ASC program has long listed support for thousands of processors as one of its top requirements for parallel debuggers.

For example, a compiler for a sequential language that generates code for a vector machine may be very sensitive to exactly how the program is written, whereas a language with vector operations makes that form of parallelism explicit. Even when the compiler can discover such parallelism, the result may be fragile. It is desirable to have portability across platforms.

Portability is needed both to leverage the variety of platforms that a community may have access to at a given point in time and to handle hardware evolution. Supercomputer applications often outlive the hardware they were designed for: A typical application may be used for 20 years, while the useful life of a supercomputer is more often 5 years. Currently, the oldest machine on the June TOP500 list is 7 years old.

By the same token, one wants good performance on each platform. Parallel platforms are distinguished not only by low-level details such as the precise instruction set of processors; they may also have very different performance characteristics, with numbers of processors ranging from just a few to many thousands, and with global latency ranging from hundreds of cycles to more than 10,000 cycles. In some cases, different algorithms are needed to accommodate such large differences. In general, the more disparate the set of target platforms, the harder it is for a single computer program to be mapped efficiently onto all of them by compiler and run time; some user help is needed.

In practice, supercomputer codes written to port to a multiplicity of platforms contain multiple versions tuned for different platforms. It is not clear how one improves this situation in general. Some research projects have attempted this in specific cases, including packages that generate tuned versions of library codes such as ATLAS36 and FFTW,37 domain-specific program generators such as the Tensor Contraction Engine38 and Spiral,39 and dynamic code generators such as tcc. Supercomputer platforms differ not only in the relative performance of their interconnects but also in the communication mechanisms supported.

Baumgartner, A. Auer, D. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. Harrison, S. Hirata, S. Krishnamoorthy, S. Krishnan, C. Lam, M. Nooijen, R. Pitzer, J. Ramanujam, P. Sadayappan, and A. In press.