Publication

A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers

Related concepts (21)

Exascale computing refers to computing systems capable of calculating at least "1018 IEEE 754 Double Precision (64-bit) operations (multiplications and/or additions) per second (exaFLOPS)"; it is a measure of supercomputer performance. Exascale computing is a significant achievement in computer engineering: primarily, it allows improved scientific applications and better prediction accuracy in domains such as weather forecasting, climate modeling and personalised medicine.

Petascale computing

Petascale computing refers to computing systems capable of calculating at least 1015 floating point operations per second (1 petaFLOPS). Petascale computing allowed faster processing of traditional supercomputer applications. The first system to reach this milestone was the IBM Roadrunner in 2008. Petascale supercomputers were succeeded by exascale computers. Floating point operations per second (FLOPS) are one measure of computer performance.

Fugaku (supercomputer)

Fugaku 富岳 is a petascale supercomputer at the Riken Center for Computational Science in Kobe, Japan. It started development in 2014 as the successor to the K computer and made its debut in 2020. It is named after an alternative name for Mount Fuji. It became the fastest supercomputer in the world in the June 2020 TOP500 list as well as becoming the first ARM architecture-based computer to achieve this. At this time it also achieved 1.42 exaFLOPS using the mixed fp16/fp64 precision HPL-AI benchmark.

Manycore processor

Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores (from a few tens of cores to thousands or more). Manycore processors are used extensively in embedded computers and high-performance computing. Manycore processors are distinct from multi-core processors in being optimized from the outset for a higher degree of explicit parallelism, and for higher throughput (or lower power consumption) at the expense of latency and lower single-thread performance.

Parallel computing

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling.

Message Passing Interface

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to function on parallel computing architectures. The MPI standard defines the syntax and semantics of library routines that are useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. There are several open-source MPI implementations, which fostered the development of a parallel software industry, and encouraged development of portable and scalable large-scale parallel applications.

Frontier (supercomputer)

Hewlett Packard Enterprise Frontier, or OLCF-5, is the world's first and fastest exascale supercomputer, hosted at the Oak Ridge Leadership Computing Facility (OLCF) in Tennessee, United States and first operational in 2022. It is based on the Cray EX and is the successor to Summit (OLCF-4). , Frontier is the world's fastest supercomputer. Frontier achieved an Rmax of 1.102 exaFLOPS, which is 1.102 quintillion operations per second, using AMD CPUs and GPUs. Measured at 62.

Computer performance by orders of magnitude

This list compares various amounts of computing power in instructions per second organized by order of magnitude in FLOPS. Scientific E notation index: 2 | 3 | 6 | 9 | 12 | 15 | 18 | 21 | 24 | >24 TOC 5×10−1: Computing power of the average human mental calculation for multiplication using pen and paper 1 OP/S: Power of an average human performing calculations using pen and paper 1 OP/S: Computing power of Zuse Z1 5 OP/S: World record for addition set 5×101: Upper end of serialized human perception computation (light bulbs do not flicker to the human observer) 2.

Fault tolerance

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems.

Lustre (file system)

Lustre is a type of parallel , generally used for large-scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster. Lustre file system software is available under the GNU General Public License (version 2 only) and provides high performance file systems for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site systems. Since June 2005, Lustre has consistently been used by at least half of the top ten, and more than 60 of the top 100 fastest supercomputers in the world, including the world's No.

Navier–Stokes equations

The Navier–Stokes equations (nævˈjeː_stəʊks ) are partial differential equations which describe the motion of viscous fluid substances, named after French engineer and physicist Claude-Louis Navier and Irish physicist and mathematician George Gabriel Stokes. They were developed over several decades of progressively building the theories, from 1822 (Navier) to 1842-1850 (Stokes). The Navier–Stokes equations mathematically express momentum balance and conservation of mass for Newtonian fluids.

Message passing

In computer science, message passing is a technique for invoking behavior (i.e., running a program) on a computer. The invoking program sends a message to a process (which may be an actor or object) and relies on that process and its supporting infrastructure to then select and run some appropriate code. Message passing differs from conventional programming where a process, subroutine, or function is directly invoked by name. Message passing is key to some models of concurrency and object-oriented programming.

TOP500

The TOP500 project ranks and details the 500 most powerful non-distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The first of these updates always coincides with the International Supercomputing Conference in June, and the second is presented at the ACM/IEEE Supercomputing Conference in November.

Message-oriented middleware

Message-oriented middleware (MOM) is software or hardware infrastructure supporting sending and receiving messages between distributed systems. MOM allows application modules to be distributed over heterogeneous platforms and reduces the complexity of developing applications that span multiple operating systems and network protocols. The middleware creates a distributed communications layer that insulates the application developer from the details of the various operating systems and network interfaces.

Message queue

In computer science, message queues and mailboxes are software-engineering components typically used for inter-process communication (IPC), or for inter-thread communication within the same process. They use a queue for messaging – the passing of control or of content. Group communication systems provide similar kinds of functionality. The message queue paradigm is a sibling of the publisher/subscriber pattern, and is typically one part of a larger message-oriented middleware system.

Single point of failure

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system. Systems can be made robust by adding redundancy in all potential SPOFs. Redundancy can be achieved at various levels. The assessment of a potential SPOF involves identifying the critical components of a complex system that would provoke a total systems failure in case of malfunction.

Finite element method

The finite element method (FEM) is a popular method for numerically solving differential equations arising in engineering and mathematical modeling. Typical problem areas of interest include the traditional fields of structural analysis, heat transfer, fluid flow, mass transport, and electromagnetic potential. The FEM is a general numerical method for solving partial differential equations in two or three space variables (i.e., some boundary value problems).

Prototype

A prototype is an early sample, model, or release of a product built to test a concept or process. It is a term used in a variety of contexts, including semantics, design, electronics, and software programming. A prototype is generally used to evaluate a new design to enhance precision by system analysts and users. Prototyping serves to provide specifications for a real, working system rather than a theoretical one. In some design workflow models, creating a prototype (a process sometimes called materialization) is the step between the formalization and the evaluation of an idea.

High availability

High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities. Availability refers to the ability of the user community to obtain a service or good, access the system, whether to submit new work, update or alter existing work, or collect the results of previous work.

Tianhe-1

Tianhe-I, Tianhe-1, or TH-1 (, tian1he2-yi1hao4; Sky River Number One) is a supercomputer capable of an Rmax (maximum range) of 2.5 peta FLOPS. Located at the National Supercomputing Center of Tianjin, China, it was the fastest computer in the world from October 2010 to June 2011 and was one of the few petascale supercomputers in the world. In October 2010, an upgraded version of the machine (Tianhe-1A) overtook ORNL's Jaguar to become the world's fastest supercomputer, with a peak computing rate of 2.