SCI Publications

Page 1 of 15

SCI Publications

2024

J. K. Holmen , M. Garcıa, A. Bagusetty, V. Madananth, A. Sanderson,, M. Berzins. “Making Uintah Performance Portable for Department of Energy Exascale Testbeds,” In Euro-Par 2023: Parallel Processing, pp. 1--12. 2024.

ABSTRACT

To help ease ports to forthcoming Department of Energy (DOE) exascale systems, testbeds have been made available to select users. These testbeds are helpful for preparing codes to run on the same hardware and similar software as in their respective exascale systems. This paper describes how the Uintah Computational Framework, an open-source asynchronous many-task (AMT) runtime system, has been modified to be performance portable across the DOE Crusher, DOE Polaris, and DOE Sunspot testbeds in preparation for portable simulations across the exascale DOE Frontier and DOE Aurora systems. The Crusher, Polaris, and Sunspot testbeds feature the AMD MI250X, NVIDIA A100, and Intel PVC GPUs, respectively. This performance portability has been made possible by extending Uintah’s intermediate portability layer [18] to additionally support the Kokkos::HIP, Kokkos::OpenMPTarget, and Kokkos::SYCL back-ends. This paper also describes notable updates to Uintah’s support for Kokkos, which were required to make this extension possible. Results are shown for a challenging radiative heat transfer calculation, central to the University of Utah’s predictive boiler simulations. These results demonstrate single-source portability across AMD-, NVIDIA-, and Intel-based GPUs using various Kokkos back-ends.

2023

N. Shingde, M. Berzins, T. Blattner, W. Keyrouz, A. Bardakoff. “Extending Hedgehog’s dataflow graphs to multi-node GPU architectures,” In Workshop on Asynchronous Many-Task Systems and Applications (WAMTA23), 2023.

ABSTRACT

Asynchronous task-based systems offer the possibility of making it easier to take advantage of scalable heterogeneous architectures.
This paper extends the National Institute of Standards and Technology’s Hedgehog dataflow graph models, which target a single high-end
compute node, to run on a cluster by borrowing aspects of Uintah’s cluster-scale task graphs and applying them to a sample implementation
of matrix multiplication. These results are compared to implementations using the leading libraries, SLATE and DPLASMA, for illustrative purposes only. The motivation behind this work is to demonstrate that using general purpose high-level abstractions, such as Hedgehog’s dataflow graphs, does not negatively impact performance.

2022

John Holmen. “Portable, Scalable Approaches For Improving Asynchronous Many-Task Runtime Node Use,” School of Computing, University of Utah, 2022.

ABSTRACT

This research addresses node-level scalability, portability, and heterogeneous computing challenges facing asynchronous many-task (AMT) runtime systems. These challenges have arisen due to increasing socket/core/thread counts and diversity among supported architectures on current and emerging high-performance computing (HPC) systems. This places greater emphasis on thread scalability and simultaneous use of diverse architectures to maximize node use and is complicated by architecture-specific programming models.

To reduce the exposure of application developers to such challenges, AMT programming models have emerged to offer a runtime-based solution. These models overdecompose a problem into many fine-grained tasks to be scheduled and executed by an underlying runtime to improve node-level concurrency. However, task execution granularity challenges remain, and it is unclear where and how shared memory programming models should be used within an AMT model to improve node use. This research aims to ease these design decisions with consideration for performance portability layers (PPLs), which provide a single interface to multiple shared memory programming models.

The contribution of this research is the design of a task scheduling approach for portably improving node use when extending AMT runtime systems to many-core and heterogeneous HPC systems with shared memory programming models. The success of this approach is shown through the portable adoption of a performance portability layer, Kokkos, within Uintah, a representative AMT runtime system. The resulting task scheduler enables the scheduling and execution of portable, fine-grained tasks across processors and accelerators simultaneously with flexible control over task execution granularity. A collection of experiments on current many-core and heterogeneous HPC systems are used to validate this approach and inform design recommendations. Among resulting recommendations are approaches for easing the adoption of a heterogeneous MPI+PPL task scheduling approach in an asynchronous many-task runtime system and furthermore to ease indirect adoption of a performance portability layer in large legacy codebases.

J.K. Holmen, D. Sahasrabudhe, M. Berzins. “Porting Uintah to Heterogeneous Systems,” In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC22) Best Paper Award, ACM, 2022.

ABSTRACT

The Uintah Computational Framework is being prepared to make portable use of forthcoming exascale systems, initially the DOE Aurora system through the Aurora Early Science Program. This paper describes the evolution of Uintah to be ready for such architectures. A key part of this preparation has been the adoption of the Kokkos performance portability layer in Uintah. The sheer size of the Uintah codebase has made it imperative to have a representative benchmark. The design of this benchmark and the use of Kokkos within it is discussed. This paper complements recent work with additional details and new scaling studies run 24x further than earlier studies. Results are shown for two benchmarks executing workloads representative of typical Uintah applications. These results demonstrate single-source portability across the DOE Summit and NSF Frontera systems with good strong-scaling characteristics. The challenge of extending this approach to anticipated exascale systems is also considered.

2021

A. Dubey, M. Berzins, C. Burstedde, M.l L. Norman, D. Unat, M. Wahib. “Structured Adaptive Mesh Refinement Adaptations to Retain Performance Portability With Increasing Heterogeneity,” In Computing in Science & Engineering, Vol. 23, No. 5, pp. 62-66. 2021.
ISSN: 1521-9615
DOI: 10.1109/MCSE.2021.3099603

ABSTRACT

Adaptive mesh refinement (AMR) is an important method that enables many mesh-based applications to run at effectively higher resolution within limited computing resources by allowing high resolution only where really needed. This advantage comes at a cost, however: greater complexity in the mesh management machinery and challenges with load distribution. With the current trend of increasing heterogeneity in hardware architecture, AMR presents an orthogonal axis of complexity. The usual techniques, such as asynchronous communication and hierarchy management for parallelism and memory that are necessary to obtain reasonable performance are very challenging to reason about with AMR. Different groups working with AMR are bringing different approaches to this challenge. Here, we examine the design choices of several AMR codes and also the degree to which demands placed on them by their users influence these choices.

J. K. Holmen, D. Sahasrabudhe, M. Berzins. “A Heterogeneous MPI+PPL Task Scheduling Approach for Asynchronous Many-Task Runtime Systems,” In Proceedings of the Practice and Experience in Advanced Research Computing 2021 on Sustainability, Success and Impact (PEARC21), ACM, 2021.

ABSTRACT

Asynchronous many-task runtime systems and MPI+X hybrid parallelism approaches have shown promise for helping manage the increasing complexity of nodes in current and emerging high performance computing (HPC) systems, including those for exascale. The increasing architectural diversity, however, poses challenges for large legacy runtime systems emphasizing broad support for major HPC systems. Performance portability layers (PPL) have shown promise for helping manage this diversity. This paper describes a heterogeneous MPI+PPL task scheduling approach for combining these promising solutions with additional consideration for parallel third party libraries facing similar challenges to help prepare such a runtime for the diverse heterogeneous systems accompanying exascale computing. This approach is demonstrated using a heterogeneous MPI+Kokkos task scheduler and the accompanying portable abstractions [15] implemented in the Uintah Computational Framework, an asynchronous many-task runtime system, with additional consideration for hypre, a parallel third party library. Results are shown for two challenging problems executing workloads representative of typical Uintah applications. These results show performance improvements up to 4.4x when using this scheduler and the accompanying portable abstractions [15] to port a previously MPI-Only problem to Kokkos::OpenMP and Kokkos::CUDA to improve multi-socket, multi-device node use. Good strong-scaling to 1,024 NVIDIA V100 GPUs and 512 IBM POWER9 processor are also shown using MPI+Kokkos::OpenMP+Kokkos::CUDA at scale.

J. K. Holmen, D. Sahasrabudhe, M. Berzins, A. Bardakoff, T. J. Blattner, . Keyrouz. “Uintah+Hedgehog: Combining Parallelism Models for End-to-End Large-Scale Simulation Performance,” Scientific Computing and Imaging Institute, 2021.

ABSTRACT

The complexity of heterogeneous nodes near and at exascale has increased the need for “heroic” programming efforts. To accommodate this complexity, significant investment is required for codes not yet optimizing for low-level architecture features (e.g., wide vector units) and/or running at large-scale. This paper describes ongoing efforts to combine two codes, Hedgehog and Uintah, lying at both extremes to ease programming efforts. The end goals of this effort are (1) to combine the two codes to make an asynchronous many-task runtime system specializing in both node-level and large-scale performance and (2) to further improve the accessibility of both with portable abstractions. A prototype adopting Hedgehog in Uintah and a prototype extending Hedgehog to support MPI+X hybrid parallelism are discussed. Results achieving ∼60% of NVIDIA V100 GPU peak performance for a distributed DGEMM problem are shown for a naive MPI+Hedgehog implementation before any attempt to optimize for performance.

Authors note: This is a refereed but unpublished report that was
submitted to, reviewed for and accepted in revised form for a presentation of the same material at the Hipar Workshop at Supercomputing 21

C. R. Johnson. “Translational computer science at the scientific computing and imaging institute,” In Journal of Computational Science, Vol. 52, pp. 101217. 2021.
ISSN: 1877-7503
DOI: https://doi.org/10.1016/j.jocs.2020.101217

ABSTRACT

The Scientific Computing and Imaging (SCI) Institute at the University of Utah evolved from the SCI research group, started in 1994 by Professors Chris Johnson and Rob MacLeod. Over time, research centers funded by the National Institutes of Health, Department of Energy, and State of Utah significantly spurred growth, and SCI became a permanent interdisciplinary research institute in 2000. The SCI Institute is now home to more than 150 faculty, students, and staff. The history of the SCI Institute is underpinned by a culture of multidisciplinary, collaborative research, which led to its emergence as an internationally recognized leader in the development and use of visualization, scientific computing, and image analysis research to solve important problems in a broad range of domains in biomedicine, science, and engineering. A particular hallmark of SCI Institute research is the creation of open source software systems, including the SCIRun scientific problem-solving environment, Seg3D, ImageVis3D, Uintah, ViSUS, Nektar++, VisTrails, FluoRender, and FEBio. At this point, the SCI Institute has made more than 50 software packages broadly available to the scientific community under open-source licensing and supports them through web pages, documentation, and user groups. While the vast majority of academic research software is written and maintained by graduate students, the SCI Institute employs several professional software developers to help create, maintain, and document robust, tested, well-engineered open source software. The story of how and why we worked, and often struggled, to make professional software engineers an integral part of an academic research institute is crucial to the larger story of the SCI Institute’s success in translational computer science (TCS).

Damodar Sahasrabudhe. “Enhancing Asynchronous Many-Task Runtime Systems for Next-Generation Architectures and Exascale Supercomputers,” School of Computing, University of Utah, Salt Lake City, UT, USA, 2021.

ABSTRACT

Exascale supercomputers capable of computing 10¹⁸ double-precision floating point operations per second are expected to be operational around 2022/23. The complexity and diversity of the proposed exascale machines pose new challenges for the software applications, namely, 1) implementing efficient data management; 2) having programming systems to exploit locality and multimillion parallelism; 3) developing efficient algorithms to leverage new architectures; 4) ensuring resiliency; and 5) improving scientific productivity on diverse architectures. Due to data-driven scheduling and asynchronous execution, Asynchronous Many-Task (AMT) runtime systems show promise to handle these exascale challenges.

One such AMT, the Uintah Computational Framework, maintains two distinct layers for the application and underlying runtime infrastructure. This distinction allows Uintah users to concentrate on application and the Uintah infrastructure handles communication, data coherency, multithreading, and architecture-specific complexities.

This dissertation addresses some of the exascale challenges and also integrates the individual solutions under the single umbrella of Uintah. The resiliency approach handles node failure faster than the traditional checkpointing method and helps to address challenge (4). A potential solution for challenges (2) and (3) can be the new asynchronous scheduler designed for the Sunway Taihulight supercomputer that shows the benefits of asynchronous execution. The novel portable Single Instruction Multiple Data (SIMD) primitive provides a prospective approach to handle (2) and (5), which achieves near-ideal vectorization on Central Processing Units (CPUs) along with Graphics Processing Unit (GPU) portability provided by the CUDA back end. The newly developed threading model using MPI endpoints shows performance improvements over the MPI-everywhere version, which can be one of the solutions to tackle challenges (2) and (3). Finally, this work enhances the heterogeneous scheduler, contributes to the ongoing portability drive, and successfully runs a simulation using portable AMT tasks on thousands of CPUs and GPUs. These enhancements are important to answer challenges (2), (3), and (5). As a result, this research takes Uintah closer to exascale readiness. Using Uintah as an example, this work demonstrates how AMTs, third-party libraries, and applications can be enhanced to benefit from the next-generation architectures.

W. T. Sołowski, M. Berzins, W. Coombs, J. Guilkey, M. Möller, Q. A. Tran, T. Adibaskoro, S. Seyedan, R. Tielen, K. Soga. “Material point method: Overview and challenges ahead (with videos),” In Advances in Applied Mechanics, 1, Vol. 14, Ch. 2, Elsevier, pp. 113-204. 2021.
ISBN: 978-0-323-88519-5

ABSTRACT

The paper gives an overview of Material Point Method and shows its evolution over the last 25 years. The Material Point Method developments followed a logical order. The article aims at identifying this order and show not only the current state of the art, but explain the drivers behind the developments and identify what is currently still missing.The paper explores modern implementations of both explicit and implicit Material Point Method. It concentrates mainly on uses of the method in engineering, but also gives a short overview of Material Point Method application in computer graphics and animation. Furthermore, the article gives overview of errors in the material point method algorithms, as well as identify gaps in knowledge, filling which would hopefully lead to a much more efficient and accurate Material Point Method. The paper also briefly discusses algorithms related to contact and boundaries, coupling the Material Point Method with other numerical methods and modeling of fractures. It also gives an overview of modeling of multi-phase continua with Material Point Method. The paper closes with numerical examples, aiming at showing the capabilities of Material Point Method in advanced simulations. Those include landslide modeling, multiphysics simulation of shaped charge explosion and simulations of granular material flow out of a silo undergoing changes from continuous to discontinuous and back to continuous behavior.The paper uniquely illustrates many of the developments not only with figures but also with videos, giving the whole extend of simulation instead of just a timestamped image

W. T. Sołowski, M. Berzins, W. Coombs, J. Guilkey, M. Möller, Q. A. Tran, T. Adibaskoro, S. Seyedan, R. Tielen, K. Soga. “Material point method: Overview and challenges ahead (without videos),” In Advances in Applied Mechanics, 1, Vol. 14, Ch. 2, Elsevier, pp. 113-204. 2021.

ABSTRACT

R. Zambre, D. Sahasrabudhe, H. Zhou, M. Berzins, A. Chandramowlishwaran, P. Balaji. “Logically Parallel Communication for Fast MPI+Threads Communication,” In Proceedings of the Transactions on Parallel and Distributed Computing, IEEE, April, 2021.

ABSTRACT

Supercomputing applications are increasingly adopting the MPI+threads programming model over the traditional “MPI everywhere” approach to better handle the disproportionate increase in the number of cores compared with other on-node resources. In practice, however, most applications observe a slower performance with MPI+threads primarily because of poor communication performance. Recent research efforts on MPI libraries address this bottleneck by mapping logically parallel communication, that is, operations that are not subject to MPI’s ordering constraints to the underlying network parallelism. Domain scientists, however, typically do not expose such communication independence information because the existing MPI-3.1 standard’s semantics can be limiting. Researchers had initially proposed user-visible endpoints to combat this issue, but such a solution requires intrusive changes to the standard (new APIs). The upcoming MPI-4.0 standard, on the other hand, allows applications to relax unneeded semantics and provides them with many opportunities to express logical communication parallelism. In this paper, we show how MPI+threads applications can achieve high performance with logically parallel communication. Through application case studies, we compare the capabilities of the new MPI-4.0 standard with those of the existing one and user-visible endpoints (upper bound). Logical communication parallelism can boost the overall performance of an application by over 2x.

2020

D. Sahasrabudhe, M. Berzins. “Improving Performance of the Hypre Iterative Solver for Uintah Combustion Codes on Manycore Architectures Using MPI Endpoints and Kernel Consolidation,” In Computational Science -- ICCS 2020, 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part I, Springer International Publishing, pp. 175--190. 2020.
ISBN: 978-3-030-50371-0

ABSTRACT

The solution of large-scale combustion problems with codes such as the Arches component of Uintah on next generation computer architectures requires the use of a many and multi-core threaded approach and/or GPUs to achieve performance. Such codes often use a low-Mach number approximation, that require the iterative solution of a large system of linear equations at every time step. While the discretization routines in such a code can be improved by the use of, say, OpenMP or Cuda Approaches, it is important that the linear solver be able to perform well too. For Uintah the Hypre iterative solver has proved to solve such systems in a scalable way. The use of Hypre with OpenMP leads to at least 2x slowdowns due to OpenMP overheads, however. This behavior is analyzed and a solution proposed by using the MPI Endpoints approach is implemented within Hypre, where each team of threads acts as a different MPI rank. This approach minimized OpenMP synchronization overhead, avoided slowdowns, performed as fast or (up to 1.5x) faster than Hypre’s MPI only version, and allowed the rest of Uintah to be optimized using OpenMP. Profiling of the GPU version of Hypre showed the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro kernels and was further optimized by using Cuda-aware MPI. The overall speedup of 1.26x to 1.44x was observed compared to the baseline GPU implementation.

D. Sahasrabudhe, R. Zambre, A. Chandramowlishwaran, M. Berzins. “Optimizing the Hypre solver for manycore and GPU architectures,” In Journal of Computational Science, Springer International Publishing, pp. 101279. 2020.
ISBN: 978-3-030-50371-0
ISSN: 1877-7503
DOI: https://doi.org/10.1016/j.jocs.2020.101279

ABSTRACT

The solution of large-scale combustion problems with codes such as Uintah on modern computer architectures requires the use of multithreading and GPUs to achieve performance. Uintah uses a low-Mach number approximation that requires iteratively solving a large system of linear equations. The Hypre iterative solver has solved such systems in a scalable way for Uintah, but the use of OpenMP with Hypre leads to at least 2x slowdown due to OpenMP overheads. The proposed solution uses the MPI Endpoints within Hypre, where each team of threads acts as a different MPI rank. This approach minimizes OpenMP synchronization overhead and performs as fast or (up to 1.44x) faster than Hypre’s MPI-only version, and allows the rest of Uintah to be optimized using OpenMP. The profiling of the GPU version of Hypre shows the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using Cuda-aware MPI, resulting in an overall speedup of 1.16–1.44x compared to the baseline GPU implementation.

The above optimization strategies were published in the International Conference on Computational Science 2020. This work extends the previously published research by carrying out the second phase of communication-centered optimizations in Hypre to improve its scalability on large-scale supercomputers. This includes an efficient non-blocking inter-thread communication scheme, communication-reducing patch assignment, and expression of logical communication parallelism to a new version of the MPICH library that utilizes the underlying network parallelism. The above optimizations avoid communication bottlenecks previously observed during strong scaling and improve performance by up to 2x on 256 nodes of Intel Knight’s Landing processor.

2019

J. K. Holmen, B. Peterson, A. Humphrey, D. Sunderland, O. H. Diaz-Ibarra, J. N. Thornock, M. Berzins. “Portably Improving Uintah's Readiness for Exascale Systems Through the Use of Kokkos,” SCI Institute, 2019.

ABSTRACT

Uncertainty and diversity in future HPC systems, including those for exascale, makes portable codebases desirable. To ease future ports, the Uintah Computational Framework has adopted the Kokkos C++ Performance Portability Library. This paper describes infrastructure advancements and performance improvements using partitioning functionality recently added to Kokkos within Uintah's MPI+Kokkos hybrid parallelism approach. Results are presented for two challenging calculations that have been refactored to support Kokkos::OpenMP and Kokkos::Cuda back-ends. These results demonstrate performance improvements up to (i) 2.66x when refactoring for portability, (ii) 81.59x when adding loop-level parallelism via Kokkos back-ends, and (iii) 2.63x when more eciently using a node. Good strong-scaling characteristics to 442,368 threads across 1728 Knights Landing processors are also shown. These improvements have been achieved with little added overhead (sub-millisecond, consuming up to 0.18% of per-timestep time). Kokkos adoption and refactoring lessons are also discussed.

J. K. Holmen, B. Peterson, M. Berzins. “An Approach for Indirectly Adopting a Performance Portability Layer in Large Legacy Codes,” In 2nd International Workshop on Performance, Portability, and Productivity in HPC (P3HPC), In conjunction with SC19, 2019.

ABSTRACT

Diversity among supported architectures in current and emerging high performance computing systems, including those for exascale, makes portable codebases desirable. Portability of a codebase can be improved using a performance portability layer to provide access to multiple underlying programming models through a single interface. Direct adoption of a performance portability layer, however, poses challenges for large pre-existing software frameworks that may need to preserve legacy code and/or adopt other programming models in the future. This paper describes an approach for indirect adoption that introduces a framework-specific portability layer between the application developer and the adopted performance portability layer to help improve legacy code support and long-term portability for future architectures and programming models. This intermediate layer uses loop-level, application-level, and build-level components to ease adoption of a performance portability layer in large legacy codebases. Results are shown for two challenging case studies using this approach to make portable use of OpenMP and CUDA via Kokkos in an asynchronous many-task runtime system, Uintah. These results show performance improvements up to 2.7x when refactoring for portability and 2.6x when more efficiently using a node. Good strong-scaling to 442,368 threads across 1,728 Knights Landing processors are also shown using MPI+Kokkos at scale.

Alan Humphrey. “Scalable Asynchronous Many-Task Runtime Solutions to Globally Coupled Problems,” School of Computing, University of Utah, 2019.

ABSTRACT

Thermal radiation is an important physical process and a key mechanism in a class of challenging engineering and research problems. The principal exascale-candidate application motivating this research is a large eddy simulation (LES) aimed at predicting the performance of a commercial, 1200 MWe ultra-super critical (USC) coal boiler, with radiation as the dominant mode of heat transfer. Scalable modeling of radiation is currently one of the most challenging problems in large-scale simulations, due to the global, all-to-all physical and resulting computational connectivity. Fundamentally, radiation models impose global data dependencies, requiring each compute node in a distributed memory system to send data to, and receive data from, potentially every other node. This process can be prohibitively expensive on large distributed memory systems due to pervasive all-to-all message passing interface (MPI) communication. Correctness is also difficult to achieve when coordinating global communication of this kind. Asynchronous many-task (AMT) runtime systems are a possible leading alternative to mitigate programming challenges at the runtime system-level, sheltering the application developer from the complexities introduced by future architectures. However, large-scale parallel applications with complex global data dependencies, such as in radiation modeling, pose significant scalability challenges themselves, even for a highly tuned AMT runtime. The principal aims of this research are to demonstrate how the Uintah AMT runtime can be adapted, making it possible for complex multiphysics applications with radiation to scale on current petascale and emerging exascale architectures. For Uintah, which uses a directed acyclic graph to represent the computation and associated data dependencies, these aims are achieved through: 1) the use of an AMT runtime; 2) adapting and leveraging Uintah’s adaptive mesh refinement support to dramatically reduce computation, communication volume, and nodal memory footprint for radiation calculations; and 3) automating the all-to-all communication at the runtime level through a task graph dependency analysis phase designed to efficiently manage data dependencies inherent in globally coupled problems.

A. Humphrey, M. Berzins. “An Evaluation of An Asynchronous Task Based Dataflow Approach For Uintah,” In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Vol. 2, pp. 652-657. July, 2019.
ISSN: 0730-3157
DOI: 10.1109/COMPSAC.2019.10282

ABSTRACT

The challenge of running complex physics code on the largest computers available has led to dataflow paradigms being explored. While such approaches are often applied at smaller scales, the challenge of extreme-scale data flow computing remains. The Uintah dataflow framework has consistently used dataflow computing at the largest scales on complex physics applications. At present Uintah contains two main dataflow models. Both are based upon asynchronous communication. One uses a static graph-based approach with asynchronous communication and the other uses a more dynamic approach that was introduced almost a decade ago. Subsequent changes within the Uintah runtime system combined with many more large scale experiments, has necessitated a reevaluation of these two approaches, comparing them in the context of large scale problems. While the static approach has worked well for some large-scale simulations, the dynamic approach is seen to offer performance improvements over the static case for a challenging fluid-structure interaction problem at large scale that involves fluid flow and a moving solid represented using particle method on an adaptive mesh.

B. Peterson. “Portable and Performant GPU/Heterogeneous Asynchronous Many-task Runtime System,” Subtitled “Ph.D. Dissertation,” University of Utah, School of Computing, Dec, 2019.

ABSTRACT

Asynchronous many-task (AMT) runtimes are maturing as a model for computing simulations on a diverse range of architectures at large-scale. The Uintah AMT framework is driven by a philosophy of maintaining an application layer distinct from the underlying runtime while operating on an adaptive mesh grid. This model has enabled task devel-opers to focus on writing task code while minimizing their interaction with MPI transfers, halo processing, data stores, coherency of simulation variables, and proper ordering of task execution. Further, Uintah is implementing an architecture portable solution by utilizing the Kokkos programming portability layer so that application tasks can be written in one codebase and performantly executed on CPUs, GPUs, Intel Xeon Phis, and other future architectures.

Of these architectures, it is perhaps Nvidia GPUs that introduce the greatest usability and portability challenges for AMT runtimes. Specifically, Nvidia GPUs require code to adhere to a proprietary programming model, use separate high capacity memory, utilize asynchrony of data movement and execution, and partition execution units among many streaming multiprocessors. Numerous novel solutions to both Uintah and Kokkos are required to abstract these GPU features into an AMT runtime while preserving an appli-cation layer and enabling portability.

The focus of this AMT research is largely split into two main parts, performance and portability. Runtime performance comes from 1) minimizing runtime overhead when preparing simulation variables for tasks prior to execution, and 2) executing a hetero-geneous mixture of tasks to keep compute node processing units busy. Preparation of simulation variables, especially halo processing, receives significant emphasis as Uintah’s target problems heavily rely on local and global halos. In addition, this work covers automated data movement of simulation variables between host and GPU memory as well as distributing tasks throughout a GPU for execution.

Portability is a productivity necessity as application developers struggle to maintain three sets of code per task, namely code for single CPU core execution, CUDA code for GPU tasks, and a third set of code for Xeon Phi parallel execution. Programming portability layers, such as Kokkos, provide a framework for this portability, however, Kokkos itself requires modifications to support GPU execution of finer grained tasks typical of AMT runtimes like Uintah. Currently, Kokkos GPU parallel loop execution is bulk-synchronous. This research demonstrates a model for portable loops that is asynchronous, nonblocking, and performant. Additionally, integrating GPU portability into Uintah required additional modifications to aid the application developer in avoiding Kokkos specific details.

This research concludes by demonstrating a GPU-enabled AMT runtime that is both performant and portable. Further, application developers are not burdened with additional architecture specific requirements. Results are demonstrated using production task codebases written for CPUs, GPUs, and Kokkos portability and executed in GPU homogeneous and CPU/GPU heterogeneous environments.

D. Sahasrabudhe, M. Berzins, J. Schmidt. “Node failure resiliency for Uintah without checkpointing,” In Concurrency and Computation: Practice and Experience, pp. e5340. 2019.
DOI: doi:10.1002/cpe.5340

ABSTRACT

The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency coupled with I/O challenges at exascale may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm-Based Fault Tolerance (ABFT) using Adaptive Mesh Refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables such as positivity or boundedness may be violated during interpolation. These challenges can be addressed by the combination of two techniques: 1. a fault-tolerant MPI implementation to recover from runtime node failures, and 2. high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "Limited Essentially Non-Oscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault-tolerant MPI - ULFM to recover from runtime failure, and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10x faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.

Page 1 of 15

SCIENTIFIC COMPUTING AND IMAGING INSTITUTEat the University of Utah

SCI Publications

SCIENTIFIC COMPUTING AND IMAGING INSTITUTE
at the University of Utah