Laplacian smoothing stochastic gradient Markov chain Monte Carlo|
B. Wang, D. Zou, Q. Gu, S. J. Osher. In SIAM Journal on Scientific Computing, Vol. 43, No. 1, SIAM, pp. A26-A53. 2021.
As an important Markov chain Monte Carlo (MCMC) method, the stochastic gradient Langevin dynamics (SGLD) algorithm has achieved great success in Bayesian learning and posterior sampling. However, SGLD typically suffers from a slow convergence rate due to its large variance caused by the stochastic gradient. In order to alleviate these drawbacks, we leverage the recently developed Laplacian smoothing technique and propose a Laplacian smoothing stochastic gradient Langevin dynamics (LS-SGLD) algorithm. We prove that for sampling from both log-concave and non-log-concave densities, LS-SGLD achieves strictly smaller discretization error in 2-Wasserstein distance, although its mixing rate can be slightly slower. Experiments on both synthetic and real datasets verify our theoretical results and demonstrate the superior performance of LS-SGLD on different machine learning tasks including posterior …
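To make the idea concrete, here is a minimal sketch of one LS-SGLD update, not the authors' implementation. It assumes the commonly stated preconditioned form x_{k+1} = x_k - eta * A^{-1} g_k + sqrt(2 * eta / beta) * A^{-1/2} * eps_k, where A = I - sigma * L and L is the periodic one-dimensional discrete Laplacian. The function names, the toy Gaussian target, and all parameter values are hypothetical. Because A is circulant, it is diagonalized by the discrete Fourier transform, written out naively here so the sketch needs only the standard library.

```python
import cmath
import math
import random


def dft(values, inverse=False):
    """Naive O(d^2) discrete Fourier transform (stdlib only, fine for small d)."""
    d = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(d):
        total = sum(v * cmath.exp(sign * 2j * math.pi * k * n / d)
                    for n, v in enumerate(values))
        out.append(total / d if inverse else total)
    return out


def smoothing_eigenvalues(d, sigma):
    """Eigenvalues of A = I - sigma * L for the periodic 1-D Laplacian L (assumes d >= 3)."""
    return [1.0 + 2.0 * sigma * (1.0 - math.cos(2.0 * math.pi * k / d))
            for k in range(d)]


def apply_power(vec, eigenvalues, power):
    """Return (A ** power) applied to vec, using the Fourier diagonalization of A."""
    coeffs = dft(vec)
    scaled = [c * lam ** power for c, lam in zip(coeffs, eigenvalues)]
    return [z.real for z in dft(scaled, inverse=True)]


def ls_sgld_step(x, stochastic_grad, eta, beta, sigma, rng):
    """One assumed LS-SGLD update; sigma = 0 reduces to plain SGLD."""
    eig = smoothing_eigenvalues(len(x), sigma)
    grad = apply_power(stochastic_grad(x, rng), eig, -1.0)            # A^{-1} g
    noise = apply_power([rng.gauss(0.0, 1.0) for _ in x], eig, -0.5)  # A^{-1/2} eps
    scale = math.sqrt(2.0 * eta / beta)
    return [xi - eta * gi + scale * ni for xi, gi, ni in zip(x, grad, noise)]


def noisy_gaussian_grad(x, rng):
    """Hypothetical toy target: gradient of 0.5 * ||x||^2 plus unit Gaussian noise."""
    return [xi + rng.gauss(0.0, 1.0) for xi in x]


rng = random.Random(0)
x = [1.0] * 8
for _ in range(100):
    x = ls_sgld_step(x, noisy_gaussian_grad, eta=0.05, beta=1.0, sigma=1.0, rng=rng)
```

Every eigenvalue of A is at least 1, so applying A^{-1} damps the high-frequency components of the stochastic gradient while leaving its mean component unchanged; this is the variance-reduction mechanism the abstract refers to.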
Stability and Generalization of the Decentralized Stochastic Gradient Descent|
Subtitled arXiv preprint arXiv:2102.01302, T. Sun, D. Li, B. Wang. 2021.
The stability and generalization of stochastic gradient-based methods provide valuable insights into the algorithmic performance of machine learning models. As the main workhorse for deep learning, stochastic gradient descent has been studied extensively. Nevertheless, the community has paid little attention to its decentralized variants. In this paper, we provide a novel formulation of the decentralized stochastic gradient descent. Leveraging this formulation together with (non)convex optimization theory, we establish the first stability and generalization guarantees for the decentralized stochastic gradient descent. Our theoretical results are built on top of a few common and mild assumptions and reveal, for the first time, that decentralization deteriorates the stability of SGD. We verify our theoretical findings by using a variety of decentralized settings and benchmark machine learning models.
Robust Certification for Laplace Learning on Geometric Graphs|
Subtitled arXiv preprint arXiv:2104.10837, M. Thorpe, B. Wang. 2021.
Graph Laplacian (GL)-based semi-supervised learning is one of the most used approaches for classifying nodes in a graph. Understanding and certifying the adversarial robustness of machine learning (ML) algorithms has attracted large amounts of attention from different research communities due to its crucial importance in many security-critical applied domains. There is great interest in the theoretical certification of adversarial robustness for popular ML algorithms. In this paper, we provide the first adversarial robust certification for the GL classifier. More precisely, we quantitatively bound the difference in the classification accuracy of the GL classifier before and after an adversarial attack. Numerically, we validate our theoretical certification results and show that leveraging existing adversarial defenses for the k-nearest neighbor classifier can remarkably improve the robustness of the GL classifier.
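For readers unfamiliar with the GL classifier, the following is a minimal sketch of Laplace learning under simplifying assumptions; it is not the certification procedure from the paper. Labeled nodes keep their values, and each unlabeled node is repeatedly set to the weighted average of its neighbors until the values are approximately harmonic. The function name, the Gauss-Seidel solver, and the five-node path graph are hypothetical choices for illustration.

```python
def laplace_learning(weights, labels, sweeps=200):
    """Approximate the harmonic extension of `labels` on a weighted graph.

    weights: symmetric adjacency matrix as nested lists with a zero diagonal.
    labels:  dict mapping labeled node index -> real value (e.g. 0.0 or 1.0).
    Assumes every unlabeled node has at least one neighbor.
    """
    n = len(weights)
    u = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if i in labels:
                continue  # labeled nodes act as fixed boundary values
            degree = sum(weights[i])
            # Gauss-Seidel update: weighted mean of the current neighbor values.
            u[i] = sum(w * uj for w, uj in zip(weights[i], u)) / degree
    return u


# Hypothetical example: a path graph 0 - 1 - 2 - 3 - 4 with the two ends labeled.
path = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
scores = laplace_learning(path, {0: 0.0, 4: 1.0})
predicted = [1 if s > 0.5 else 0 for s in scores]
```

On this path graph the harmonic solution interpolates linearly between the two labeled ends, and thresholding the scores at 0.5 gives the class prediction for each node.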
Decentralized Federated Averaging|
Subtitled arXiv preprint arXiv:2104.11375, T. Sun, D. Li, B. Wang. 2021.
Federated averaging (FedAvg) is a communication efficient algorithm for the distributed training with an enormous number of clients. In FedAvg, clients keep their data locally for privacy protection; a central parameter server is used to communicate between clients. This central server distributes the parameters to each client and collects the updated parameters from clients. FedAvg is mostly studied in centralized fashions, which requires massive communication between server and clients in each communication. Moreover, attacking the central server can break the whole system's privacy. In this paper, we study the decentralized FedAvg with momentum (DFedAvgM), which is implemented on clients that are connected by an undirected graph. In DFedAvgM, all clients perform stochastic gradient descent with momentum and communicate with their neighbors only. To further reduce the communication cost, we also consider the quantized DFedAvgM. We prove convergence of the (quantized) DFedAvgM under trivial assumptions; the convergence rate can be improved when the loss function satisfies the PŁ property. Finally, we numerically verify the efficacy of DFedAvgM.
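The communication pattern described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: each client holds a scalar parameter, uses exact gradients of a hypothetical quadratic local loss instead of stochastic ones, runs a few heavy-ball momentum steps locally, and then averages with its graph neighbors through a doubly stochastic mixing matrix. All names and values are made up for the example, and quantization is omitted.

```python
def local_momentum_steps(x0, grad, lr, momentum, steps):
    """Run heavy-ball gradient steps starting from x0 on a single client."""
    y_prev, y = x0, x0
    for _ in range(steps):
        y_next = y - lr * grad(y) + momentum * (y - y_prev)
        y_prev, y = y, y_next
    return y


def dfedavgm_round(params, grads, mixing, lr, momentum, local_steps):
    """One round: local training on every client, then neighbor-only averaging."""
    local = [local_momentum_steps(x, g, lr, momentum, local_steps)
             for x, g in zip(params, grads)]
    n = len(params)
    # mixing[i][j] is nonzero only if j is client i itself or one of its neighbors.
    return [sum(mixing[i][j] * local[j] for j in range(n)) for i in range(n)]


# Hypothetical setup: 4 clients on a ring, client i minimizes 0.5 * (x - targets[i]) ** 2.
targets = [0.0, 1.0, 2.0, 3.0]
grads = [lambda x, a=a: x - a for a in targets]
third = 1.0 / 3.0
ring_mixing = [
    [third, third, 0.0, third],
    [third, third, third, 0.0],
    [0.0, third, third, third],
    [third, 0.0, third, third],
]

params = [0.0, 0.0, 0.0, 0.0]
for _ in range(100):
    params = dfedavgm_round(params, grads, ring_mixing,
                            lr=0.1, momentum=0.5, local_steps=2)
mean_param = sum(params) / len(params)
```

No central server appears anywhere in the loop: each client only reads the values of the clients it shares an edge with. In this toy problem the network average approaches the minimizer of the summed losses (1.5), while individual clients stay close to, but not exactly at, consensus because of the local steps.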
Leveraging 31 Million Google Street View Images to Characterize Built Environments and Examine County Health Outcomes |
Q. C. Nguyen, J. M. Keralis, P. Dwivedi, A. E. Ng, M. Javanmardi, S. Khanna, Y. Huang, K. D. Brunisholz, A. Kumar, T. Tasdizen. In Public Health Reports, Vol. 136, No. 2, SAGE Publications, pp. 201-211. 2021.
Methods: We leveraged computer vision and Google Street View images accessed from December 15, 2017, through July 17, 2018, to detect features of the built environment (presence of a crosswalk, non–single-family home, single-lane roads, and visible utility wires) for 2916 US counties. We used multivariate linear regression models to determine associations between features of the built environment and county-level health outcomes (prevalence of adult obesity, prevalence of diabetes, physical inactivity, frequent physical and mental distress, poor or fair self-rated health, and premature death [in years of potential life lost]).
Results: Compared with counties with the least number of crosswalks, counties with the most crosswalks were associated with decreases of 1.3%, 2.7%, and 1.3% of adult obesity, physical inactivity, and fair or poor self-rated health, respectively, and 477 fewer years of potential life lost before age 75 (per 100 000 population). The presence of non–single-family homes was associated with lower levels of all health outcomes except for premature death. The presence of single-lane roads was associated with an increase in physical inactivity, frequent physical distress, and fair or poor self-rated health. Visible utility wires were associated with increases in adult obesity, diabetes, physical and mental distress, and fair or poor self-rated health.
Conclusions: The use of computer vision and big data image sources makes possible national studies of the built environm
Understanding a program's resiliency through error propagation|
Z. Li, H. Menon, K. Mohror, P. T. Bremer, Y. Livnat, V. Pascucci. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, pp. 362-373. 2021.
Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), a serious problem in which a fault introduces an error that is not readily detected into an HPC simulation. Due to the insidious nature of SDCs, researchers have worked to understand their impact on applications. Previous studies have relied on expensive fault injection campaigns with uniform sampling to provide overall SDC rates, but this solution does not provide any feedback on the code regions without samples.
Blueprint: Cyberinfrastructure Center of Excellence|
Subtitled arXiv, E. Deelman, A. Mandal, A. P. Murillo, J. Nabrzyski, V. Pascucci, R. Ricci, I. Baldin, S. Sons, L. Christopherson, C. Vardeman, R. F. da Silva, J. Wyngaard, S. Petruzza, M. Rynge, K. Vahi, W. R. Whitcup, J. Drake, E. Scott. 2021.
In 2018, NSF funded an effort to pilot a Cyberinfrastructure Center of Excellence (CI CoE or Center) that would serve the cyberinfrastructure (CI) needs of the NSF Major Facilities (MFs) and large projects with advanced CI architectures. The goal of the CI CoE Pilot project (Pilot) effort was to develop a model and a blueprint for such a CoE by engaging with the MFs, understanding their CI needs, understanding the contributions the MFs are making to the CI community, and exploring opportunities for building a broader CI community. This document summarizes the results of community engagements conducted during the first two years of the project and describes the identified CI needs of the MFs. To better understand MFs' CI, the Pilot has developed and validated a model of the MF data lifecycle that follows the data generation and management within a facility and gained an understanding of how this model captures the fundamental stages that the facilities' data passes through from the scientific instruments to the principal investigators and their teams, to the broader collaborations and the public. The Pilot also aimed to understand what CI workforce development challenges the MFs face while designing, constructing, and operating their CI and what solutions they are exploring and adopting within their projects. Based on the needs of the MFs in the data lifecycle and workforce development areas, this document outlines a blueprint for a CI CoE that will learn about and share the CI solutions designed, developed, and/or adopted by the MFs, provide expertise to the largest NSF projects with advanced and complex CI architectures, and foster a …
Symplectic Time Integration Methods for the Material Point Method, Experiments, Analysis and Order Reduction|
M. Berzins. In WCCM-ECCOMAS2020 virtual Conference, January, 2021.
The provision of appropriate time integration methods for the Material Point Method (MPM) involves considering stability, accuracy and energy conservation. One class of methods that addresses many of these issues is the widely used symplectic time integration methods. Such methods have good conservation properties and have the potential to achieve high accuracy. In this work we build on the work in  and consider high order methods for the time integration of the Material Point Method. The results of practical experiments show that while high order methods in both space and time have good accuracy initially, unless the problem has relatively little particle movement, the accuracy of the methods at later times is closer to that of low order methods. A theoretical analysis explains these results as being similar to the stage error found in Runge-Kutta methods, though in this case the stage error arises from the MPM differentiations and interpolations from particles to grid and back again, particularly in cases in which there are many grid crossings.
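The conservation property mentioned above can be illustrated on a much simpler system than MPM. The sketch below is a generic textbook comparison, not taken from the paper: it integrates a unit-mass, unit-stiffness harmonic oscillator with the non-symplectic forward Euler method and with the symplectic velocity Verlet (Störmer-Verlet) method, then compares the final energies. The step size and step count are arbitrary choices.

```python
def forward_euler(q, p, dt, steps):
    """Non-symplectic explicit Euler for dq/dt = p, dp/dt = -q."""
    for _ in range(steps):
        q, p = q + dt * p, p - dt * q
    return q, p


def velocity_verlet(q, p, dt, steps):
    """Symplectic velocity Verlet (half kick, drift, half kick) for the same system."""
    for _ in range(steps):
        p_half = p - 0.5 * dt * q   # half step in momentum, force = -q
        q = q + dt * p_half         # full step in position
        p = p_half - 0.5 * dt * q   # second half step with the updated force
    return q, p


def energy(q, p):
    """Total energy of the unit harmonic oscillator."""
    return 0.5 * (q * q + p * p)


initial_energy = energy(1.0, 0.0)
euler_energy = energy(*forward_euler(1.0, 0.0, dt=0.1, steps=1000))
verlet_energy = energy(*velocity_verlet(1.0, 0.0, dt=0.1, steps=1000))
```

Forward Euler multiplies the energy by (1 + dt^2) at every step, so it grows without bound, whereas the energy error of velocity Verlet stays bounded at order dt^2 over long runs.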
A Heterogeneous MPI+PPL Task Scheduling Approach for Asynchronous Many-Task Runtime Systems|
J. K. Holmen, D. Sahasrabudhe, M. Berzins. In Proceedings of the Practice and Experience in Advanced Research Computing 2021 on Sustainability, Success and Impact (PEARC21), ACM, 2021.
Asynchronous many-task runtime systems and MPI+X hybrid parallelism approaches have shown promise for helping manage the increasing complexity of nodes in current and emerging high performance computing (HPC) systems, including those for exascale. The increasing architectural diversity, however, poses challenges for large legacy runtime systems emphasizing broad support for major HPC systems. Performance portability layers (PPL) have shown promise for helping manage this diversity. This paper describes a heterogeneous MPI+PPL task scheduling approach for combining these promising solutions with additional consideration for parallel third party libraries facing similar challenges to help prepare such a runtime for the diverse heterogeneous systems accompanying exascale computing. This approach is demonstrated using a heterogeneous MPI+Kokkos task scheduler and the accompanying portable abstractions implemented in the Uintah Computational Framework, an asynchronous many-task runtime system, with additional consideration for hypre, a parallel third party library. Results are shown for two challenging problems executing workloads representative of typical Uintah applications. These results show performance improvements up to 4.4x when using this scheduler and the accompanying portable abstractions to port a previously MPI-Only problem to Kokkos::OpenMP and Kokkos::CUDA to improve multi-socket, multi-device node use. Good strong-scaling to 1,024 NVIDIA V100 GPUs and 512 IBM POWER9 processors is also shown using MPI+Kokkos::OpenMP+Kokkos::CUDA at scale.
Logically Parallel Communication for Fast MPI+Threads Communication|
R. Zambre, D. Sahasrabudhe, H. Zhou, M. Berzins, A. Chandramowlishwaran, P. Balaji. In Proceedings of the Transactions on Parallel and Distributed Computing, IEEE, April, 2021.
Supercomputing applications are increasingly adopting the MPI+threads programming model over the traditional “MPI everywhere” approach to better handle the disproportionate increase in the number of cores compared with other on-node resources. In practice, however, most applications observe a slower performance with MPI+threads primarily because of poor communication performance. Recent research efforts on MPI libraries address this bottleneck by mapping logically parallel communication, that is, operations that are not subject to MPI’s ordering constraints to the underlying network parallelism. Domain scientists, however, typically do not expose such communication independence information because the existing MPI-3.1 standard’s semantics can be limiting. Researchers had initially proposed user-visible endpoints to combat this issue, but such a solution requires intrusive changes to the standard (new APIs). The upcoming MPI-4.0 standard, on the other hand, allows applications to relax unneeded semantics and provides them with many opportunities to express logical communication parallelism. In this paper, we show how MPI+threads applications can achieve high performance with logically parallel communication. Through application case studies, we compare the capabilities of the new MPI-4.0 standard with those of the existing one and user-visible endpoints (upper bound). Logical communication parallelism can boost the overall performance of an application by over 2x.
Optimizing the Hypre solver for manycore and GPU architectures|
D. Sahasrabudhe, R. Zambre, A. Chandramowlishwaran, M. Berzins. In Journal of Computational Science, Springer International Publishing, pp. 101279. 2020.
The solution of large-scale combustion problems with codes such as Uintah on modern computer architectures requires the use of multithreading and GPUs to achieve performance. Uintah uses a low-Mach number approximation that requires iteratively solving a large system of linear equations. The Hypre iterative solver has solved such systems in a scalable way for Uintah, but the use of OpenMP with Hypre leads to at least 2x slowdown due to OpenMP overheads. The proposed solution uses the MPI Endpoints within Hypre, where each team of threads acts as a different MPI rank. This approach minimizes OpenMP synchronization overhead and performs as fast or (up to 1.44x) faster than Hypre’s MPI-only version, and allows the rest of Uintah to be optimized using OpenMP. The profiling of the GPU version of Hypre shows the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using Cuda-aware MPI, resulting in an overall speedup of 1.16–1.44x compared to the baseline GPU implementation.
Report from the NSF Workshop on Smart Cyberinfrastructure 2020|
V. Pascucci, I. Altintas, J. Fortes, I. Foster, H. Gu, S. Hariri, D. Stanzione, M. Taufer, X. Zhao. NSF, 2020.
Machine learning and other Artificial Intelligence technologies (all indicated in the following as AI) used within a modern, smart cyberinfrastructure have become critical new avenues for discovery and validation in data-driven science and engineering disciplines of all kinds. We can expect many landmark discoveries and new lines of productive research to be enabled through AI analysis of the rapidly growing treasure trove of scientific data. AI-based techniques have been applied in many fields of science and engineering, including remote sensing, cosmology, energy, cancer research, IT systems management, and machine design and control, but the lack of proper integration with the current NSF-supported cyberinfrastructure is limiting their potential. Recent events due to the COVID-19 pandemic have highlighted how cyberinfrastructure is a crucial enabler of modern research, with massive simulations and data management capabilities [8-10], but these events have also emphasized how the lack of proper integration with AI technology remains a major limiting factor for the advancement of science and engineering, especially when any kind of rapid response is needed.
A Terminology for In Situ Visualization and Analysis Systems|
H. Childs, S. D. Ahern, J. Ahrens, A. C. Bauer, J. Bennett, E. W. Bethel, P. Bremer, E. Brugger, J. Cottam, M. Dorier, S. Dutta, J. M. Favre, T. Fogal, S. Frey, C. Garth, B. Geveci, W. F. Godoy, C. D. Hansen, C. Harrison, B. Hentschel, J. Insley, C. R. Johnson, S. Klasky, A. Knoll, J. Kress, M. Larsen, J. Lofstead, K. Ma, P. Malakar, J. Meredith, K. Moreland, P. Navratil, P. O’Leary, M. Parashar, V. Pascucci, J. Patchett, T. Peterka, S. Petruzza, N. Podhorszki, D. Pugmire, M. Rasquin, S. Rizzi, D. H. Rogers, S. Sane, F. Sauer, R. Sisneros, H. Shen, W. Usher, R. Vickery, V. Vishwanath, I. Wald, R. Wang, G. H. Weber, B. Whitlock, M. Wolf, H. Yu, S. B. Ziegeler. In International Journal of High Performance Computing Applications, Vol. 34, No. 6, pp. 676–691. 2020.
The term “in situ processing” has evolved over the last decade to mean both a specific strategy for visualizing and analyzing data and an umbrella term for a processing paradigm. The resulting confusion makes it difficult for visualization and analysis scientists to communicate with each other and with their stakeholders. To address this problem, a group of over fifty experts convened with the goal of standardizing terminology. This paper summarizes their findings and proposes a new terminology for describing in situ systems. An important finding from this group was that in situ systems are best described via multiple, distinct axes: integration type, proximity, access, division of execution, operation controls, and output type. This paper discusses these axes, evaluates existing systems within the axes, and explores how currently used terms relate to the axes.
Numerical Testing of a New Positivity-Preserving Interpolation Algorithm|
Subtitled arXiv, T. A. J. Ouermi, R. M. Kirby, M. Berzins. 2020.
An important component of a number of computational modeling algorithms is an interpolation method that preserves the positivity of the function being interpolated. This report describes the numerical testing of a new positivity-preserving algorithm that is designed to be used when interpolating from a solution defined on one grid to a different spatial grid. The motivating application is a numerical weather prediction (NWP) code that uses spectral elements as the discretization choice for its dynamics core and Cartesian product meshes for the evaluation of its physics routines. This combination of spectral elements, which use nonuniformly spaced quadrature/collocation points, and uniformly spaced Cartesian meshes, combined with the desire to maintain positivity when moving between these, necessitates our work. This new approach is evaluated against several typical algorithms in use on a range of test problems in one or more space dimensions. The results obtained show that the new method is competitive in terms of observed accuracy while at the same time preserving the underlying positivity of the functions being interpolated.
Improving Performance of the Hypre Iterative Solver for Uintah Combustion Codes on Manycore Architectures Using MPI Endpoints and Kernel Consolidation|
D. Sahasrabudhe, M. Berzins. In Computational Science – ICCS 2020, 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part I, Springer International Publishing, pp. 175–190. 2020.
The solution of large-scale combustion problems with codes such as the Arches component of Uintah on next-generation computer architectures requires the use of a many- and multi-core threaded approach and/or GPUs to achieve performance. Such codes often use a low-Mach number approximation, which requires the iterative solution of a large system of linear equations at every time step. While the discretization routines in such a code can be improved by the use of, say, OpenMP or CUDA approaches, it is important that the linear solver be able to perform well too. For Uintah, the Hypre iterative solver has proved to solve such systems in a scalable way. The use of Hypre with OpenMP leads to at least 2x slowdowns due to OpenMP overheads, however. This behavior is analyzed, and a proposed solution using the MPI Endpoints approach is implemented within Hypre, where each team of threads acts as a different MPI rank. This approach minimized OpenMP synchronization overhead, avoided slowdowns, performed as fast or (up to 1.5x) faster than Hypre’s MPI-only version, and allowed the rest of Uintah to be optimized using OpenMP. Profiling of the GPU version of Hypre showed the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using CUDA-aware MPI. An overall speedup of 1.26x to 1.44x was observed compared to the baseline GPU implementation.
Distributed Resources for the Earth System Grid Advanced Management (DREAM), Final Report|
L. Cinquini, S. Petruzza, Jason J. Boutte, S. Ames, G. Abdulla, V. Balaji, R. Ferraro, A. Radhakrishnan, L. Carriere, T. Maxwell, G. Scorzelli, V. Pascucci. 2020.
The DREAM project was funded more than 3 years ago to design and implement a next-generation ESGF (Earth System Grid Federation) architecture which would be suitable for managing and accessing data and services resources on a distributed and scalable environment. In particular, the project intended to focus on the computing and visualization capabilities of the stack, which at the time were rather primitive. At the beginning, the team had the general notion that a better ESGF architecture could be built by modularizing each component, and redefining its interaction with other components by defining and exposing a well-defined API. Although this was still the high-level principle that guided the work, the DREAM project was able to accomplish its goals by leveraging new practices in IT that started just about 3 or 4 years ago: the advent of containerization technologies (specifically, Docker), the development of frameworks to manage containers at scale (Docker Swarm and Kubernetes), and their application to the commercial Cloud. Thanks to these new technologies, DREAM was able to improve the ESGF architecture (including its computing and visualization services) to a level of deployability and scalability beyond the original expectations.
CPU Ray Tracing of Tree-Based Adaptive Mesh Refinement Data|
F. Wang, N. Marshak, W. Usher, C. Burstedde, A. Knoll, T. Heister, C. R. Johnson. In Eurographics Conference on Visualization (EuroVis) 2020, Vol. 39, No. 3, 2020.
Adaptive mesh refinement (AMR) techniques allow for representing a simulation’s computation domain in an adaptive fashion. Although these techniques have found widespread adoption in high-performance computing simulations, visualizing their data output interactively and without cracks or artifacts remains challenging. In this paper, we present an efficient solution for direct volume rendering and hybrid implicit isosurface ray tracing of tree-based AMR (TB-AMR) data. We propose a novel reconstruction strategy, Generalized Trilinear Interpolation (GTI), to interpolate across AMR level boundaries without cracks or discontinuities in the surface normal. We employ a general sparse octree structure supporting a wide range of AMR data, and use it to accelerate volume rendering, hybrid implicit isosurface rendering and value queries. We demonstrate that our approach achieves artifact-free isosurface and volume rendering and provides higher quality output images compared to existing methods at interactive rendering rates.
A convected particle least square interpolation material point method|
Q. A. Tran, W. Sołowski, M. Berzins, J. Guilkey. In International Journal for Numerical Methods in Engineering, Wiley, October, 2019.
Applying the convected particle domain interpolation (CPDI) to the material point method has many advantages over the original material point method, including significantly improved accuracy. However, in the large deformation regime, the CPDI still may not retain the expected convergence rate. The paper proposes an enhanced CPDI formulation based on least square reconstruction technique. The convected particle least square interpolation (CPLS) material point method assumes the velocity field inside the material point domain as nonconstant. This velocity field in the material point domain is mapped to the background grid nodes with a moving least squares reconstruction. In this paper, we apply the improved moving least squares method to avoid the instability of the conventional moving least squares method due to a singular matrix. The proposed algorithm can improve convergence rate, as illustrated by numerical examples using the method of manufactured solutions.
In situ visualization of performance metrics in multiple domains|
A. Sanderson, A. Humphrey, J. Schmidt, R. Sisneros, M. Papka. In 2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools), IEEE, Nov, 2019.
As application scientists develop and deploy simulation codes onto leadership-class computing resources, there is a need to instrument these codes to better understand performance and to utilize these resources efficiently. This instrumentation may come from independent third-party tools that generate and store performance metrics or from custom instrumentation tools built directly into the application. The metrics collected are then available for visual analysis, typically in the domain in which they were collected. In this paper, we introduce an approach to visualize and analyze the performance metrics in situ in the context of the machine, application, and communication domains (MAC model) using a single visualization tool. This visualization model provides a holistic view of the application performance in the context of the resources where it is executing.
A Portable SIMD Primitive using Kokkos for Heterogeneous Architectures|
D. Sahasrabudhe, E. T. Phipps, S. Rajamanickam, M. Berzins. In Sixth Workshop on Accelerator Programming Using Directives (WACCPD), 2019.
As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a GPU back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (SIMD) primitive proposed in this work allows intrinsics-based vectorization on CPUs and many-core architectures such as Intel Knights Landing (KNL), and also facilitates Single Instruction Multiple Threads (SIMT) based execution on GPUs. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new SIMD primitive is used on different architectures to test the performance boost against a hard-to-auto-vectorize baseline, to measure the overhead against an efficiently vectorized baseline, and to evaluate the new feature called the "logical vector length" (LVL). The SIMD primitive provides portability across CPUs and GPUs without any performance degradation being observed experimentally.