PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning|
R. Roy, J. Raiman, N. Kant, I. Elkin, R. Kirby, M. Siu, S. Oberman, S. Godil, B. Catanzaro. In 2021 58th ACM/IEEE Design Automation Conference (DAC), IEEE, pp. 853-858. 2021.
In this work, we present a reinforcement learning (RL) based approach to designing parallel prefix circuits such as adders or priority encoders that are fundamental to high-performance digital design. Unlike prior methods, our approach designs solutions tabula rasa purely through learning with synthesis in the loop. We design a grid-based state-action representation and an RL environment for constructing legal prefix circuits. Deep convolutional RL agents trained on this environment produce prefix adder circuits that Pareto-dominate existing baselines, with up to 16.0% and 30.2% lower area for the same delay in the 32b and 64b settings, respectively. We observe that agents trained with an open-source synthesis tool and cell library can design adder circuits that achieve lower area and delay than commercial tool adders in an industrial cell library.
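As a point of reference for the grid-based view of prefix networks (a hypothetical Python sketch, not the paper's RL environment), the snippet below enumerates the prefix nodes of a classical Kogge-Stone network on a (level, bit-index) grid and reports node count and depth, the area and delay proxies such an agent trades off.

```python
# Hypothetical sketch: a classical Kogge-Stone prefix network enumerated as
# (level, index) grid cells. Node count proxies area; depth proxies delay.
# The paper's agent edits such grids directly; this is not its environment.

def kogge_stone(n):
    """Return the prefix nodes and depth of an n-bit Kogge-Stone network."""
    nodes, span, level = [], 1, 0
    while span < n:
        level += 1
        for i in range(span, n):
            nodes.append((level, i))   # node combining bit i with bit i-span
        span *= 2
    return nodes, level

nodes, depth = kogge_stone(32)
print(f"32-bit Kogge-Stone: {len(nodes)} prefix nodes, depth {depth}")
# -> 129 prefix nodes, depth 5; the agent searches for grids that
#    Pareto-dominate such regular structures after synthesis.
```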
Uintah+Hedgehog: Combining Parallelism Models for End-to-End Large-Scale Simulation Performance|
J. K. Holmen, D. Sahasrabudhe, M. Berzins, A. Bardakoff, T. J. Blattner, W. Keyrouz. Scientific Computing and Imaging Institute, 2021.
The complexity of heterogeneous nodes near and at exascale has increased the need for “heroic” programming efforts. To accommodate this complexity, significant investment is required for codes that do not yet optimize for low-level architecture features (e.g., wide vector units) and/or run at large scale. This paper describes ongoing efforts to combine two codes lying at opposite ends of this spectrum, Hedgehog and Uintah, to ease programming efforts. The end goals of this effort are (1) to combine the two codes to make an asynchronous many-task runtime system specializing in both node-level and large-scale performance and (2) to further improve the accessibility of both with portable abstractions. A prototype adopting Hedgehog in Uintah and a prototype extending Hedgehog to support MPI+X hybrid parallelism are discussed. Results achieving ∼60% of NVIDIA V100 GPU peak performance for a distributed DGEMM problem are shown for a naive MPI+Hedgehog implementation before any attempt to optimize for performance.
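A minimal sketch of the MPI-level decomposition behind such a naive distributed DGEMM, assuming mpi4py and NumPy (the actual prototype uses Hedgehog's C++ dataflow graphs with GPU BLAS): each rank multiplies its row block of A by a replicated B, and the row blocks of C are gathered at the root.

```python
# Minimal sketch of a naive MPI-distributed DGEMM (assumes mpi4py + NumPy).
# This only illustrates the MPI-level row-block decomposition; Hedgehog
# would pipeline and tile the local compute on the GPU.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n = 1024                       # global size (assumed divisible by size)
rows = n // size

A_block = np.random.rand(rows, n)   # each rank owns one row block of A
B = np.empty((n, n))                # B is replicated on every rank
if rank == 0:
    B[:] = np.random.rand(n, n)
comm.Bcast(B, root=0)

C_block = A_block @ B               # local compute
C = np.empty((n, n)) if rank == 0 else None
comm.Gather(C_block, C, root=0)     # row blocks concatenate in rank order
if rank == 0:
    print("C assembled:", C.shape)
```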
Enabling microservices management for Deep Learning applications across the Edge-Cloud Continuum|
Z. Houmani, D. Balouek-Thomert, E. Caron, M. Parashar. In SBAC-PAD 2021 - IEEE 33rd International Symposium on Computer Architecture and High Performance Computing, October, 2021.
Deep Learning has shifted the focus of traditional batch workflows to data-driven feature engineering on streaming data. In particular, the execution of Deep Learning workflows presents expectations of near-real-time results with user-defined acceptable accuracy. Meeting the objectives of such applications across heterogeneous resources located at the edge of the network, the core, and in-between requires managing trade-offs between the accuracy and the urgency of the results. However, current data analysis systems rarely manage the entire Deep Learning pipeline along the data path, making it complex for developers to implement strategies in real-world deployments. Driven by an object detection use case, this paper presents an architecture for time-critical Deep Learning workflows by providing a data-driven scheduling approach to distribute the pipeline across Edge to Cloud resources. Furthermore, it adopts a data management strategy that reduces the resolution of incoming data when potential trade-off optimizations are available. We illustrate the system's viability through a performance evaluation of the object detection use case on the Grid'5000 testbed. We demonstrate that in a multiuser scenario, with a standard frame rate of 25 frames per second, the system speeds up data analysis by up to 54.4% compared to a Cloud-only scenario while keeping analysis accuracy above a fixed threshold.
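The resolution-reduction strategy can be pictured with a small, purely illustrative policy (hypothetical names and thresholds, not the paper's implementation): when the analysis queue falls behind the 25 fps input, frames are downsampled before being dispatched.

```python
# Hypothetical sketch of the accuracy/urgency trade-off: downsample frames
# when the pipeline is congested. Names and thresholds are illustrative.
import numpy as np

def manage_frame(frame: np.ndarray, queue_depth: int, max_depth: int = 25):
    """frame: HxWxC image; queue_depth: frames waiting for analysis."""
    if queue_depth > max_depth:
        # Halve the resolution (4x fewer pixels): trade accuracy for urgency.
        return frame[::2, ::2]
    return frame

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
print(manage_frame(frame, queue_depth=40).shape)   # (360, 640, 3)
```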
Physics-Informed Neural Networks (PINNs) for Parameterized PDEs: A Metalearning Approach|
Subtitled arXiv preprint arXiv:2110.13361, M. Penwarden, S. Zhe, A. Narayan, R. M. Kirby. 2021.
Physics-informed neural networks (PINNs) as a means of discretizing partial differential equations (PDEs) are garnering much attention in the Computational Science and Engineering (CS&E) world. At least two challenges exist for PINNs at present: an understanding of accuracy and convergence characteristics with respect to tunable parameters and identification of optimization strategies that make PINNs as efficient as other computational science tools. The cost of PINNs training remains a major challenge of Physics-informed Machine Learning (PiML) – and, in fact, machine learning (ML) in general. This paper is meant to move towards addressing the latter through the study of PINNs for parameterized PDEs. Following the ML world, we introduce metalearning of PINNs for parameterized PDEs. By introducing metalearning and transfer learning concepts, we can greatly accelerate the PINNs optimization process. We present a survey of model-agnostic metalearning, and then discuss our model-aware metalearning applied to PINNs. We provide theoretically motivated and empirically backed assumptions that make our metalearning approach possible. We then test our approach on various canonical forward parameterized PDEs that have been presented in the emerging PINNs literature.
Exploring the Role of Machine Learning in Scientific Workflows: Opportunities and Challenges|
Subtitled arXiv preprint arXiv:2110.13999, A. Nouri, P.E. Davis, P. Subedi, M. Parashar. 2021.
In this survey, we discuss the challenges of executing scientific workflows as well as existing Machine Learning (ML) techniques to alleviate those challenges. We provide the context and motivation for applying ML to each step of the execution of these workflows. Furthermore, we provide recommendations on how to extend ML techniques to unresolved challenges in the execution of scientific workflows. Moreover, we discuss the possibility of using ML techniques for in-situ operations. We explore the challenges of in-situ workflows and provide suggestions for improving the performance of their execution using ML techniques.
Meta-Learning with Adjoint Methods|
Subtitled arXiv preprint arXiv:2110.08432, S. Li, Z. Wang, A. Narayan, R. Kirby, S. Zhe. 2021.
Model Agnostic Meta-Learning (MAML) is widely used to find a good initialization for a family of tasks. Despite its success, a critical challenge in MAML is to calculate the gradient w.r.t. the initialization of a long training trajectory for the sampled tasks, because the computation graph can rapidly explode and the computation is very expensive. To address this problem, we propose Adjoint MAML (A-MAML). We view gradient descent in the inner optimization as the evolution of an Ordinary Differential Equation (ODE). To efficiently compute the gradient of the validation loss w.r.t. the initialization, we use the adjoint method to construct a companion, backward ODE. To obtain the gradient w.r.t. the initialization, we only need to run the standard ODE solver twice -- once forward in time, evolving a long trajectory of gradient flow for the sampled task, and once backward, solving the adjoint ODE. We need not create or expand any intermediate computational graphs, adopt aggressive approximations, or impose proximal regularizers in the training loss. Our approach is cheap, accurate, and adaptable to different trajectory lengths. We demonstrate the advantage of our approach in both synthetic and real-world meta-learning tasks.
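The two ODE solves can be illustrated on a toy quadratic inner loss, where the adjoint construction is exact (a sketch written for this summary, not the authors' code; the quadratic loss and SciPy solver are assumptions of the example).

```python
# Toy illustration of the adjoint idea: inner gradient descent is treated as
# the ODE dtheta/dt = -grad L_train(theta); the meta-gradient dJ/dw of the
# validation loss J(w) = L_val(theta(T)) is lambda(0), where lambda solves
# the backward adjoint ODE lambda' = H(theta(t)) lambda, lambda(T) = grad L_val.
import numpy as np
from scipy.integrate import solve_ivp

A = np.diag([1.0, 3.0])          # Hessian of the quadratic training loss
a = np.array([1.0, -1.0])        # training-loss minimizer
b = np.array([0.5, 0.5])         # validation target
T = 1.0                          # inner-trajectory length
w = np.array([2.0, 2.0])         # initialization (the meta-parameter)

# Forward ODE: gradient flow from the initialization w.
fwd = solve_ivp(lambda t, th: -A @ (th - a), (0.0, T), w, dense_output=True)

# Backward adjoint ODE (here the Hessian is constant, so theta(t) is unused).
lam_T = fwd.sol(T) - b           # grad of L_val = 0.5*||theta(T) - b||^2
bwd = solve_ivp(lambda t, lam: A @ lam, (T, 0.0), lam_T, dense_output=True)

meta_grad = bwd.sol(0.0)         # dJ/dw, with no stored computation graph
print("adjoint meta-gradient:", meta_grad)
# Analytic check for this linear case: dJ/dw = expm(-A*T) @ (theta(T) - b).
```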
Scalable Graph Embedding Learning on a Single GPU|
Subtitled arXiv preprint arXiv:2110.06991, A. Nouri, P.E. Davis, P. Subedi, M. Parashar. 2021.
Graph embedding techniques have attracted growing interest since they convert graph data into a continuous, low-dimensional space. Effective graph analytics gives users a deeper understanding of what is behind the data and can thus benefit a variety of machine learning tasks. At the scale of current real-world applications, most graph analytics methods suffer from high computation and space costs: they can process networks with thousands to a few million nodes, but scaling to large networks remains a challenge. The complexity of training graph embedding systems requires the use of existing accelerators such as GPUs. In this paper, we introduce a hybrid CPU-GPU framework that addresses the challenges of learning embeddings of large-scale graphs. The performance of our method is compared qualitatively and quantitatively with existing embedding systems on common benchmarks. We also show that our system can scale training to datasets an order of magnitude greater than a single machine's total memory capacity. The effectiveness of the learned embeddings is evaluated within multiple downstream applications. The experimental results indicate the effectiveness of the learned embeddings in terms of performance and accuracy.
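The CPU-GPU division of labor might look as follows in a minimal skip-gram sketch (an assumed structure for illustration, not the paper's system): the CPU samples (node, context, negatives) triples from the graph while the GPU applies sparse embedding updates.

```python
# Illustrative sketch of a hybrid CPU-GPU embedding pipeline (assumed names).
# In a real system, CPU sampling would run in background workers that
# overlap with GPU compute; here the stages simply alternate.
import torch

num_nodes, dim, num_neg = 10_000, 128, 5
device = "cuda" if torch.cuda.is_available() else "cpu"
emb = torch.nn.Embedding(num_nodes, dim, sparse=True).to(device)
opt = torch.optim.SparseAdam(emb.parameters(), lr=0.01)

def cpu_sample(batch):   # stand-in for random-walk sampling on the CPU
    src = torch.randint(num_nodes, (batch,))
    pos = torch.randint(num_nodes, (batch,))
    neg = torch.randint(num_nodes, (batch, num_neg))
    return src, pos, neg

for step in range(100):
    src, pos, neg = cpu_sample(1024)             # CPU: graph sampling
    s = emb(src.to(device))                      # GPU: embedding updates
    p, n = emb(pos.to(device)), emb(neg.to(device))
    pos_loss = -torch.nn.functional.logsigmoid((s * p).sum(-1)).mean()
    neg_loss = -torch.nn.functional.logsigmoid(
        -(n @ s.unsqueeze(-1)).squeeze(-1)).mean()
    opt.zero_grad(); (pos_loss + neg_loss).backward(); opt.step()
```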
RISE: Reducing I/O Contention in Staging-based Extreme-Scale In-situ Workflows|
P. Subedi, P.E. Davis, M. Parashar. In 2021 IEEE International Conference on Cluster Computing (CLUSTER), pp. 146--156. 2021.
While in-situ workflow formulations have addressed some of the data-related challenges associated with extreme-scale scientific workflows, these workflows involve complex interactions and different modes of data exchange. In the context of increasing system complexity, such workflows present significant resource management challenges, requiring complex cost-performance tradeoffs. This paper presents RISE, an intelligent staging-based data management middleware, which builds on the DataSpaces framework and performs intelligent scheduling of data management operations to reduce I/O contention. In RISE, data are always written immediately to local buffers to reduce the impact of data transfers on application performance. RISE identifies applications’ data access patterns and moves data towards data consumers only when the network is expected to be idle, reducing the impact of asynchronous …
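The write-behind policy at the core of this design can be pictured with a small sketch (a hypothetical API, not the DataSpaces/RISE implementation): writes land in a local buffer immediately, and transfers to staging drain only when the network looks idle.

```python
# Conceptual sketch of a write-behind buffer (hypothetical API).
import collections

class WriteBehindBuffer:
    def __init__(self, transfer, network_is_idle):
        self.queue = collections.deque()
        self.transfer = transfer               # callable that ships one object
        self.network_is_idle = network_is_idle # callable probing the network

    def put(self, key, data):
        # Returns immediately: no I/O on the application's critical path.
        self.queue.append((key, data))

    def drain(self):
        # Ship buffered data only while the network is expected to be idle.
        while self.queue and self.network_is_idle():
            self.transfer(*self.queue.popleft())

buf = WriteBehindBuffer(transfer=lambda k, d: print("staged", k),
                        network_is_idle=lambda: True)
buf.put("step0/temperature", b"...")
buf.drain()
```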
Structured Adaptive Mesh Refinement Adaptations to Retain Performance Portability With Increasing Heterogeneity|
A. Dubey, M. Berzins, C. Burstedde, M. L. Norman, D. Unat, M. Wahib. In Computing in Science & Engineering, Vol. 23, No. 5, pp. 62-66. 2021.
Adaptive mesh refinement (AMR) is an important method that enables many mesh-based applications to run at effectively higher resolution within limited computing resources by allowing high resolution only where it is really needed. This advantage comes at a cost, however: greater complexity in the mesh management machinery and challenges with load distribution. With the current trend of increasing heterogeneity in hardware architecture, AMR presents an orthogonal axis of complexity. The usual techniques necessary to obtain reasonable performance, such as asynchronous communication and the management of hierarchies of parallelism and memory, are very challenging to reason about in combination with AMR. Different groups working with AMR are bringing different approaches to this challenge. Here, we examine the design choices of several AMR codes and also the degree to which demands placed on them by their users influence these choices.
Characterizing possible failure modes in physics-informed neural networks|
Subtitled arXiv preprint arXiv:2109.01050, A.S. Krishnapriyan, A. Gholami, S. Zhe, R.M. Kirby, M.W. Mahoney. 2021.
Recent work in scientific machine learning has developed so-called physics-informed neural network (PINN) models. The typical approach is to incorporate physical domain knowledge as soft constraints on an empirical loss function and use existing machine learning methodologies to train the model. We demonstrate that, while existing PINN methodologies can learn good models for relatively trivial problems, they can easily fail to learn relevant physical phenomena even for simple PDEs. In particular, we analyze several distinct situations of widespread physical interest, including learning differential equations with convection, reaction, and diffusion operators. We provide evidence that the soft regularization in PINNs, which involves differential operators, can introduce a number of subtle problems, including making the problem ill-conditioned. Importantly, we show that these possible failure modes are not due to the lack of expressivity in the NN architecture, but that the PINN's setup makes the loss landscape very hard to optimize. We then describe two promising solutions to address these failure modes. The first approach is to use curriculum regularization, where the PINN's loss term starts from a simple PDE regularization, and becomes progressively more complex as the NN gets trained. The second approach is to pose the problem as a sequence-to-sequence learning task, rather than learning to predict the entire space-time at once. Extensive testing shows that we can achieve up to 1-2 orders of magnitude lower error with these methods as compared to regular PINN training.
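The curriculum-regularization idea can be sketched for the convection equation u_t + β u_x = 0 (a minimal PyTorch PINN written for this summary, not the authors' code): training warm-starts as β is ramped from an easy value toward the hard target.

```python
# Sketch of curriculum regularization for a convection PINN (illustrative).
import math
import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def pde_residual(beta, n=1024):
    x = torch.rand(n, 1, requires_grad=True)      # x in [0, 1]
    t = torch.rand(n, 1, requires_grad=True)      # t in [0, 1]
    u = net(torch.cat([x, t], dim=1))
    ones = torch.ones_like(u)
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    return (u_t + beta * u_x).pow(2).mean()       # soft PDE constraint

def ic_loss(n=256):                               # u(x, 0) = sin(2*pi*x)
    x = torch.rand(n, 1)
    u0 = net(torch.cat([x, torch.zeros_like(x)], dim=1))
    return (u0 - torch.sin(2 * math.pi * x)).pow(2).mean()

for beta in [1.0, 5.0, 10.0, 20.0]:               # curriculum: easy -> hard
    for _ in range(2000):                          # warm-started each stage
        loss = pde_residual(beta) + ic_loss()
        opt.zero_grad(); loss.backward(); opt.step()
```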
Guiding Global Placement With Reinforcement Learning|
Subtitled arXiv preprint arXiv:2109.02631, R. Kirby, K. Nottingham, R. Roy, S. Godil, B. Catanzaro. 2021.
Recent advances in GPU-accelerated global and detail placement have reduced the time to solution by an order of magnitude. This advancement allows us to leverage data-driven optimization (such as Reinforcement Learning) in an effort to improve the final quality of placement results. In this work we augment state-of-the-art, force-based global placement solvers with a reinforcement learning agent trained to improve the final detail-placed Half-Perimeter Wire Length (HPWL). We propose novel control schemes with either global or localized control of the placement process. We then train reinforcement learning agents to use these controls to guide placement to improved solutions. In both cases, the augmented optimizer finds improved placement solutions. Our trained agents achieve an average 1% improvement in final detail-placed HPWL across a range of academic benchmarks and more than 1% in global placement HPWL on real industry designs.
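For reference, HPWL, the metric the agents optimize, is simply the summed half-perimeter of each net's pin bounding box; a straightforward reference implementation (written for this summary, not the placer's code) is:

```python
# Half-Perimeter Wire Length: for each net, the half-perimeter of the
# bounding box of its pins' locations, summed over all nets.
def hpwl(nets, positions):
    """nets: list of lists of cell ids; positions: {cell id: (x, y)}."""
    total = 0.0
    for net in nets:
        xs = [positions[c][0] for c in net]
        ys = [positions[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

positions = {"a": (0.0, 0.0), "b": (3.0, 1.0), "c": (1.0, 4.0)}
print(hpwl([["a", "b"], ["a", "b", "c"]], positions))   # 4.0 + 7.0 = 11.0
```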
Non-Dissipative and Structure-Preserving Emulators via Spherical Optimization|
Subtitled arXiv:2108.12053, D. Dai, Y. Epshteyn, A. Narayan. 2021.
Approximating a function with a finite series, e.g., involving polynomials or trigonometric functions, is a critical tool in computing and data analysis. The construction of such approximations via now-standard approaches like least squares or compressive sampling does not ensure that the approximation adheres to certain convex linear structural constraints, such as positivity or monotonicity. Existing approaches that ensure such structure are norm-dissipative, and this can have a deleterious impact when applying these approaches, e.g., when numerically solving partial differential equations. We present a new framework that enforces such structure on approximations via optimization and is simultaneously norm-preserving. This results in a conceptually simple convex optimization problem on the sphere, but the feasible set for such problems can be very complex. We establish well-posedness of the optimization problem through results on spherical convexity and design several spherical-projection-based algorithms to numerically compute the solution. Finally, we demonstrate the effectiveness of this approach through several numerical examples.
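The simplest instance of the feasible set, positivity of the coefficients at a fixed norm, admits a closed-form projection that a naive projected-gradient sketch can exploit (illustrative only; the paper develops more careful spherical-projection algorithms and the accompanying well-posedness theory).

```python
# Naive sketch of norm-preserving, positivity-constrained approximation:
# minimize ||A c - b||^2 over c with c >= 0 and ||c|| = r. The projection
# onto (orthant intersect sphere) is: clip negatives, rescale to the sphere.
import numpy as np

def project(c, r):
    c = np.maximum(c, 0.0)
    nrm = np.linalg.norm(c)
    return r * c / nrm if nrm > 0 else c

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
r = 1.0
c = project(rng.standard_normal(10), r)
step = 1.0 / np.linalg.norm(A, 2) ** 2      # stable step for least squares
for _ in range(500):
    c = project(c - step * A.T @ (A @ c - b), r)
print("residual:", np.linalg.norm(A @ c - b), "norm:", np.linalg.norm(c))
```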
Robust topology optimization with low rank approximation using artificial neural networks|
V. Keshavarzzadeh, R. M. Kirby, A. Narayan. In Computational Mechanics, 2021.
We present a low rank approximation approach for topology optimization of parametrized linear elastic structures. The parametrization is considered on loading and stiffness of the structure. The low rank approximation is achieved by identifying a parametric connection among coarse finite element models of the structure (associated with different design iterates) and is used to inform the high-fidelity finite element analysis. We build an Artificial Neural Network (ANN) map between low-resolution design iterates and their corresponding interpolative coefficients (obtained from low rank approximations) and use this surrogate to perform high-resolution parametric topology optimization. We demonstrate our approach on robust topology optimization with compliance constraints/objective functions and develop error bounds for the parametric compliance computations. We verify these parametric computations with more challenging quantities of interest such as the p-norm of von Mises stress. To conclude, we use our approach on a 3D robust topology optimization problem and show significant reduction in computational cost via quantitative measures.
Particle Merging-and-Splitting|
N. Truong, C. Yuksel, C. Watcharopas, J. A. Levine, R. M. Kirby. In IEEE Transactions on Visualization and Computer Graphics, IEEE, 2021.
Robustly handling collisions between individual particles in a large particle-based simulation has been a challenging problem. We introduce particle merging-and-splitting, a simple scheme for robustly handling collisions between particles that prevents inter-penetrations of separate objects without introducing numerical instabilities. This scheme merges colliding particles at the beginning of the time-step and then splits them at the end of the time-step. Thus, collisions last for the duration of a time-step, allowing neighboring particles of the colliding particles to influence each other. We show that our merging-and-splitting method is effective in robustly handling collisions and avoiding penetrations in particle-based simulations. We also show how our merging-and-splitting approach can be used for coupling different simulation systems using different and otherwise incompatible integrators. We present simulation tests …
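The merge/split bookkeeping can be illustrated with a toy momentum-conserving version (a sketch written for this summary; the authors' scheme also accounts for energy and other particle attributes).

```python
# Toy version of merge/split: momentum-conserving only.
import numpy as np

def merge(m1, v1, m2, v2):
    m = m1 + m2
    return m, (m1 * v1 + m2 * v2) / m           # center-of-mass velocity

def split(m1, m2, v_cm, u_rel, restitution=1.0):
    """Split the merged particle back, reversing the pre-merge relative
    velocity u_rel = v1 - v2, scaled by a restitution factor."""
    u = -restitution * u_rel
    m = m1 + m2
    v1 = v_cm + (m2 / m) * u
    v2 = v_cm - (m1 / m) * u
    return v1, v2                                # m1*v1 + m2*v2 == m*v_cm

m1, v1 = 1.0, np.array([2.0, 0.0])
m2, v2 = 3.0, np.array([-1.0, 0.0])
m, v_cm = merge(m1, v1, m2, v2)                  # collision lasts one time-step
w1, w2 = split(m1, m2, v_cm, v1 - v2)
print(w1, w2, m1 * w1 + m2 * w2, m * v_cm)       # momentum is preserved
```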
Multifidelity Modeling for Physics-Informed Neural Networks (PINNs)|
Subtitled arXiv preprint arXiv:2106.13361, M. Penwarden, S. Zhe, A. Narayan, R. M. Kirby. 2021.
Multifidelity simulation methodologies are often used in an attempt to judiciously combine low-fidelity and high-fidelity simulation results in an accuracy-increasing, cost-saving way. Candidates for this approach are simulation methodologies for which there are fidelity differences connected with significant computational cost differences. Physics-informed Neural Networks (PINNs) are candidates for these types of approaches due to the significant difference in training times required when different fidelities (expressed in terms of architecture width and depth as well as optimization criteria) are employed. In this paper, we propose a particular multifidelity approach applied to PINNs that exploits low-rank structure. We demonstrate that width, depth, and optimization criteria can be used as parameters related to model fidelity, and show numerical justification of cost differences in training due to fidelity parameter choices. We test our multifidelity scheme on various canonical forward PDE models that have been presented in the emerging PINNs literature.
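A generic multifidelity composition in this spirit might look as follows (a sketch under the assumption that the low-fidelity prediction is fed to the high-fidelity network; the paper's particular scheme additionally exploits low-rank structure across the parameter space).

```python
# Generic multifidelity composition (illustrative, not the paper's scheme):
# a cheap low-fidelity network is trained first, and the high-fidelity
# network learns on top of its (frozen) prediction.
import torch

def mlp(widths):
    layers = []
    for a, b in zip(widths[:-1], widths[1:]):
        layers += [torch.nn.Linear(a, b), torch.nn.Tanh()]
    return torch.nn.Sequential(*layers[:-1])    # no activation on the output

low = mlp([2, 16, 16, 1])        # narrow/shallow: cheap, coarse fidelity
high = mlp([3, 64, 64, 64, 1])   # wide/deep: takes (x, t, u_low) as input

def u_high(xt):
    with torch.no_grad():
        u_low = low(xt)          # frozen low-fidelity prediction
    return high(torch.cat([xt, u_low], dim=1))

print(u_high(torch.rand(8, 2)).shape)            # torch.Size([8, 1])
```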
Facilitating Data Discovery for Large-scale Science Facilities using Knowledge Networks|
Y. Qin, I. Rodero, M. Parashar. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, pp. 651-660. 2021.
Large-scale multiuser scientific facilities, such as geographically distributed observatories, remote instruments, and experimental platforms, represent some of the largest national investments and can enable dramatic advances across many areas of science. Recent examples of such advances include the detection of gravitational waves and the imaging of a black hole’s event horizon. However, as the number of such facilities and their users grow, along with the complexity, diversity, and volumes of their data products, finding and accessing relevant data is becoming increasingly challenging, limiting the potential impact of facilities. These challenges are further amplified as scientists and application workflows increasingly try to integrate facilities’ data from diverse domains. In this paper, we leverage concepts underlying recommender systems, which are extremely effective in e-commerce, to address these data-discovery and data-access challenges for large-scale distributed scientific facilities. We first analyze data from facilities and identify and model user-query patterns in terms of facility location and spatial localities, domain-specific data models, and user associations. We then use this analysis to generate a knowledge graph and develop the collaborative knowledge-aware graph attention network (CKAT) recommendation model, which leverages graph neural networks (GNNs) to explicitly encode the collaborative signals through propagation and combine them with knowledge associations. Moreover, we integrate a knowledge-aware neural attention mechanism to enable the CKAT to pay more attention to key information while reducing irrelevant noise, thereby increasing the accuracy of the recommendations. We apply the proposed model on two real-world facility datasets and empirically demonstrate that the CKAT can effectively facilitate data discovery, significantly outperforming several compelling state-of-the-art baseline models.
Facilitating Staging-based Unstructured Mesh Processing to Support Hybrid In-Situ Workflows|
Z. Wang, P. Subedi, M. Dorier, P.E. Davis, M. Parashar. In 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 960-964. 2021.
In-situ and in-transit processing alleviate the gap between computing and I/O capabilities by scheduling data analytics close to the data source. Hybrid in-situ processing splits data analytics into two stages: the data processing that runs in-situ aims to extract regions of interest, which are then transferred to staging services for further in-transit analytics. To facilitate this type of hybrid in-situ processing, the data staging service needs to support complex intermediate data representations generated/consumed by the in-situ tasks. Unstructured (or irregular) meshes are one such derived data representation, typically used to bridge simulation data and analytics. However, how staging services can efficiently support unstructured mesh transfer and processing remains to be explored. This paper investigates design options for transferring and processing unstructured mesh data using staging services. Using polygonal mesh data as an example, we show that staging-based unstructured mesh processing can effectively support hybrid in-situ workflows and can significantly decrease data movement overheads.
Investigating In Situ Reduction via Lagrangian Representations for Cosmology and Seismology Applications|
S. Sane, C. R. Johnson, H. Childs. In Computational Science -- ICCS 2021, Springer International Publishing, pp. 436--450. 2021.
Although many types of computational simulations produce time-varying vector fields, subsequent analysis is often limited to single time slices due to excessive costs. Fortunately, a new approach using a Lagrangian representation can enable time-varying vector field analysis while mitigating these costs. With this approach, a Lagrangian representation is calculated while the simulation code is running, and the result is explored after the simulation. Importantly, the effectiveness of this approach varies based on the nature of the vector field, requiring in-depth investigation for each application area. With this study, we evaluate the effectiveness for previously unexplored cosmology and seismology applications. We do this by considering encumbrance (on the simulation) and accuracy (of the reconstructed result). To inform encumbrance, we integrated in situ infrastructure with two simulation codes, and evaluated on representative HPC environments, performing Lagrangian in situ reduction using GPUs as well as CPUs. To inform accuracy, our study conducted a statistical analysis across a range of spatiotemporal configurations as well as a qualitative evaluation. In all, we demonstrate effectiveness for both cosmology and seismology—time-varying vector fields from these domains can be reduced to less than 1% of the total data via Lagrangian representations, while maintaining accurate reconstruction and requiring under 10% of total execution time in over 80% of our experiments.
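The reduction itself can be pictured with a small sketch (an analytic velocity field stands in for simulation output; this is not the study's in situ infrastructure): seed particles are advected while "the simulation runs," and only seed/end pairs are stored rather than every time slice.

```python
# Sketch of computing a Lagrangian representation in situ (illustrative).
import numpy as np

def velocity(p, t):                       # stand-in for the simulation's field
    x, y = p[..., 0], p[..., 1]
    return np.stack([-y * (1 + 0.1 * t), x], axis=-1)

def rk4_step(p, t, dt):
    k1 = velocity(p, t)
    k2 = velocity(p + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = velocity(p + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = velocity(p + dt * k3, t + dt)
    return p + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

seeds = np.random.default_rng(1).uniform(-1, 1, (1000, 2))
p, dt = seeds.copy(), 0.01
for step in range(100):                   # one storage interval
    p = rk4_step(p, step * dt, dt)
basis_flows = np.stack([seeds, p])        # the reduced Lagrangian data
print(basis_flows.shape)                  # (2, 1000, 2): endpoints only,
                                          # not all 100 intermediate steps
```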
Structure-preserving Nonlinear Filtering for Continuous and Discontinuous Galerkin Spectral/hp Element Methods|
Subtitled arXiv preprint arXiv:2106.08316, V. Zala, R. M. Kirby, A. Narayan. 2021.
Finite element simulations have been used to solve various partial differential equations (PDEs) that model physical, chemical, and biological phenomena. The resulting discretized solutions to PDEs often do not satisfy requisite physical properties, such as positivity or monotonicity. Such invalid solutions pose both modeling challenges, since the physical interpretation of simulation results is not possible, and computational challenges, since such properties may be required to advance the scheme. We, therefore, consider the problem of computing solutions that preserve these structural solution properties, which we enforce as additional constraints on the solution. We consider in particular the class of convex constraints, which includes positivity and monotonicity. By embedding such constraints as a postprocessing convex optimization procedure, we can compute solutions that satisfy general types of convex constraints. For certain types of constraints (including positivity and monotonicity), the optimization is a filter, i.e., a norm-decreasing operation. We provide a variety of tests on one-dimensional time-dependent PDEs that demonstrate the method's efficacy, and we empirically show that rates of convergence are unaffected by the inclusion of the constraints.
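For the special case of positivity in a nodal basis, the filter reduces to entrywise clipping, which is exactly the Euclidean projection onto the constraint set and hence norm-decreasing (a minimal sketch; general convex constraints such as monotonicity require solving a small convex program, as in the paper).

```python
# Simplest instance of a structure-preserving filter (illustrative).
import numpy as np

def positivity_filter(u):
    """Project nodal values onto {u >= 0}; guarantees ||filtered|| <= ||u||."""
    return np.maximum(u, 0.0)

u = np.array([0.9, 0.4, -0.05, 0.2])   # overshoot from a high-order scheme
v = positivity_filter(u)
assert np.linalg.norm(v) <= np.linalg.norm(u)   # the filter is norm-decreasing
print(v)   # applied after each step, the scheme can advance with valid states
```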
Leveraging user access patterns and advanced cyberinfrastructure to accelerate data delivery from shared-use scientific observatories|
Y. Qin, I. Rodero, A. Simonet, C. Meertens, D. Reiner, J. Riley, M. Parashar. In Future Generation Computer Systems, North-Holland, pp. 14-27. 2021.
With the growing number and increasing availability of shared-use instruments and observatories, observational data is becoming an essential part of application workflows and a contributor to scientific discoveries in a range of disciplines. However, the corresponding growth in the number of users accessing these facilities, coupled with the expansion in the scale and variety of the data, is making it challenging for these facilities to ensure their data can be accessed, integrated, and analyzed in a timely manner, and is resulting in significant demands on their cyberinfrastructure (CI). In this paper, we present the design of a push-based data delivery framework that leverages emerging in-network capabilities, along with data pre-fetching techniques based on a hybrid data management model. Specifically, we analyze data access traces for two large-scale observatories, the Ocean Observatories Initiative (OOI) and the Geodetic Facility for the Advancement of Geoscience (GAGE), to identify typical user access patterns and to develop a model that can be used for data pre-fetching. Furthermore, we evaluate our data pre-fetching model and the proposed framework using a simulation of the Virtual Data Collaboratory (VDC) platform, which provides in-network data staging and processing capabilities. The results demonstrate the ability of the framework to significantly improve data delivery performance and reduce network traffic at the observatories’ facilities.
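A first-order Markov predictor trained on access traces illustrates the kind of pre-fetching model such a framework could employ (illustrative only, not the paper's model): transition counts between requested datasets drive which dataset to push toward the user next.

```python
# Sketch of a trace-driven pre-fetching model (hypothetical, illustrative).
from collections import Counter, defaultdict

class MarkovPrefetcher:
    def __init__(self):
        self.counts = defaultdict(Counter)   # transition counts between datasets
        self.last = None

    def observe(self, dataset):              # update from the live access trace
        if self.last is not None:
            self.counts[self.last][dataset] += 1
        self.last = dataset

    def predict(self):                       # dataset to push/stage next
        nxt = self.counts.get(self.last)
        return nxt.most_common(1)[0][0] if nxt else None

pf = MarkovPrefetcher()
for d in ["ctd/temp", "ctd/salinity", "ctd/temp", "ctd/salinity"]:
    pf.observe(d)
print(pf.predict())                          # -> "ctd/temp"
```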