V. Zala, R. M. Kirby, A. Narayan. Structure-preserving Nonlinear Filtering for Continuous and Discontinuous Galerkin Spectral/hp Element Methods, arXiv preprint arXiv:2106.08316, 2021.
Finite element simulations have been used to solve various partial differential equations (PDEs) that model physical, chemical, and biological phenomena. The resulting discretized solutions to PDEs often do not satisfy requisite physical properties, such as positivity or monotonicity. Such invalid solutions pose both modeling challenges, since the physical interpretation of simulation results is not possible, and computational challenges, since such properties may be required to advance the scheme. We, therefore, consider the problem of computing solutions that preserve these structural solution properties, which we enforce as additional constraints on the solution. We consider in particular the class of convex constraints, which includes positivity and monotonicity. By embedding such constraints as a postprocessing convex optimization procedure, we can compute solutions that satisfy general types of convex constraints. For certain types of constraints (including positivity and monotonicity), the optimization is a filter, i.e., a norm-decreasing operation. We provide a variety of tests on one-dimensional time-dependent PDEs that demonstrate the method's efficacy, and we empirically show that rates of convergence are unaffected by the inclusion of the constraints.
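To make the filtering claim concrete: for the simplest convex constraint, positivity, the Euclidean projection onto the nonnegative orthant is plain clipping, and because that orthant is a convex cone containing the origin, the projection can never increase the norm. The sketch below is a minimal illustration of this property only; it is not the authors' optimization procedure, which handles general convex constraints.

```python
import numpy as np

def positivity_filter(u):
    """Euclidean projection of nodal values u onto the nonnegative
    orthant {v : v >= 0}. Because the orthant is a convex cone
    containing the origin, the projection never increases the norm,
    i.e. it acts as a filter in the sense used above."""
    return np.maximum(u, 0.0)

# Nodal values of an oscillatory discretized solution that undershoots zero.
u = np.array([0.9, 0.4, -0.05, 0.2, -0.1, 0.7])
v = positivity_filter(u)

assert np.all(v >= 0.0)                        # constraint satisfied
assert np.linalg.norm(v) <= np.linalg.norm(u)  # norm-decreasing
```

General convex constraints (e.g., monotonicity) require an actual optimization solve rather than clipping; this example only demonstrates the norm-decreasing behavior for the positivity case.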
R. Zambre, D. Sahasrabudhe, H. Zhou, M. Berzins, A. Chandramowlishwaran, P. Balaji. Logically Parallel Communication for Fast MPI+Threads Communication, In IEEE Transactions on Parallel and Distributed Computing, April, 2021.
Supercomputing applications are increasingly adopting the MPI+threads programming model over the traditional “MPI everywhere” approach to better handle the disproportionate increase in the number of cores compared with other on-node resources. In practice, however, most applications observe slower performance with MPI+threads, primarily because of poor communication performance. Recent research efforts on MPI libraries address this bottleneck by mapping logically parallel communication, that is, operations that are not subject to MPI’s ordering constraints, to the underlying network parallelism. Domain scientists, however, typically do not expose such communication independence information because the existing MPI-3.1 standard’s semantics can be limiting. Researchers had initially proposed user-visible endpoints to combat this issue, but such a solution requires intrusive changes to the standard (new APIs). The upcoming MPI-4.0 standard, on the other hand, allows applications to relax unneeded semantics and provides them with many opportunities to express logical communication parallelism. In this paper, we show how MPI+threads applications can achieve high performance with logically parallel communication. Through application case studies, we compare the capabilities of the new MPI-4.0 standard with those of the existing one and user-visible endpoints (upper bound). Logical communication parallelism can boost the overall performance of an application by over 2x.
B. Zenger, W. W. Good, J. A. Bergquist, L. C. Rupp, M. Perez, G. J. Stoddard, V. Sharma, R. S. MacLeod. Transient recovery of epicardial and torso ST-segment ischemic signals during cardiac stress tests: A possible physiological mechanism, In Journal of Electrocardiology, Churchill Livingstone, 2021.
Background: Acute myocardial ischemia has several characteristic ECG findings, including clinically detectable ST-segment deviations. However, the sensitivity and specificity of diagnosis based on ST-segment changes are low. Furthermore, ST-segment deviations have been shown to be transient and spontaneously recover without any indication that the ischemic event has subsided.
Objective: Assess the transient recovery of ST-segment deviations on remote recording electrodes during a partial-occlusion cardiac stress test and compare them with intramyocardial ST-segment deviations.
Methods: We used a previously validated porcine experimental model of acute myocardial ischemia with controllable ischemic load and simultaneous electrical measurements within the heart wall, on the epicardial surface, and on the torso surface. Simulated cardiac stress tests were induced by occluding a coronary artery while simultaneously pacing rapidly or infusing dobutamine to stimulate cardiac function. Postexperimental imaging created anatomical models for data visualization and quantification. Markers of ischemia were identified as deviations in the potentials measured at 40% of the ST-segment. Intramural cardiac conduction speed was also determined using the inverse gradient method. We assessed changes in intramyocardial ischemic volume proportion, conduction speed, clinical presence of ischemia on remote recording arrays, and regional changes to intramyocardial ischemia. We defined the peak deviation response time as the time interval after onset of ischemia at which maximum ST-segment deviation was achieved, and the ST-recovery time as the interval when ST deviation returned below the threshold of ST elevation.
Results: In both epicardial and torso recordings, the peak ST-segment deviation response time was 4.9±1.1 min and the ST-recovery time was 7.9±2.5 min, both well before the termination of the ischemic stress. At peak response time, conduction speed was reduced by 50% and returned to near baseline at ST-recovery. The overall ischemic volume proportion initially increased, on average, to 37% at peak response time; however, it recovered only to 30% at the ST-recovery time. By contrast, the subepicardial region of the myocardial wall showed 40% ischemic volume at peak response time and recovered much more strongly to 25% as epicardial ST-segment deviations returned to baseline.
Conclusions: Our data show that remote ischemic signal recovery correlates with a recovery of the subepicardial myocardium, while subendocardial ischemic development persists.
L. Zhou, C. R. Johnson, D. Weiskopf. Data-Driven Space-Filling Curves, In IEEE Transactions on Visualization and Computer Graphics, Vol. 27, No. 2, IEEE, pp. 1591-1600. 2021.
We propose a data-driven space-filling curve method for 2D and 3D visualization. Our flexible curve traverses the data elements in the spatial domain in a way that the resulting linearization better preserves features in space compared to existing methods. We achieve such data coherency by calculating a Hamiltonian path that approximately minimizes an objective function that describes the similarity of data values and location coherency in a neighborhood. Our extended variant even supports multiscale data via quadtrees and octrees. Our method is useful in many areas of visualization, including multivariate or comparative visualization, ensemble visualization of 2D and 3D data on regular grids, or multiscale visual analysis of particle simulations. The effectiveness of our method is evaluated with numerical comparisons to existing techniques and through examples of ensemble and multivariate datasets.
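The objective described above combines value similarity with spatial coherence. As a rough, hypothetical illustration of that idea (a greedy nearest-neighbor heuristic, not the paper's Hamiltonian-path optimization), one can traverse a small 2D grid by always stepping to the unvisited cell that minimizes a value-difference-plus-distance cost:

```python
import numpy as np
from itertools import product

def greedy_data_driven_path(values, lam=0.5):
    """Greedy heuristic for a data-coherent traversal of a 2D grid:
    from the current cell, step to the unvisited cell minimizing
    |value difference| + lam * spatial distance. A toy stand-in for
    the Hamiltonian-path optimization described in the abstract."""
    n, m = values.shape
    cells = list(product(range(n), range(m)))
    path = [cells[0]]
    unvisited = set(cells[1:])
    while unvisited:
        ci, cj = path[-1]
        def cost(c):
            i, j = c
            return abs(values[i, j] - values[ci, cj]) + lam * np.hypot(i - ci, j - cj)
        nxt = min(unvisited, key=cost)
        unvisited.remove(nxt)
        path.append(nxt)
    return path

# A grid with a smooth left-to-right gradient; the traversal tends to
# follow regions of similar value rather than a fixed scan order.
vals = np.array([[0.0, 0.1, 0.9, 1.0],
                 [0.1, 0.2, 0.8, 0.9],
                 [0.2, 0.3, 0.7, 0.8],
                 [0.3, 0.4, 0.5, 0.6]])
path = greedy_data_driven_path(vals)
assert len(path) == 16 and len(set(path)) == 16  # visits every cell once
```

Unlike the paper's method, this greedy traversal does not constrain consecutive cells to be grid neighbors and offers no optimality guarantee; it only illustrates the data-coherency objective.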
Y. Zhou, N. Chalapathi, A. Rathore, Y. Zhao, B. Wang. Mapper Interactive: A Scalable, Extendable, and Interactive Toolbox for the Visual Exploration of High-Dimensional Data, In IEEE Pacific Visualization Symposium, 2021.
The mapper algorithm is a popular tool from topological data analysis for extracting topological summaries of high-dimensional datasets. In this paper, we present Mapper Interactive, a web-based framework for the interactive analysis and visualization of high-dimensional point cloud data. It implements the mapper algorithm in an interactive, scalable, and easily extendable way, thus supporting practical data analysis. In particular, its command-line API can compute mapper graphs for 1 million points of 256 dimensions in about 3 minutes (4 times faster than the vanilla implementation). Its visual interface allows on-the-fly computation and manipulation of the mapper graph based on user-specified parameters and supports the addition of new analysis modules with a few lines of code. Mapper Interactive makes the mapper algorithm accessible to nonspecialists and accelerates topological analytics workflows.
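For readers unfamiliar with the underlying construction, a bare-bones mapper can be sketched in a few lines: cover the range of a lens (filter) function with overlapping intervals, cluster the points in each interval's preimage, and connect clusters that share points. The parameters below are illustrative and unrelated to Mapper Interactive's implementation; on a sampled circle the result is the expected cycle graph.

```python
import numpy as np

def mapper_graph(points, lens, intervals, eps):
    """Bare-bones mapper: cover the lens range with overlapping
    intervals, single-linkage-cluster each interval's preimage at
    distance threshold eps, and connect clusters sharing points."""
    nodes = []  # each node is a frozenset of point indices
    for lo, hi in intervals:
        idx = [i for i in range(len(points)) if lo <= lens[i] <= hi]
        parent = {i: i for i in idx}           # union-find for clustering
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for a in idx:
            for b in idx:
                if a < b and np.linalg.norm(points[a] - points[b]) < eps:
                    parent[find(a)] = find(b)
        clusters = {}
        for i in idx:
            clusters.setdefault(find(i), set()).add(i)
        nodes.extend(frozenset(c) for c in clusters.values())
    edges = {(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    return nodes, edges

# Point cloud sampled from a circle; the mapper graph is a 6-node cycle.
t = np.linspace(0, 2 * np.pi, 40, endpoint=False)
pts = np.column_stack([np.cos(t), np.sin(t)])
lens = pts[:, 0]  # x-coordinate as the lens function
cover = [(-1.1, -0.4), (-0.6, 0.1), (-0.1, 0.6), (0.4, 1.1)]
nodes, edges = mapper_graph(pts, lens, cover, eps=0.3)
assert len(nodes) == 6 and len(edges) == 6
```

The quadratic-time clustering here is only for clarity; scalable implementations such as the one described in the paper rely on efficient clustering and data structures.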
T. M. Athawale, D. Maljovec, L. Yan, C. R. Johnson, V. Pascucci, B. Wang. Uncertainty Visualization of 2D Morse Complex Ensembles using Statistical Summary Maps, In IEEE Transactions on Visualization and Computer Graphics, 2020.
Morse complexes are gradient-based topological descriptors with close connections to Morse theory. They are widely applicable in scientific visualization as they serve as important abstractions for gaining insights into the topology of scalar fields. Noise inherent to scalar field data due to acquisitions and processing, however, limits our understanding of the Morse complexes as structural abstractions. We, therefore, explore uncertainty visualization of an ensemble of 2D Morse complexes that arise from scalar fields coupled with data uncertainty. We propose statistical summary maps as new entities for capturing structural variations and visualizing positional uncertainties of Morse complexes in ensembles. Specifically, we introduce two types of statistical summary maps -- the Probabilistic Map and the Survival Map -- to characterize the uncertain behaviors of local extrema and local gradient flows, respectively. We demonstrate the utility of our proposed approach using synthetic and real-world datasets.
H. Childs, S. D. Ahern, J. Ahrens, A. C. Bauer, J. Bennett, E. W. Bethel, P. Bremer, E. Brugger, J. Cottam, M. Dorier, S. Dutta, J. M. Favre, T. Fogal, S. Frey, C. Garth, B. Geveci, W. F. Godoy, C. D. Hansen, C. Harrison, B. Hentschel, J. Insley, C. R. Johnson, S. Klasky, A. Knoll, J. Kress, M. Larsen, J. Lofstead, K. Ma, P. Malakar, J. Meredith, K. Moreland, P. Navratil, P. O’Leary, M. Parashar, V. Pascucci, J. Patchett, T. Peterka, S. Petruzza, N. Podhorszki, D. Pugmire, M. Rasquin, S. Rizzi, D. H. Rogers, S. Sane, F. Sauer, R. Sisneros, H. Shen, W. Usher, R. Vickery, V. Vishwanath, I. Wald, R. Wang, G. H. Weber, B. Whitlock, M. Wolf, H. Yu, S. B. Ziegeler. A Terminology for In Situ Visualization and Analysis Systems, In International Journal of High Performance Computing Applications, Vol. 34, No. 6, pp. 676–691. 2020.
The term “in situ processing” has evolved over the last decade to mean both a specific strategy for visualizing and analyzing data and an umbrella term for a processing paradigm. The resulting confusion makes it difficult for visualization and analysis scientists to communicate with each other and with their stakeholders. To address this problem, a group of over fifty experts convened with the goal of standardizing terminology. This paper summarizes their findings and proposes a new terminology for describing in situ systems. An important finding from this group was that in situ systems are best described via multiple, distinct axes: integration type, proximity, access, division of execution, operation controls, and output type. This paper discusses these axes, evaluates existing systems within the axes, and explores how currently used terms relate to the axes.
L. Cinquini, S. Petruzza, J. J. Boutte, S. Ames, G. Abdulla, V. Balaji, R. Ferraro, A. Radhakrishnan, L. Carriere, T. Maxwell, G. Scorzelli, V. Pascucci. Distributed Resources for the Earth System Grid Advanced Management (DREAM), Final Report, 2020.
The DREAM project was funded more than 3 years ago to design and implement a next-generation ESGF (Earth System Grid Federation) architecture that would be suitable for managing and accessing data and services in a distributed and scalable environment. In particular, the project intended to focus on the computing and visualization capabilities of the stack, which at the time were rather primitive. At the beginning, the team had the general notion that a better ESGF architecture could be built by modularizing each component and redefining its interaction with other components through a well-defined, exposed API. Although this remained the high-level principle that guided the work, the DREAM project was able to accomplish its goals by leveraging new practices in IT that emerged just 3 or 4 years ago: the advent of containerization technologies (specifically, Docker), the development of frameworks to manage containers at scale (Docker Swarm and Kubernetes), and their application to the commercial Cloud. Thanks to these new technologies, DREAM was able to improve the ESGF architecture (including its computing and visualization services) to a level of deployability and scalability beyond the original expectations.
M. Han, I. Wald, W. Usher, N. Morrical, A. Knoll, V. Pascucci, C.R. Johnson. A virtual frame buffer abstraction for parallel rendering of large tiled display walls, In 2020 IEEE Visualization Conference (VIS), pp. 11--15. 2020.
We present dw2, a flexible and easy-to-use software infrastructure for interactive rendering of large tiled display walls. Our library represents the tiled display wall as a single virtual screen through a display "service", which renderers connect to and send image tiles to be displayed, either from an on-site or remote cluster. The display service can be easily configured to support a range of typical network and display hardware configurations; the client library provides a straightforward interface for easy integration into existing renderers. We evaluate the performance of our display wall service in different configurations using a CPU and GPU ray tracer, in both on-site and remote rendering scenarios using multiple display walls.
A. P. Janson, D. N. Anderson, C. R. Butson. Activation robustness with directional leads and multi-lead configurations in deep brain stimulation, In Journal of Neural Engineering, Vol. 17, No. 2, IOP Publishing, pp. 026012. March, 2020.
Objective: Clinical outcomes from deep brain stimulation (DBS) can be highly variable, and two critical factors underlying this variability are the location and type of stimulation. In this study we quantified how robustly DBS activates a target region when taking into account a range of different lead designs and realistic variations in placement. The objective of the study is to assess the likelihood of achieving target activation.
Approach: We performed finite element computational modeling and established a metric of performance robustness to evaluate the ability of directional and multi-lead configurations to activate target fiber pathways while taking into account location variability. A more robust lead configuration produces less variability in activation across all stimulation locations around the target.
Main results: Directional leads demonstrated higher overall performance robustness compared to axisymmetric leads, primarily 1-2 mm outside of the target. Multi-lead configurations demonstrated higher levels of robustness compared to any single lead due to distribution of electrodes in a broader region around the target.
Significance: Robustness measures can be used to evaluate the performance of existing DBS lead designs and aid in the development of novel lead designs to better accommodate known variability in lead location and orientation. This type of analysis may also be useful to understand how DBS clinical outcome variability is influenced by lead location among groups of patients.
C. R. Johnson, T. Kapur, W. Schroeder, T. Yoo. Remembering Bill Lorensen: The Man, the Myth, and Marching Cubes, In IEEE Computer Graphics and Applications, Vol. 40, No. 2, pp. 112-118. March, 2020.
K. A. Johnson, G. Duffley, D. Nesterovich Anderson, J. L. Ostrem, M. Welter, J. C. Baldermann, J. Kuhn, D. Huys, V. Visser-Vandewalle, T. Foltynie, L. Zrinzo, M. Hariz, A. F. G. Leentjens, A. Y. Mogilner, M. H. Pourfar, L. Almeida, A. Gunduz, K. D. Foote, M. S. Okun, C. R. Butson. Structural connectivity predicts clinical outcomes of deep brain stimulation for Tourette syndrome, In Brain, July, 2020.
Deep brain stimulation may be an effective therapy for select cases of severe, treatment-refractory Tourette syndrome; however, patient responses are variable, and there are no reliable methods to predict clinical outcomes. The objectives of this retrospective study were to identify the stimulation-dependent structural networks associated with improvements in tics and comorbid obsessive-compulsive behaviour, compare the networks across surgical targets, and determine if connectivity could be used to predict clinical outcomes. Volumes of tissue activated for a large multisite cohort of patients (n = 66) implanted bilaterally in globus pallidus internus (n = 34) or centromedial thalamus (n = 32) were used to generate probabilistic tractography to form a normative structural connectome. The tractography maps were used to identify networks that were correlated with improvement in tics or comorbid obsessive-compulsive behaviour and to predict clinical outcomes across the cohort. The correlated networks were then used to generate ‘reverse’ tractography to parcellate the total volume of stimulation across all patients to identify local regions to target or avoid. The results showed that for globus pallidus internus, connectivity to limbic networks, associative networks, caudate, thalamus, and cerebellum was positively correlated with improvement in tics; the model predicted clinical improvement scores (P = 0.003) and was robust to cross-validation. Regions near the anteromedial pallidum exhibited higher connectivity to the positively correlated networks than posteroventral pallidum, and volume of tissue activated overlap with this map was significantly correlated with tic improvement (P < 0.017). For centromedial thalamus, connectivity to sensorimotor networks, parietal-temporal-occipital networks, putamen, and cerebellum was positively correlated with tic improvement; the model predicted clinical improvement scores (P = 0.012) and was robust to cross-validation. 
Regions in the anterior/lateral centromedial thalamus exhibited higher connectivity to the positively correlated networks, but volume of tissue activated overlap with this map did not predict improvement (P > 0.23). For obsessive-compulsive behaviour, both targets showed that connectivity to the prefrontal cortex, orbitofrontal cortex, and cingulate cortex was positively correlated with improvement; however, only the centromedial thalamus maps predicted clinical outcomes across the cohort (P = 0.034), but the model was not robust to cross-validation. Collectively, the results demonstrate that the structural connectivity of the site of stimulation is likely important for mediating symptom improvement, and the networks involved in tic improvement may differ across surgical targets. These networks provide important insight into potential mechanisms and could be used to guide lead placement and stimulation parameter selection, as well as refine targets for neuromodulation therapies for Tourette syndrome.
B. Kundu, T. S. Davis, B. Philip, E. H. Smith, A. Arain, A. Peters, B. Newman, C. R. Butson, J. D. Rolston. A systematic exploration of parameters affecting evoked intracranial potentials in patients with epilepsy, In Brain Stimulation, Vol. 13, No. 5, pp. 1232-1244. 2020.
Brain activity is constrained by and evolves over a network of structural and functional connections. Corticocortical evoked potentials (CCEPs) have been used to measure this connectivity and to discern brain areas involved in both brain function and disease. However, how varying stimulation parameters influences the measured CCEP across brain areas has not been well characterized.
To better understand the factors that influence the amplitude of the CCEPs as well as evoked gamma-band power (70–150 Hz) resulting from single-pulse stimulation via cortical surface and depth electrodes.
CCEPs from 4370 stimulation-response channel pairs were recorded across a range of stimulation parameters and brain regions in 11 patients undergoing long-term monitoring for epilepsy. A generalized mixed-effects model was used to model cortical response amplitudes from 5 to 100 ms post-stimulation.
Stimulation levels <5.5 mA generated variable CCEPs with low amplitude and reduced spatial spread. Stimulation at ≥5.5 mA yielded a reliable and maximal CCEP across stimulation-response pairs over all regions. These findings were similar when examining the evoked gamma-band power. The amplitude of both measures was inversely correlated with distance. CCEPs and evoked gamma power were largest when measured in the hippocampus compared with other areas. Larger CCEP size and evoked gamma power were measured within the seizure onset zone compared with outside this zone.
These results will help guide future stimulation protocols directed at quantifying network connectivity across cognitive and disease states.
C. Ly, C. Vachet, I. Schwerdt, E. Abbott, A. Brenkmann, L.W. McDonald, T. Tasdizen. Determining uranium ore concentrates and their calcination products via image classification of multiple magnifications, In Journal of Nuclear Materials, 2020.
Many tools, such as mass spectrometry, X-ray diffraction, X-ray fluorescence, ion chromatography, etc., are currently available to scientists investigating interdicted nuclear material. These tools provide an analysis of physical, chemical, or isotopic characteristics of the seized material to identify its origin. In this study, a novel technique that characterizes physical attributes is proposed to provide insight into the processing route of unknown uranium ore concentrates (UOCs) and their calcination products. In particular, this study focuses on the characteristics of the surface structure captured in scanning electron microscopy (SEM) images at different magnification levels. Twelve common commercial processing routes of UOCs and their calcination products are investigated. Multiple-input single-output (MISO) convolution neural networks (CNNs) are implemented to differentiate the processing routes. The proposed technique can determine the processing route of a given sample in under a second running on a graphics processing unit (GPU) with an accuracy of more than 95%. The accuracy and speed of this proposed technique enable nuclear scientists to provide the preliminary identification results of interdicted material in a short time period. Furthermore, this proposed technique uses a predetermined set of magnifications, which in turn eliminates the human bias in selecting the magnification during the image acquisition process.
S. Zellmann, M. Aumüller, N. Marshak, I. Wald. High-Quality Rendering of Glyphs Using Hardware-Accelerated Ray Tracing, In Eurographics Symposium on Parallel Graphics and Visualization (EGPGV), The Eurographics Association, 2020.
Glyph rendering is an important scientific visualization technique for 3D, time-varying simulation data and for higher-dimensional data in general. Though conceptually simple, there are several different challenges when realizing glyph rendering on top of triangle rasterization APIs, such as possibly prohibitive polygon counts, limitations of what shapes can be used for the glyphs, issues with visual clutter, etc. In this paper, we investigate the use of hardware ray tracing for high-quality, high-performance glyph rendering, and show that this not only leads to a more flexible and often more elegant solution for dealing with number and shape of glyphs, but that this can also help address visual clutter, and even provide additional visual cues that can enhance understanding of the dataset.
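The flexibility argument above rests on the fact that ray tracing can intersect glyphs analytically instead of tessellating them into triangles. As a generic sketch (not code from the paper), the core of a sphere-glyph intersector is a single quadratic:

```python
import math

def ray_sphere_hit(origin, direction, center, radius):
    """Analytic ray-sphere intersection: the core of rendering sphere
    glyphs with ray tracing instead of triangle tessellation. Returns
    the distance t to the nearest hit in front of the ray origin, or
    None (ignores the case of an origin inside the glyph)."""
    ox, oy, oz = (origin[i] - center[i] for i in range(3))
    dx, dy, dz = direction
    a = dx * dx + dy * dy + dz * dz
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None                      # ray misses the glyph
    t = (-b - math.sqrt(disc)) / (2.0 * a)
    return t if t > 0.0 else None

# A ray down the -z axis toward a unit-radius glyph at the origin.
t = ray_sphere_hit((0, 0, 5), (0, 0, -1), (0, 0, 0), 1.0)
assert abs(t - 4.0) < 1e-9
assert ray_sphere_hit((0, 3, 5), (0, 0, -1), (0, 0, 0), 1.0) is None
```

Because each glyph costs one such test rather than hundreds of triangles, polygon-count limits disappear, and other glyph shapes only require substituting a different analytic (or iterative) intersector.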
T. A. J. Ouermi, R. M. Kirby, M. Berzins. Numerical Testing of a New Positivity-Preserving Interpolation Algorithm, arXiv preprint, 2020.
An important component of a number of computational modeling algorithms is an interpolation method that preserves the positivity of the function being interpolated. This report describes the numerical testing of a new positivity-preserving algorithm that is designed to be used when interpolating from a solution defined on one grid to a different spatial grid. The motivating application is a numerical weather prediction (NWP) code that uses spectral elements as the discretization choice for its dynamics core and Cartesian product meshes for the evaluation of its physics routines. This combination of spectral elements, which use nonuniformly spaced quadrature/collocation points, and uniformly spaced Cartesian meshes, together with the desire to maintain positivity when moving between these grids, motivates our work. This new approach is evaluated against several typical algorithms in use on a range of test problems in one or more space dimensions. The results obtained show that the new method is competitive in terms of observed accuracy while at the same time preserving the underlying positivity of the functions being interpolated.
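The failure mode that motivates positivity-preserving interpolation is easy to reproduce. The sketch below (illustrative only; the report's algorithm is more sophisticated than simple linear interpolation) shows a high-order global interpolant undershooting zero on positive data with a sharp front, while a piecewise-linear interpolant, whose outputs are convex combinations of positive samples, stays positive:

```python
import numpy as np

def lagrange_eval(xs, fs, x):
    """Evaluate the global Lagrange interpolating polynomial through
    the points (xs, fs) at location x."""
    total = 0.0
    for j in range(len(xs)):
        w = 1.0
        for k in range(len(xs)):
            if k != j:
                w *= (x - xs[k]) / (xs[j] - xs[k])
        total += fs[j] * w
    return total

# Strictly positive samples with a sharp front, on nonuniform
# (Chebyshev-like) nodes of the kind spectral elements use.
x_src = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, 9)))
f_src = np.where(x_src < 0.4, 1e-3, 1.0)

x_dst = np.linspace(0.0, 1.0, 101)   # uniform target grid

# The degree-8 global interpolant oscillates near the front and
# undershoots below zero (a Gibbs-type artifact)...
f_poly = np.array([lagrange_eval(x_src, f_src, x) for x in x_dst])

# ...whereas piecewise-linear interpolation preserves positivity,
# since each output is a convex combination of two positive samples.
f_lin = np.interp(x_dst, x_src, f_src)

assert f_poly.min() < 0.0   # high-order interpolant violates positivity
assert f_lin.min() > 0.0    # linear interpolant preserves it
```

Linear interpolation preserves positivity at the cost of accuracy; the point of the report is an algorithm that retains high-order accuracy while still guaranteeing positivity.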
V. Pascucci, I. Altintas, J. Fortes, I. Foster, H. Gu, S. Hariri, D. Stanzione, M. Taufer, X. Zhao. Report from the NSF Workshop on Smart Cyberinfrastructure 2020, NSF, 2020.
Machine learning and other Artificial Intelligence technologies (collectively referred to below as AI) used within a modern, smart cyberinfrastructure have become critical new avenues for discovery and validation in data-driven science and engineering disciplines of all kinds. We can expect many landmark discoveries and new lines of productive research to be enabled through AI analysis of the rapidly growing treasure trove of scientific data. AI-based techniques have been applied in many fields of science and engineering, including remote sensing, cosmology, energy, cancer research, IT systems management, and machine design and control, but the lack of proper integration with the current NSF-supported cyberinfrastructure is limiting their potential. Recent events due to the COVID-19 pandemic have highlighted how cyberinfrastructure is a crucial enabler of modern research, with massive simulations and data management capabilities [8-10], but these events have also emphasized how the lack of proper integration with AI technology remains a major limiting factor for the advancement of science and engineering, especially when any kind of rapid response is needed.
S. P. Ponnapalli, M. W. Bradley, K. Devine, J. Bowen, S. E. Coppens, K. M. Leraas, B. A. Milash, F. Li, H. Luo, S. Qiu, K. Wu, H. Yang, C. T. Wittwer, C. A. Palmer, R. L. Jensen, J. M. Gastier-Foster, H. A. Hanson, J. S. Barnholtz-Sloan, O. Alter. Retrospective clinical trial experimentally validates glioblastoma genome-wide pattern of DNA copy-number alterations predictor of survival, In Applied Physics Letters (APL) Bioengineering, Vol. 4, No. 2, May, 2020.
Modeling of genomic profiles from the Cancer Genome Atlas (TCGA) by using recently developed mathematical frameworks has associated a genome-wide pattern of DNA copy-number alterations with a shorter, roughly one-year, median survival time in glioblastoma (GBM) patients. Here, to experimentally test this relationship, we whole-genome sequenced DNA from tumor samples of patients. We show that the patients represent the U.S. adult GBM population in terms of most normal and disease phenotypes. Intratumor heterogeneity affects ≈11% and profiling technology and reference human genome specifics affect <1% of the classifications of the tumors by the pattern, where experimental batch effects normally reduce the reproducibility, i.e., precision, of classifications based upon between one to a few hundred genomic loci by >30%. With a 2.25-year Kaplan–Meier median survival difference, a univariate Cox hazard ratio of 3.5, and a concordance index, i.e., accuracy, of 0.78, the pattern predicts survival better than and independent of age at diagnosis, which has been the best indicator since 1950. The prognostic classification by the pattern may, therefore, help to manage GBM pseudoprogression. The diagnostic classification may help drugs progress to regulatory approval. The therapeutic predictions, of previously unrecognized targets that are correlated with survival, may lead to new drugs. Other methods missed this relationship in the roughly 3B-nucleotide genomes of the small patient cohorts (on the order of 100 patients), e.g., from TCGA. Previous attempts to associate GBM genotypes with patient phenotypes were unsuccessful. This is a proof of principle that the frameworks are uniquely suitable for discovering clinically actionable genotype–phenotype relationships.
D. Sahasrabudhe, M. Berzins. Improving Performance of the Hypre Iterative Solver for Uintah Combustion Codes on Manycore Architectures Using MPI Endpoints and Kernel Consolidation, In Computational Science -- ICCS 2020, 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part I, Springer International Publishing, pp. 175--190. 2020.
The solution of large-scale combustion problems with codes such as the Arches component of Uintah on next-generation computer architectures requires a manycore threaded approach and/or GPUs to achieve performance. Such codes often use a low-Mach-number approximation that requires the iterative solution of a large system of linear equations at every time step. While the discretization routines in such a code can be improved by using, say, OpenMP or CUDA approaches, it is important that the linear solver perform well too. For Uintah, the Hypre iterative solver has proved to solve such systems in a scalable way. The use of Hypre with OpenMP, however, leads to at least 2x slowdowns due to OpenMP overheads. This behavior is analyzed, and a solution based on the MPI Endpoints approach is implemented within Hypre, where each team of threads acts as a different MPI rank. This approach minimized OpenMP synchronization overhead, avoided slowdowns, performed as fast as or (up to 1.5x) faster than Hypre’s MPI-only version, and allowed the rest of Uintah to be optimized using OpenMP. Profiling of the GPU version of Hypre showed the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using CUDA-aware MPI. An overall speedup of 1.26x to 1.44x was observed compared to the baseline GPU implementation.
D. Sahasrabudhe, R. Zambre, A. Chandramowlishwaran, M. Berzins. Optimizing the Hypre solver for manycore and GPU architectures, In Journal of Computational Science, Springer International Publishing, pp. 101279. 2020.
The solution of large-scale combustion problems with codes such as Uintah on modern computer architectures requires the use of multithreading and GPUs to achieve performance. Uintah uses a low-Mach-number approximation that requires iteratively solving a large system of linear equations. The Hypre iterative solver has solved such systems in a scalable way for Uintah, but the use of OpenMP with Hypre leads to at least 2x slowdown due to OpenMP overheads. The proposed solution uses MPI Endpoints within Hypre, where each team of threads acts as a different MPI rank. This approach minimizes OpenMP synchronization overhead, performs as fast as or (up to 1.44x) faster than Hypre’s MPI-only version, and allows the rest of Uintah to be optimized using OpenMP. Profiling of the GPU version of Hypre shows the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using CUDA-aware MPI, resulting in an overall speedup of 1.16–1.44x compared to the baseline GPU implementation.
The above optimization strategies were published at the International Conference on Computational Science 2020. This work extends the previously published research by carrying out a second phase of communication-centered optimizations in Hypre to improve its scalability on large-scale supercomputers. This includes an efficient non-blocking inter-thread communication scheme, communication-reducing patch assignment, and expression of logical communication parallelism to a new version of the MPICH library that utilizes the underlying network parallelism. These optimizations avoid communication bottlenecks previously observed during strong scaling and improve performance by up to 2x on 256 nodes of Intel Knights Landing processors.