Proceedings of VI International Conference on Particle-based Methods – Fundamentals and Applications, Barcelona, Edited by E. O ̃ nate, M. Bischoff, D.R.J. Owen, P. Wriggers & T. Zohdi, PARTICLES 2019, pp. 555-566. October, 2019.
The success of the Material Point Method (MPM) in solving many challenging problems nevertheless raises some open questions regarding the fundamental properties of the method such as the energy conservation since being addressed by Bardenhagen and by Love and Sulsky. Similarly while low order symplectic time integration techniques are used with MPM, higher order methods have not been used. For this reason the Stormer Verlet method, a popular and widely-used symplectic method is applied to MPM. Both the time integration error and the energy conservation properties of this method applied to MPM are considered. The method is shown to have locally third order accuracy of energy conservation in time. This is in contrast to the locally second order accuracy in energy conservation of the methods that are used in many MPM calculations. This third accuracy accuracy is demonstrated both locally and globally on a standard MPM test example.
R. Bhalodia, S. Y. Elhabian, L. Kavan, R. T. Whitaker. A Cooperative Autoencoder for Population-Based Regularization of CNN Image Registration, In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, In Medical Image Computing and Computer Assisted Intervention -- MICCAI 2019, Springer International Publishing, pp. 391--400. 2019.
Spatial transformations are enablers in a variety of medical image analysis applications that entail aligning images to a common coordinate systems. Population analysis of such transformations is expected to capture the underlying image and shape variations, and hence these transformations are required to produce anatomically feasible correspondences. This is usually enforced through some smoothness-based generic metric or regularization of the deformation field. Alternatively, population-based regularization has been shown to produce anatomically accurate correspondences in cases where anatomically unaware (i.e., data independent) regularization fail. Recently, deep networks have been used to generate spatial transformations in an unsupervised manner, and, once trained, these networks are computationally faster and as accurate as conventional, optimization-based registration methods. However, the deformation fields produced by these networks require smoothness penalties, just as the conventional registration methods, and ignores population-level statistics of the transformations. Here, we propose a novel neural network architecture that simultaneously learns and uses the population-level statistics of the spatial transformations to regularize the neural networks for unsupervised image registration. This regularization is in the form of a bottleneck autoencoder, which learns and adapts to the population of transformations required to align input images by encoding the transformations to a low dimensional manifold. The proposed architecture produces deformation fields that describe the population-level features and associated correspondences in an anatomically relevant manner and are statistically compact relative to the state-of-the-art approaches while maintaining computational efficiency. We demonstrate the efficacy of the proposed architecture on synthetic data sets, as well as 2D and 3D medical data.
Computational models are a popular tool for predicting the effects of deep brain stimulation (DBS) on neural tissue. One commonly used model, the volume of tissue activated (VTA), is computed using multiple methodologies. We quantified differences in the VTAs generated by five methodologies: the traditional axon model method, the electric field norm, and three activating function based approaches - the activating function at each grid point in the tangential direction (AF-Tan) or in the maximally activating direction (AF-3D), and the maximum activating function along the entire length of a tangential fiber (AF-Max).
Approach: We computed the VTA using each method across multiple stimulation settings. The resulting volumes were compared for similarity, and the methodologies were analyzed for their differences in behavior.
Main Results: Activation threshold values for both the electric field norm and the activating function vary with regards to electrode configuration, pulse width, and frequency. All methods produced highly similar volumes for monopolar stimulation. For bipolar electrode configurations, only the maximum activating function along the tangential axon method, AF-Max, produced similar volumes to those produced by the axon model method. Further analysis revealed that both of these methods are biased by their exclusive use of tangential fiber orientations. In contrast, the activating function in the maximally activating direction method, AF-3D, produces a VTA that is free of axon orientation and projection bias.
Significance: Simulating tangentially oriented axons, the standard approach of computing the VTA, is too computationally expensive for widespread implementation and yields results biased by the assumption of tangential fiber orientation. In this work, we show that a computationally efficient method based on the activating function, AF-Max, reliably reproduces the VTAs generated by direct axon modeling. Further, we propose another method, AF-3D as a potentially superior model for representing generic neural tissue activation.
Topological techniques have proven to be a powerful tool in the analysis and visualization of large-scale scientific data. In particular, the Morse-Smale complex and its various components provide a rich framework for robust feature definition and computation. Consequently, there now exist a number of approaches to compute Morse-Smale complexes for large-scale data in parallel. However, existing techniques are based on discrete concepts which produce the correct topological structure but are known to introduce grid artifacts in the resulting geometry. Here, we present a new approach that combines parallel streamline computation with combinatorial methods to construct a high-quality discrete Morse-Smale complex. In addition to being invariant to the orientation of the underlying grid, this algorithm allows users to selectively build a subset of features using high-quality geometry. In particular, a user may specifically select which ascending/descending manifolds are reconstructed with improved accuracy, focusing computational effort where it matters for subsequent analysis. This approach computes Morse-Smale complexes for larger data than previously feasible with significant speedups. We demonstrate and validate our approach using several examples from a variety of different scientific domains, and evaluate the performance of our method.
M. Han, I. Wald, W. Usher, Q. Wu, F. Wang, V. Pascicci, C. D. Hansen, C. R. Johnson. Ray Tracing Generalized Tube Primitives: Method and Applications, In Computer Graphics Forum, Vol. 38, No. 3, John Wiley & Sons Ltd., 2019.
We present a general high-performance technique for ray tracing generalized tube primitives. Our technique efficiently supports tube primitives with fixed and varying radii, general acyclic graph structures with bifurcations, and correct transparency with interior surface removal. Such tube primitives are widely used in scientific visualization to represent diffusion tensor imaging tractographies, neuron morphologies, and scalar or vector fields of 3D flow. We implement our approach within the OSPRay ray tracing framework, and evaluate it on a range of interactive visualization use cases of fixed- and varying-radius streamlines, pathlines, complex neuron morphologies, and brain tractographies. Our proposed approach provides interactive, high-quality rendering, with low memory overhead.
Quantifying Impurity Effects on the Surface Morphology of α-U3O8, In Analytical Chemistry, 2019.
The morphological effect of impurities on α-U3O8 has been investigated. This study provides the first evidence that the presence of impurities can alter nuclear material morphology, and these changes can be quantified to aid in revealing processing history. Four elements: Ca, Mg, V, and Zr were implemented in the uranyl peroxide synthesis route and studied individually within the α-U3O8. Six total replicates were synthesized, and replicates 1–3 were filtered and washed with Millipore water (18.2 MΩ) to remove any residual nitrates. Replicates 4–6 were filtered but not washed to determine the amount of impurities removed during washing. Inductively coupled plasma mass spectrometry (ICP-MS) was employed at key points during the synthesis to quantify incorporation of the impurity. Each sample was characterized using powder X-ray diffraction (p-XRD), high-resolution scanning electron microscopy (HRSEM), and SEM with energy dispersive X-ray spectroscopy (SEM-EDS). p-XRD was utilized to evaluate any crystallographic changes due to the impurities; HRSEM imagery was analyzed with Morphological Analysis for MAterials (MAMA) software and machine learning classification for quantification of the morphology; and SEM-EDS was utilized to locate the impurity within the α-U3O8. All samples were found to be quantifiably distinguishable, further demonstrating the utility of quantitative morphology as a signature for the processing history of nuclear material.
In the present study, surface morphological differences of mixtures of triuranium octoxide (U3O8), synthesized from uranyl peroxide (UO4) and ammonium diuranate (ADU), were investigated. The purity of each sample was verified using powder X-ray diffractometry (p-XRD), and scanning electron microscopy (SEM) images were collected to identify unique morphological features. The U3O8 from ADU and UO4 was found to be unique. Qualitatively, both particles have similar features being primarily circular in shape. Using the morphological analysis of materials (MAMA) software, particle shape and size were quantified. UO4 was found to produce U3O8 particles three times the area of those produced from ADU. With the starting morphologies quantified, U3O8 samples from ADU and UO4 were physically mixed in known quantities. SEM images were collected of the mixed samples, and the MAMA software was used to quantify particle attributes. As U3O8 particles from ADU were unique from UO4, the composition of the mixtures could be quantified using SEM imaging coupled with particle analysis. This provides a novel means of quantifying processing histories of mixtures of uranium oxides. Machine learning was also used to help further quantify characteristics in the image database through direct classification and particle segmentation using deep learning techniques based on Convolutional Neural Networks (CNN). It demonstrates that these techniques can distinguish the mixtures with high accuracy as well as showing significant differences in morphology between the mixtures. Results from this study demonstrate the power of quantitative morphological analysis for determining the processing history of nuclear materials.
There currently exist two dominant strategies to reduce data sizes in analysis and visualization: reducing the precision of the data, e.g., through quantization, or reducing its resolution, e.g., by subsampling. Both have advantages and disadvantages and both face fundamental limits at which the reduced information ceases to be useful. The paper explores the additional gains that could be achieved by combining both strategies. In particular, we present a common framework that allows us to study the trade-off in reducing precision and/or resolution in a principled manner. We represent data reduction schemes as progressive streams of bits and study how various bit orderings such as by resolution, by precision, etc., impact the resulting approximation error across a variety of data sets as well as analysis tasks. Furthermore, we compute streams that are optimized for different tasks to serve as lower bounds on the achievable error. Scientific data management systems can use the results presented in this paper as guidance on how to store and stream data to make efficient use of the limited storage and bandwidth in practice.
J. K. Holmen, B. Peterson, A. Humphrey, D. Sunderland, O. H. Diaz-Ibarra, J. N. Thornock, M. Berzins. Portably Improving Uintah's Readiness for Exascale Systems Through the Use of Kokkos, SCI Institute, 2019.
Uncertainty and diversity in future HPC systems, including those for exascale, makes portable codebases desirable. To ease future ports, the Uintah Computational Framework has adopted the Kokkos C++ Performance Portability Library. This paper describes infrastructure advancements and performance improvements using partitioning functionality recently added to Kokkos within Uintah's MPI+Kokkos hybrid parallelism approach. Results are presented for two challenging calculations that have been refactored to support Kokkos::OpenMP and Kokkos::Cuda back-ends. These results demonstrate performance improvements up to (i) 2.66x when refactoring for portability, (ii) 81.59x when adding loop-level parallelism via Kokkos back-ends, and (iii) 2.63x when more eciently using a node. Good strong-scaling characteristics to 442,368 threads across 1728 Knights Landing processors are also shown. These improvements have been achieved with little added overhead (sub-millisecond, consuming up to 0.18% of per-timestep time). Kokkos adoption and refactoring lessons are also discussed.
J. K. Holmen, B. Peterson, M. Berzins. An Approach for Indirectly Adopting a Performance Portability Layer in Large Legacy Codes, In 2nd International Workshop on Performance, Portability, and Productivity in HPC (P3HPC), In conjunction with SC19, 2019.
Diversity among supported architectures in current and emerging high performance computing systems, including those for exascale, makes portable codebases desirable. Portability of a codebase can be improved using a performance portability layer to provide access to multiple underlying programming models through a single interface. Direct adoption of a performance portability layer, however, poses challenges for large pre-existing software frameworks that may need to preserve legacy code and/or adopt other programming models in the future. This paper describes an approach for indirect adoption that introduces a framework-specific portability layer between the application developer and the adopted performance portability layer to help improve legacy code support and long-term portability for future architectures and programming models. This intermediate layer uses loop-level, application-level, and build-level components to ease adoption of a performance portability layer in large legacy codebases. Results are shown for two challenging case studies using this approach to make portable use of OpenMP and CUDA via Kokkos in an asynchronous many-task runtime system, Uintah. These results show performance improvements up to 2.7x when refactoring for portability and 2.6x when more efficiently using a node. Good strong-scaling to 442,368 threads across 1,728 Knights Landing processors are also shown using MPI+Kokkos at scale.
Alan Humphrey. Scalable Asynchronous Many-Task Runtime Solutions to Globally Coupled Problems, School of Computing, University of Utah, 2019.
Thermal radiation is an important physical process and a key mechanism in a class of challenging engineering and research problems. The principal exascale-candidate application motivating this research is a large eddy simulation (LES) aimed at predicting the performance of a commercial, 1200 MWe ultra-super critical (USC) coal boiler, with radiation as the dominant mode of heat transfer. Scalable modeling of radiation is currently one of the most challenging problems in large-scale simulations, due to the global, all-to-all physical and resulting computational connectivity. Fundamentally, radiation models impose global data dependencies, requiring each compute node in a distributed memory system to send data to, and receive data from, potentially every other node. This process can be prohibitively expensive on large distributed memory systems due to pervasive all-to-all message passing interface (MPI) communication. Correctness is also difficult to achieve when coordinating global communication of this kind. Asynchronous many-task (AMT) runtime systems are a possible leading alternative to mitigate programming challenges at the runtime system-level, sheltering the application developer from the complexities introduced by future architectures. However, large-scale parallel applications with complex global data dependencies, such as in radiation modeling, pose significant scalability challenges themselves, even for a highly tuned AMT runtime. The principal aims of this research are to demonstrate how the Uintah AMT runtime can be adapted, making it possible for complex multiphysics applications with radiation to scale on current petascale and emerging exascale architectures. For Uintah, which uses a directed acyclic graph to represent the computation and associated data dependencies, these aims are achieved through: 1) the use of an AMT runtime; 2) adapting and leveraging Uintah’s adaptive mesh refinement support to dramatically reduce computation, communication volume, and nodal memory footprint for radiation calculations; and 3) automating the all-to-all communication at the runtime level through a task graph dependency analysis phase designed to efficiently manage data dependencies inherent in globally coupled problems.
A. Humphrey, M. Berzins.
An Evaluation of An Asynchronous Task Based Dataflow Approach For Uintah, In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Vol. 2, pp. 652-657. July, 2019.
The challenge of running complex physics code on the largest computers available has led to dataflow paradigms being explored. While such approaches are often applied at smaller scales, the challenge of extreme-scale data flow computing remains. The Uintah dataflow framework has consistently used dataflow computing at the largest scales on complex physics applications. At present Uintah contains two main dataflow models. Both are based upon asynchronous communication. One uses a static graph-based approach with asynchronous communication and the other uses a more dynamic approach that was introduced almost a decade ago. Subsequent changes within the Uintah runtime system combined with many more large scale experiments, has necessitated a reevaluation of these two approaches, comparing them in the context of large scale problems. While the static approach has worked well for some large-scale simulations, the dynamic approach is seen to offer performance improvements over the static case for a challenging fluid-structure interaction problem at large scale that involves fluid flow and a moving solid represented using particle method on an adaptive mesh.
Deep brain stimulation (DBS) can be an effective therapy for tics and comorbidities in select cases of severe, treatment-refractory Tourette syndrome (TS). Clinical responses remain variable across patients, which may be attributed to differences in the location of the neuroanatomical regions being stimulated. We evaluated active contact locations and regions of stimulation across a large cohort of patients with TS in an effort to guide future targeting.
Adversarial regression training for visualizing the progression of chronic obstructive pulmonary disease with chest x-rays, In Arxiv, In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2019.R.B. Lanfredi, J.D. Schroeder, C. Vachet, T. Tasdizen.
Knowledge of what spatial elements of medical images deep learning methods use as evidence is important for model interpretability, trustiness, and validation. There is a lack of such techniques for models in regression tasks. We propose a method, called visualization for regression with a generative adversarial network (VR-GAN), for formulating adversarial training specifically for datasets containing regression target values characterizing disease severity. We use a conditional generative adversarial network where the generator attempts to learn to shift the output of a regressor through creating disease effect maps that are added to the original images. Meanwhile, the regressor is trained to predict the original regression value for the modified images. A model trained with this technique learns to provide visualization for how the image would appear at different stages of the disease. We analyze our method in a dataset of chest x-rays associated with pulmonary function tests, used for diagnosing chronic obstructive pulmonary disease (COPD). For validation, we compute the difference of two registered x-rays of the same patient at different time points and correlate it to the generated disease effect map. The proposed method outperforms a technique based on classification and provides realistic-looking images, making modifications to images following what radiologists usually observe for this disease. Implementation code is available athttps://github.com/ricbl/vrgan.
B. Peterson. Portable and Performant GPU/Heterogeneous Asynchronous Many-task Runtime System, Subtitled Ph.D. Dissertation, University of Utah, School of Computing, Dec, 2019.
Asynchronous many-task (AMT) runtimes are maturing as a model for computing simulations on a diverse range of architectures at large-scale. The Uintah AMT framework is driven by a philosophy of maintaining an application layer distinct from the underlying runtime while operating on an adaptive mesh grid. This model has enabled task devel-opers to focus on writing task code while minimizing their interaction with MPI transfers, halo processing, data stores, coherency of simulation variables, and proper ordering of task execution. Further, Uintah is implementing an architecture portable solution by utilizing the Kokkos programming portability layer so that application tasks can be written in one codebase and performantly executed on CPUs, GPUs, Intel Xeon Phis, and other future architectures.
Of these architectures, it is perhaps Nvidia GPUs that introduce the greatest usability and portability challenges for AMT runtimes. Specifically, Nvidia GPUs require code to adhere to a proprietary programming model, use separate high capacity memory, utilize asynchrony of data movement and execution, and partition execution units among many streaming multiprocessors. Numerous novel solutions to both Uintah and Kokkos are required to abstract these GPU features into an AMT runtime while preserving an appli-cation layer and enabling portability.
The focus of this AMT research is largely split into two main parts, performance and portability. Runtime performance comes from 1) minimizing runtime overhead when preparing simulation variables for tasks prior to execution, and 2) executing a hetero-geneous mixture of tasks to keep compute node processing units busy. Preparation of simulation variables, especially halo processing, receives significant emphasis as Uintah’s target problems heavily rely on local and global halos. In addition, this work covers automated data movement of simulation variables between host and GPU memory as well as distributing tasks throughout a GPU for execution.
Portability is a productivity necessity as application developers struggle to maintain three sets of code per task, namely code for single CPU core execution, CUDA code for GPU tasks, and a third set of code for Xeon Phi parallel execution. Programming portability layers, such as Kokkos, provide a framework for this portability, however, Kokkos itself requires modifications to support GPU execution of finer grained tasks typical of AMT runtimes like Uintah. Currently, Kokkos GPU parallel loop execution is bulk-synchronous. This research demonstrates a model for portable loops that is asynchronous, nonblocking, and performant. Additionally, integrating GPU portability into Uintah required additional modifications to aid the application developer in avoiding Kokkos specific details.
This research concludes by demonstrating a GPU-enabled AMT runtime that is both performant and portable. Further, application developers are not burdened with additional architecture specific requirements. Results are demonstrated using production task codebases written for CPUs, GPUs, and Kokkos portability and executed in GPU homogeneous and CPU/GPU heterogeneous environments.
Personalized Virtual-heart Technology for Guiding the Ablation of Infarct-related Ventricular Tachycardia, In Nature Biomedical Engineering, Vol. 2, pp. 732–740. 2019.
Ventricular tachycardia (VT), which can lead to sudden cardiac death, occurs frequently in patients with myocardial infarction. Catheter-based radio-frequency ablation of cardiac tissue has achieved only modest efficacy, owing to the inaccurate identification of ablation targets by current electrical mapping techniques, which can lead to extensive lesions and to a prolonged, poorly tolerated procedure. Here, we show that personalized virtual-heart technology based on cardiac imaging and computational modelling can identify optimal infarct-related VT ablation targets in retrospective animal (five swine) and human studies (21 patients), as well as in a prospective feasibility study (five patients). We first assessed, using retrospective studies (one of which included a proportion of clinical images with artefacts), the capability of the technology to determine the minimum-size ablation targets for eradicating all VTs. In the prospective study, VT sites predicted by the technology were targeted directly, without relying on prior electrical mapping. The approach could improve infarct-related VT ablation guidance, where accurate identification of patient-specific optimal targets could be achieved on a personalized virtual heart before the clinical procedure.
D. Sahasrabudhe, M. Berzins, J. Schmidt.
Node failure resiliency for Uintah without checkpointing, In Concurrency and Computation: Practice and Experience, pp. e5340. 2019.
The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency coupled with I/O challenges at exascale may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm-Based Fault Tolerance (ABFT) using Adaptive Mesh Refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables such as positivity or boundedness may be violated during interpolation. These challenges can be addressed by the combination of two techniques: 1. a fault-tolerant MPI implementation to recover from runtime node failures, and 2. high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "Limited Essentially Non-Oscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault-tolerant MPI - ULFM to recover from runtime failure, and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10x faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.
D. Sahasrabudhe, E. T. Phipps, S. Rajamanickam, M. Berzins. A Portable SIMD Primitive using Kokkos for Heterogeneous Architectures, In Sixth Workshop on Accelerator Programming Using Directives (WACCPD), 2019.
As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a gpu back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (simd) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (knl), and also facilitates Single Instruction Multiple Threads (simt) based execution on gpus. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new simd primitive is used on different architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against efficiently vectroized baseline, and to evaluate the new feature called the \logical vector length" (lvl). The simd primitive provides portability across cpus and gpus without any performance degradation being observed experimentally.
As application scientists develop and deploy simulation codes on to leadership-class computing resources, there is a need to instrument these codes to better understand performance to efficiently utilize these resources. This instrumentation may come from independent third-party tools that generate and store performance metrics or from custom instrumentation tools built directly into the application. The metrics collected are then available for visual analysis, typically in the domain in which there were collected. In this paper, we introduce an approach to visualize and analyze the performance metrics in situ in the context of the machine, application, and communication domains (MAC model) using a single visualization tool. This visualization model provides a holistic view of the application performance in the context of the resources where it is executing.