SCI Publications
2023
T. Kataria, B. Knudsen, S. Elhabian.
To pretrain or not to pretrain? A case study of domain-specific pretraining for semantic segmentation in histopathology, Subtitled arXiv:2307.03275, 2023.
Annotating medical imaging datasets is costly, so fine-tuning (or transfer learning) is the most effective method for digital pathology vision applications such as disease classification and semantic segmentation. However, due to texture bias in models trained on real-world images, transfer learning for histopathology applications might result in underperforming models, which necessitates the use of unlabeled histopathology data and self-supervised methods to discover domain-specific characteristics. Here, we tested the premise that histopathology-specific pretrained models provide better initializations for pathology vision tasks, i.e., gland and cell segmentation. In this study, we compare the performance of gland and cell segmentation tasks with domain-specific and non-domain-specific pretrained weights. Moreover, we investigate the data size at which domain-specific pretraining produces a statistically significant difference in performance. In addition, we investigated whether domain-specific initialization improves the effectiveness of out-of-domain testing on distinct datasets but the same task. The results indicate that performance gain using domain-specific pretraining depends on both the task and the size of the training dataset. In instances with limited dataset sizes, a significant improvement in gland segmentation performance was observed, whereas models trained on cell segmentation datasets exhibit no improvement.
S. Leventhal, A. Gyulassy, M. Heimann, V. Pascucci.
Exploring Classification of Topological Priors with Machine Learning for Feature Extraction, In IEEE Transactions on Visualization and Computer Graphics, pp. 1--12. 2023.
In many scientific endeavors, increasingly abstract representations of data allow for new interpretive methodologies and conceptualization of phenomena. For example, moving from raw imaged pixels to segmented and reconstructed objects allows researchers new insights and means to direct their studies toward relevant areas. Thus, the development of new and improved methods for segmentation remains an active area of research. With advances in machine learning and neural networks, scientists have been focused on employing deep neural networks such as U-Net to obtain pixel-level segmentations, namely, defining associations between pixels and corresponding/referent objects and gathering those objects afterward. Topological analysis, such as the use of the Morse-Smale complex to encode regions of uniform gradient flow behavior, offers an alternative approach: first, create geometric priors, and then apply machine learning to classify. This approach is empirically motivated since phenomena of interest often appear as subsets of topological priors in many applications. Using topological elements not only reduces the learning space but also introduces the ability to use learnable geometries and connectivity to aid the classification of the segmentation target. In this paper, we describe an approach to creating learnable topological elements, explore the application of ML techniques to classification tasks in a number of areas, and demonstrate this approach as a viable alternative to pixel-level classification, with similar accuracy, improved execution time, and requiring marginal training data.
J. Li, A. Pepe, C. Gsaxner, G. Luijten, Y. Jin, S. Elhabian, et al.
MedShapeNet - A Large-Scale Dataset of 3D Medical Shapes for Computer Vision, Subtitled arXiv:2308.16139v3, 2023.
We present MedShapeNet, a large collection of anatomical shapes (e.g., bones, organs, vessels) and 3D surgical instrument models. Prior to the deep learning era, the broad application of statistical shape models (SSMs) in medical image analysis is evidence that shapes have been commonly used to describe medical data. Nowadays, however, state-of-the-art (SOTA) deep learning algorithms in medical imaging are predominantly voxel-based. In computer vision, on the contrary, shapes (including voxel occupancy grids, meshes, point clouds and implicit surface models) are preferred data representations in 3D, as seen from the numerous shape-related publications in premier vision conferences, such as the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), as well as the increasing popularity of ShapeNet (about 51,300 models) and Princeton ModelNet (127,915 models) in computer vision research. MedShapeNet is created as an alternative to these commonly used shape benchmarks to facilitate the translation of data-driven vision algorithms to medical applications, and it extends the opportunities to adapt SOTA vision algorithms to solve critical medical problems. Besides, the majority of the medical shapes in MedShapeNet are modeled directly on the imaging data of real patients, and therefore it complements existing shape benchmarks consisting of computer-aided design (CAD) models well. MedShapeNet currently includes more than 100,000 medical shapes, and provides annotations in the form of paired data. It is therefore also a freely available repository of 3D models for extended reality (virtual reality - VR, augmented reality - AR, mixed reality - MR) and medical 3D printing. This white paper describes in detail the motivations behind MedShapeNet, the shape acquisition procedures, the use cases, as well as the usage of the online shape search portal: https://medshapenet.ikim.nrw/
S. Li, X. Yu, W. Xing, R.M. Kirby, A. Narayan, S. Zhe.
Multi-Resolution Active Learning of Fourier Neural Operators, Subtitled arXiv:2309.16971, 2023.
The Fourier Neural Operator (FNO) is a popular operator learning framework. It not only achieves state-of-the-art performance in many tasks, but is also highly efficient in training and prediction. However, collecting training data for the FNO can be a costly bottleneck in practice, because it often demands expensive physical simulations. To overcome this problem, we propose Multi-Resolution Active learning of FNO (MRA-FNO), which can dynamically select the input functions and resolutions to lower the data cost as much as possible while optimizing the learning efficiency. Specifically, we propose a probabilistic multi-resolution FNO and use ensemble Monte-Carlo to develop an effective posterior inference algorithm. To conduct active learning, we maximize a utility-cost ratio as the acquisition function to acquire new examples and resolutions at each step. We use moment matching and the matrix determinant lemma to enable tractable, efficient utility computation. Furthermore, we develop a cost annealing framework to avoid over-penalizing high-resolution queries at the early stage. The over-penalization is severe when the cost difference between the resolutions is significant, which often leaves active learning stuck at low-resolution queries with inferior performance. Our method overcomes this problem and applies to general multi-fidelity active learning and optimization problems. We have shown the advantage of our method in several benchmark operator learning tasks.
Z. Li, S. Liu, K. Bhavya, T. Bremer, V. Pascucci.
Instance-wise Linearization of Neural Network for Model Interpretation, Subtitled arXiv:2310.16295v1, 2023.
Neural networks have achieved remarkable success in many scientific fields. However, the interpretability of neural network models remains a major bottleneck to deploying such techniques in daily life. The challenge stems from the non-linear behavior of the neural network, which raises a critical question: how does a model use its input features to make a decision? The classical approach to this challenge is feature attribution, which assigns an importance score to each input feature and reveals its importance to the current prediction. However, current feature attribution approaches often indicate the importance of each input feature without detailing how the features are actually processed by the model internally. These attribution approaches often raise the concern of whether they highlight the correct features for a model prediction.
For a neural network model, the non-linear behavior is often caused by the model's non-linear activation units. However, the computation behind a single prediction is locally linear, because one prediction has only one activation pattern. Based on this observation, we propose an instance-wise linearization approach that reformulates the forward computation of a neural network prediction. This approach reformulates the different layers of a convolutional neural network as linear matrix multiplications. Aggregating the computation of all layers, the complex operations of a convolutional neural network prediction can be described as a single linear matrix multiplication F(x)=W⋅x+b. This equation not only provides a feature attribution map that highlights the importance of the input features but also tells exactly how each input feature contributes to a prediction. Furthermore, we discuss the application of this technique in both supervised classification and unsupervised neural network learning (parametric t-SNE dimension reduction).
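The local linearity described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small fully connected ReLU network instead of a convolutional one, and the helper names (matvec, matmul, forward, linearize, rand_layer) and the random weights are hypothetical. For one input x, the activation pattern is frozen into 0/1 masks and the layers are composed into a single affine map, so that F(x) = W⋅x + b holds for that instance.

```python
import random

def matvec(W, x):
    # Matrix-vector product on plain Python lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Matrix-matrix product on plain Python lists.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def forward(x, layers):
    # Ordinary forward pass: ReLU after every layer except the last.
    h = x
    for i, (W, b) in enumerate(layers):
        h = [z + bi for z, bi in zip(matvec(W, h), b)]
        if i < len(layers) - 1:
            h = [max(0.0, z) for z in h]
    return h

def linearize(x, layers):
    # Compose all layers into one affine map (W_eff, b_eff) that is
    # valid for the activation pattern produced by this particular x.
    n = len(x)
    W_eff = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b_eff = [0.0] * n
    h = x
    for i, (W, b) in enumerate(layers):
        W_new = matmul(W, W_eff)
        b_new = [z + bi for z, bi in zip(matvec(W, b_eff), b)]
        h = [z + bi for z, bi in zip(matvec(W, h), b)]
        if i < len(layers) - 1:
            # ReLU acts as a fixed 0/1 diagonal mask for this input.
            mask = [1.0 if z > 0 else 0.0 for z in h]
            W_new = [[m * w for w in row] for m, row in zip(mask, W_new)]
            b_new = [m * v for m, v in zip(mask, b_new)]
            h = [m * z for m, z in zip(mask, h)]
        W_eff, b_eff = W_new, b_new
    return W_eff, b_eff

def rand_layer(n_out, n_in):
    # Hypothetical random weights and biases, for illustration only.
    W = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [random.uniform(-1, 1) for _ in range(n_out)]
    return W, b

random.seed(0)
layers = [rand_layer(5, 3), rand_layer(4, 5), rand_layer(2, 4)]
x = [0.5, -1.2, 0.3]

W_eff, b_eff = linearize(x, layers)
y_direct = forward(x, layers)
y_linear = [z + bi for z, bi in zip(matvec(W_eff, x), b_eff)]
max_gap = max(abs(a - c) for a, c in zip(y_direct, y_linear))

# Per-feature contributions to output 0; they sum (plus bias) to the prediction.
attribution = [w * xi for w, xi in zip(W_eff[0], x)]
```

In this sketch, each row of W_eff multiplied elementwise by x gives a per-feature contribution to the corresponding output, which is one way to read the equation as an attribution map. The map is only valid for inputs that share the same activation pattern as x.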
H. Lin, M. Lisnic, D. Akbaba, M. Meyer, A. Lex.
Here’s what you need to know about my data: Exploring Expert Knowledge’s Role in Data Analysis, 2023.
Data-driven decision making has become the gold standard in science, industry, and public policy. Yet data alone, as an imperfect and partial representation of reality, is often insufficient to make good analysis decisions. Knowledge about the context of a dataset, its strengths and weaknesses, and its applicability for certain tasks is essential. In this work, we present an interview study with analysts from a wide range of domains and with varied expertise and experience, inquiring about the role of contextual knowledge. We provide insights into how data is insufficient in analysts' workflows and how they incorporate other sources of knowledge into their analysis. We also suggest design opportunities to better and more robustly consider both knowledge and data in analysis processes.
M. Lisnic, A. Lex, M. Kogan.
"Yeah, this graph doesn't show that": Analysis of Online Engagement with Misleading Data Visualizations, In OSF Preprints, 2023.
Attempting to make sense of a phenomenon or crisis, social media users often share data visualizations and interpretations that can be erroneous or misleading. Prior work has studied how data visualizations can mislead, but do misleading visualizations reach a broad social media audience? And if so, do users amplify or challenge misleading interpretations? To answer these questions, we conducted a mixed-methods analysis of the public’s engagement with data visualization posts about COVID-19 on Twitter. Compared to posts with accurate visual insights, our results show that posts with misleading visualizations garner more replies in which the audiences point out nuanced fallacies and caveats in data interpretations. Based on the results of our thematic analysis of engagement, we identify and discuss important opportunities and limitations to effectively leveraging crowdsourced assessments to address data-driven misinformation.
S. Liu, H. Miao, Z. Li, M. Olson, V. Pascucci, P.T. Bremer.
AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making, Subtitled arXiv preprint arXiv:2312.04494, 2023.
With recent advances in multi-modal foundation models, the previously text-only large language models (LLMs) have evolved to incorporate visual input, opening up unprecedented opportunities for various applications in visualization. Our work explores the utilization of the visual perception ability of multi-modal LLMs to develop Autonomous Visualization Agents (AVAs) that can interpret and accomplish user-defined visualization objectives through natural language. We propose the first framework for the design of AVAs and present several usage scenarios intended to demonstrate the general applicability of the proposed paradigm. The addition of visual perception allows AVAs to act as virtual visualization assistants for domain experts who may lack the knowledge or expertise to fine-tune visualization outputs. Our preliminary exploration and proof-of-concept agents suggest that this approach can be widely applicable whenever the choice of appropriate visualization parameters requires the interpretation of previous visual output. Feedback from unstructured interviews with experts in AI research, medical visualization, and radiology has been incorporated, highlighting the practicality and potential of AVAs. Our study indicates that AVAs represent a general paradigm for designing intelligent visualization systems that can achieve high-level visualization goals, paving the way for developing expert-level visualization agents in the future.
D. Long, W.W. Xing, A.S. Krishnapriyan, R.M. Kirby, S. Zhe, M.W. Mahoney.
Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels, Subtitled arXiv:2310.05387v1, 2023.
Discovering governing equations from data is important to many scientific and engineering applications. Despite promising successes, existing methods are still challenged by data sparsity as well as noise issues, both of which are ubiquitous in practice. Moreover, state-of-the-art methods lack uncertainty quantification and/or are costly in training. To overcome these limitations, we propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS). We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noise. We combine it with a Bayesian spike-and-slab prior, an ideal Bayesian sparse distribution, for effective operator selection and uncertainty quantification. We develop an expectation propagation expectation-maximization (EP-EM) algorithm for efficient posterior inference and function estimation. To overcome the computational challenge of kernel regression, we place the function values on a mesh and induce a Kronecker product construction, and we use tensor algebra methods to enable efficient computation and optimization. We show the significant advantages of KBASS on a list of benchmark ODE and PDE discovery tasks.
J. Luettgau, G. Scorzelli, V. Pascucci, M. Taufer.
Development of Large-Scale Scientific Cyberinfrastructure and the Growing Opportunity to Democratize Access to Platforms and Data, In Distributed, Ambient and Pervasive Interactions, Springer Nature Switzerland, pp. 378--389. 2023.
ISBN: 978-3-031-34668-2
DOI: 10.1007/978-3-031-34668-2_25
As researchers across scientific domains rapidly adopt advanced scientific computing methodologies, access to advanced cyberinfrastructure (CI) becomes a critical requirement in scientific discovery. Lowering the entry barriers to CI is a crucial challenge in interdisciplinary sciences requiring frictionless software integration, data sharing from many distributed sites, and access to heterogeneous computing platforms. In this paper, we explore how the challenge is not merely a factor of availability and affordability of computing, network, and storage technologies but rather the result of insufficient interfaces with an increasingly heterogeneous mix of computing technologies and data sources. With more distributed computation and data, scientists, educators, and students must invest their time and effort in coordinating data access and movements, often penalizing their scientific research. Investments in the interfaces’ software stack are necessary to help scientists, educators, and students across domains take advantage of advanced computational methods. To this end, we propose developing a science data fabric as the standard scientific discovery interface that seamlessly manages data dependencies within scientific workflows and CI.
J. Luettgau, H. Martinez, G. Tarcea, G. Scorzelli, V. Pascucci, M. Taufer.
Studying Latency and Throughput Constraints for Geo-Distributed Data in the National Science Data Fabric, In Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, ACM, pp. 325–326. 2023.
DOI: 10.1145/3588195.3595948
The National Science Data Fabric (NSDF) is our solution to the problem of addressing the data-sharing needs of the growing data science community. NSDF is designed to make sharing data across geographically distributed sites easier for users who lack technical expertise and infrastructure. By developing an easy-to-install software stack, we promote the FAIR data-sharing principles in NSDF while leveraging existing high-speed data transfer infrastructures such as Globus and XRootD. This work shows how we leverage latency and throughput information between geo-distributed NSDF sites with NSDF entry points to optimize the automatic coordination of data placement and transfer across the data fabric, which can further improve the efficiency of data sharing.
C. Ly, C. Nizinski, A. Hagen, L. McDonald IV, T. Tasdizen.
Improving Robustness for Model Discerning Synthesis Process of Uranium Oxide with Unsupervised Domain Adaptation, In Frontiers in Nuclear Engineering, 2023.
The quantitative characterization of surface structures captured in scanning electron microscopy (SEM) images has proven to be effective for discerning the provenance of an unknown nuclear material. Recently, many works have taken advantage of the powerful performance of convolutional neural networks (CNNs) to provide faster and more consistent characterization of surface structures. However, one inherent limitation of CNNs is their degradation in performance when encountering a discrepancy between training and test datasets, which limits their wider use. The common discrepancy in an SEM image dataset occurs at low-level image information due to user bias in selecting acquisition parameters and microscopes from different manufacturers. Therefore, in this study, we present a domain adaptation framework to improve the robustness of CNNs against the discrepancy in low-level image information. Furthermore, our proposed approach makes use of only unlabeled test samples to adapt a pretrained model, which is more suitable for nuclear forensics applications, for which obtaining both training and test datasets simultaneously is a challenge due to data sensitivity. Through extensive experiments, we demonstrate that our proposed approach effectively improves the performance of a model by at least 18% when encountering domain discrepancy, and can be deployed in many CNN architectures.
L.W. McDonald IV, K. Sentz, A. Hagen, B.W. Chung, T. Tasdizen, et al.
Review of Multi-Faceted Morphologic Signatures of Actinide Process Materials for Nuclear Forensic Science, In Journal of Nuclear Materials, Elsevier, 2023.
Particle morphology is an emerging signature that has the potential to identify the processing history of unknown nuclear materials. Using readily available scanning electron microscopes (SEM), the morphology of nearly any solid material can be measured within hours. Coupled with robust image analysis and classification methods, the morphological features can be quantified and support identification of the processing history of unknown nuclear materials. The viability of this signature depends on developing databases of morphological features, coupled with a rapid data analysis and accurate classification process. With developed reference methods, datasets, and throughputs, morphological analysis can be applied within days to (i) interdicted bulk nuclear materials (gram to kilogram quantities), and (ii) trace amounts of nuclear materials detected on swipes or environmental samples. This review aims to develop validated and verified analytical strategies for morphological analysis relevant to nuclear forensics.
N. Morrical, S. Zellmann, A. Sahistan, P. Shriwise, V. Pascucci.
Attribute-Aware RBFs: Interactive Visualization of Time Series Particle Volumes Using RT Core Range Queries, In IEEE Trans Vis Comput Graph, IEEE, 2023.
DOI: 10.1109/TVCG.2023.3327366
Supplemental material
Smoothed-particle hydrodynamics (SPH) is a mesh-free method used to simulate volumetric media in fluids, astrophysics, and solid mechanics. Visualizing these simulations is problematic because these datasets often contain millions, if not billions of particles carrying physical attributes and moving over time. Radial basis functions (RBFs) are used to model particles, and overlapping particles are interpolated to reconstruct a high-quality volumetric field; however, this interpolation process is expensive and makes interactive visualization difficult. Existing RBF interpolation schemes do not account for color-mapped attributes and are instead constrained to visualizing just the density field. To address these challenges, we exploit ray tracing cores in modern GPU architectures to accelerate scalar field reconstruction. We use a novel RBF interpolation scheme to integrate per-particle colors and densities, and leverage GPU-parallel tree construction and refitting to quickly update the tree as the simulation animates over time or when the user manipulates particle radii. We also propose a Hilbert reordering scheme to cluster particles together at the leaves of the tree to reduce tree memory consumption. Finally, we reduce the noise of volumetric shadows by adopting a spatially temporal blue noise sampling scheme. Our method can provide a more detailed and interactive view of these large, volumetric, time-series particle datasets than traditional methods, leading to new insights into these physics simulations.
H. Oh, R. Amici, G. Bomarito, S. Zhe, R. Kirby, J. Hochhalter.
Genetic Programming Based Symbolic Regression for Analytical Solutions to Differential Equations, Subtitled arXiv:2302.03175v1, 2023.
In this paper, we present a machine learning method for the discovery of analytic solutions to differential equations. The method utilizes an inherently interpretable algorithm, genetic programming based symbolic regression. Unlike conventional accuracy measures in machine learning, we demonstrate the ability to recover true analytic solutions, as opposed to a numerical approximation. The method is verified by assessing its ability to recover known analytic solutions for two separate differential equations. The developed method is compared to a conventional, purely data-driven genetic programming based symbolic regression algorithm. The reliability of successful evolution of the true solution, or an algebraic equivalent, is demonstrated.
H. Oh, R. Amici, G. Bomarito, S. Zhe, R.M. Kirby, J. Hochhalter.
Inherently interpretable machine learning solutions to differential equations, In Engineering with Computers, 2023.
A machine learning method for the discovery of analytic solutions to differential equations is assessed. The method utilizes an inherently interpretable machine learning algorithm, genetic programming-based symbolic regression. An advantage of its interpretability is the output of symbolic expressions that can be used to assess error in algebraic terms, as opposed to purely numerical quantities. Therefore, models output by the developed method are verified by assessing its ability to recover known analytic solutions for two differential equations, as opposed to assessing numerical error. To demonstrate its improvement, the developed method is compared to a conventional, purely data-driven genetic programming-based symbolic regression algorithm. The reliability of successful evolution of the true solution, or an algebraic equivalent, is demonstrated.
B.A. Orkild, J.A. Bergquist, E.N. Paccione, M. Lange, E. Kwan, B. Hunt, R. MacLeod, A. Narayan, R. Ranjan.
A Grid Search of Fibrosis Thresholds for Uncertainty Quantification in Atrial Flutter Simulations, In Computing in Cardiology, 2023.
Atypical Atrial Flutter (AAF) is the most common cardiac arrhythmia to develop following catheter ablation for atrial fibrillation. Patient-specific computational simulations of propagation have shown promise in prospectively predicting AAF reentrant circuits and providing useful insight to guide successful ablation procedures. These patient-specific models require a large number of inputs, each with an unknown amount of uncertainty. Uncertainty quantification (UQ) is a technique to assess how variability in a set of input parameters can affect the output of a model. However, modern UQ techniques, such as polynomial chaos expansion, require a well-defined output to map to the inputs. In this study, we aimed to explore the sensitivity of simulated reentry to the selection of fibrosis threshold in patient-specific AAF models. We utilized the image intensity ratio (IIR) method to set the fibrosis threshold in the LGE-MRI from a single patient with prior ablation. We found that the majority of changes to the duration of reentry occurred within an IIR range of 1.01 to 1.39, and that there was a large amount of variability in the resulting arrhythmia. This study serves as a starting point for future UQ studies to investigate the nonlinear relationship between fibrosis threshold and the resulting arrhythmia in AAF models.
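The thresholding step behind such a grid search can be sketched as follows. This is an illustrative assumption, not the study's pipeline: it takes the image intensity ratio of a wall voxel to be its intensity divided by the mean blood pool intensity, labels voxels whose ratio exceeds the threshold as fibrotic, and sweeps thresholds across the 1.01 to 1.39 range mentioned in the abstract with a hypothetical step of 0.02. The intensity values and function names are made up, and the reentry simulations that the study runs for each threshold are not modeled.

```python
def fibrosis_fraction(wall_intensities, blood_pool_mean, iir_threshold):
    # Fraction of wall voxels whose image intensity ratio (IIR)
    # exceeds the chosen fibrosis threshold.
    ratios = [v / blood_pool_mean for v in wall_intensities]
    flagged = sum(1 for r in ratios if r > iir_threshold)
    return flagged / len(ratios)

def threshold_grid(start, stop, step):
    # Evenly spaced thresholds from start to stop inclusive,
    # rounded to two decimals to avoid floating-point drift.
    n = int(round((stop - start) / step))
    return [round(start + i * step, 2) for i in range(n + 1)]

# Hypothetical LGE-MRI wall intensities and mean blood pool intensity.
wall = [95.0, 102.0, 110.0, 118.0, 125.0, 131.0, 140.0, 150.0]
blood_pool_mean = 100.0

grid = threshold_grid(1.01, 1.39, 0.02)
fractions = [fibrosis_fraction(wall, blood_pool_mean, t) for t in grid]

for t, f in zip(grid, fractions):
    print(f"IIR threshold {t:.2f}: fibrotic fraction {f:.3f}")
```

In a full study, each threshold would define a different fibrosis map that feeds a separate simulation; the sketch only shows how the labeled fraction shrinks as the threshold rises.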
T. A. J. Ouermi, R. M. Kirby, M. Berzins.
HiPPIS: A High-Order Positivity-Preserving Mapping Software for Structured Meshes, In ACM Trans. Math. Softw., ACM, Nov, 2023.
ISSN: 0098-3500
DOI: 10.1145/3632291
Polynomial interpolation is an important component of many computational problems. In several of these computational problems, failure to preserve positivity when using polynomials to approximate or map data values between meshes can lead to negative unphysical quantities. Currently, most polynomial-based methods for enforcing positivity are based on splines and polynomial rescaling. The spline-based approaches build interpolants that are positive over the intervals in which they are defined and may require solving a minimization problem and/or system of equations. The linear polynomial rescaling methods allow for high-degree polynomials but enforce positivity only at limited locations (e.g., quadrature nodes). This work introduces open-source software (HiPPIS) for high-order data-bounded interpolation (DBI) and positivity-preserving interpolation (PPI) that addresses the limitations of both the spline and polynomial rescaling methods. HiPPIS is suitable for approximating and mapping physical quantities such as mass, density, and concentration between meshes while preserving positivity. This work provides Fortran and Matlab implementations of the DBI and PPI methods, presents an analysis of the mapping error in the context of PDEs, and uses several 1D and 2D numerical examples to demonstrate the benefits and limitations of HiPPIS.
M. Parashar, T. Kurc, H. Klie, M.F. Wheeler, J.H. Saltz, M. Jammoul, R. Dong.
Dynamic Data-Driven Application Systems for Reservoir Simulation-Based Optimization: Lessons Learned and Future Trends, In Handbook of Dynamic Data Driven Applications Systems: Volume 2, Springer International Publishing, pp. 287--330. 2023.
DOI: 10.1007/978-3-031-27986-7_11
Since its introduction in the early 2000s, the Dynamic Data-Driven Applications Systems (DDDAS) paradigm has served as a powerful concept for continuously improving the quality of both models and data embedded in complex dynamical systems. The DDDAS unifying concept enables capabilities to integrate multiple sources and scales of data, mathematical and statistical algorithms, advanced software infrastructures, and diverse applications into a dynamic feedback loop. DDDAS has not only motivated notable scientific and engineering advances on multiple fronts, but it has also been invigorated by the latest technological achievements in artificial intelligence, cloud computing, augmented reality, robotics, edge computing, Internet of Things (IoT), and Big Data. Capabilities to handle more data in a much faster and smarter fashion are paving the road for expanding automation capabilities. The purpose of this chapter is to review the fundamental components that have shaped reservoir-simulation-based optimization in the context of DDDAS. The foundations of each component will be systematically reviewed, followed by a discussion on current and future trends oriented to highlight the outstanding challenges and opportunities of reservoir management problems under the DDDAS paradigm. Moreover, this chapter should be viewed as providing pathways for establishing a synergy between renewable energy and the oil and gas industry with the advent of the DDDAS method.
M. Parashar, I. Altintas.
Toward Democratizing Access to Science Data: Introducing the National Data Platform, In IEEE 19th International Conference on e-Science, IEEE, 2023.
DOI: 10.1109/e-Science58273.2023.10254930
Open and equitable access to scientific data is essential to addressing important scientific and societal grand challenges, and to the research enterprise more broadly. This paper discusses the importance and urgency of open and equitable data access, and explores the barriers and challenges to such access. It then introduces the vision and architecture of the National Data Platform, a recently launched project aimed at catalyzing an open, equitable and extensible data ecosystem.