Designed especially for neurobiologists, FluoRender is an interactive tool for multi-channel fluorescence microscopy data visualization and analysis.
Deep brain stimulation
BrainStimulator is a set of networks that are used in SCIRun to perform simulations of brain stimulation such as transcranial direct current stimulation (tDCS) and magnetic transcranial stimulation (TMS).
Developing software tools for science has always been a central vision of the SCI Institute.

SCI Publications

2022


V. Keshavarzzadeh, R.M. Kirby, A. Narayan. “Variational Inference for Nonlinear Inverse Problems via Neural Net Kernels: Comparison to Bayesian Neural Networks, Application to Topology Optimization,” Subtitled “arXiv:2205.03681,” 2022.

ABSTRACT

Inverse problems and, in particular, inferring unknown or latent parameters from data are ubiquitous in engineering simulations. A predominant viewpoint in identifying unknown parameters is Bayesian inference where both prior information about the parameters and the information from the observations via likelihood evaluations are incorporated into the inference process. In this paper, we adopt a similar viewpoint with a slightly different numerical procedure from standard inference approaches to provide insight about the localized behavior of unknown underlying parameters. We present a variational inference approach which mainly incorporates the observation data in a point-wise manner, i.e. we invert a limited number of observation data leveraging the gradient information of the forward map with respect to parameters, and find true individual samples of the latent parameters when the forward map is noise-free and one-to-one. For statistical calculations (as the ultimate goal in simulations), a large number of samples are generated from a trained neural network which serves as a transport map from the prior to posterior latent parameters. Our neural network machinery, developed as part of the inference framework and referred to as Neural Net Kernels (NNK), is based on hierarchical (deep) kernels which provide greater flexibility for training compared to standard neural networks. We showcase the effectiveness of our inference procedure in identifying bimodal and irregular distributions compared to a number of approaches including Markov Chain Monte Carlo sampling approaches and a Bayesian neural network approach.



D. Klotzl, T. Krake, Y. Zhou, I. Hotz, B. Wang, D. Weiskopf. “Local Bilinear Computation of Jacobi Sets,” In The Visual Computer, 2022.

ABSTRACT

We propose a novel method for the computation of Jacobi sets in 2D domains. The Jacobi set is a topological descriptor based on Morse theory that captures gradient alignments among multiple scalar fields, which is useful for multi-field visualization. Previous Jacobi set computations use piecewise linear approximations on triangulations that result in discretization artifacts like zig-zag patterns. In this paper, we utilize a local bilinear method to obtain a more precise approximation of Jacobi sets by preserving the topology and improving the geometry. Consequently, zig-zag patterns on edges are avoided, resulting in a smoother Jacobi set representation. Our experiments show a better convergence with increasing resolution compared to the piecewise linear method. We utilize this advantage with an efficient local subdivision scheme. Finally, our approach is evaluated qualitatively and quantitatively in comparison with previous methods for different mesh resolutions and across a number of synthetic and real-world examples.



D. Klötzl, T. Krake, Y. Zhou, J. Stober, K. Schulte, I. Hotz, B. Wang, D. Weiskopf. “Reduced Connectivity for Local Bilinear Jacobi Sets,” Subtitled “arXiv:2208.07148,” 2022.

ABSTRACT

We present a new topological connection method for the local bilinear computation of Jacobi sets that improves the visual representation while preserving the topological structure and geometric configuration. To this end, the topological structure of the local bilinear method is utilized, which is given by the nerve complex of the traditional piecewise linear method. Since the nerve complex consists of higher-dimensional simplices, the local bilinear method (visually represented by the 1-skeleton of the nerve complex) leads to clutter via crossings of line segments. Therefore, we propose a homotopy-equivalent representation that uses different collapses and edge contractions to remove such artifacts. Our new connectivity method is easy to implement, comes with only little overhead, and results in a less cluttered representation.



R. Lanfredi, J.D. Schroeder, T. Tasdizen. “Localization supervision of chest x-ray classifiers using label-specific eye-tracking annotation,” Subtitled “arXiv:2207.09771,” 2022.

ABSTRACT

Convolutional neural networks (CNNs) have been successfully applied to chest x-ray (CXR) images. Moreover, annotated bounding boxes have been shown to improve the interpretability of a CNN in terms of localizing abnormalities. However, only a few relatively small CXR datasets containing bounding boxes are available, and collecting them is very costly. Opportunely, eye-tracking (ET) data can be collected in a non-intrusive way during the clinical workflow of a radiologist. We use ET data recorded from radiologists while dictating CXR reports to train CNNs. We extract snippets from the ET data by associating them with the dictation of keywords and use them to supervise the localization of abnormalities. We show that this method improves a model's interpretability without impacting its image-level classification.



D. Lange, S. Sahai, J.M. Phillips, A. Lex. “Ferret: Reviewing Tabular Datasets for Manipulation,” Subtitled “OSF Preprint,” 2022.

ABSTRACT

How do we ensure the veracity of science? The act of manipulating or fabricating scientific data has led to many high-profile fraud cases and retractions. Detecting manipulated data, however, is a challenging and time-consuming endeavor. Automated detection methods are limited due to the diversity of data types and manipulation techniques. Furthermore, patterns automatically flagged as suspicious can have reasonable explanations. Instead, we propose a nuanced approach where experts analyze tabular datasets, eg, as part of the peer-review process, using a guided, interactive visualization approach. In this paper, we present an analysis of how manipulated datasets are created and the artifacts these techniques generate. Based on these findings, we propose a suite of visualization methods to surface potential irregularities. We have implemented these methods in Ferret, a visualization tool for data forensics work. Ferret makes potential data issues salient and provides guidance on spotting signs of tampering and differentiating them from truthful data.



Z. Li, S. Liu, X. Yu, K. Bhavya, J. Cao, J. Diffenderfer, P.T. Bremer, V. Pascucci. ““Understanding Robustness Lottery”: A Comparative Visual Analysis of Neural Network Pruning Approaches,” Subtitled “arXiv preprint arXiv:2206.07918,” 2022.

ABSTRACT

Deep learning approaches have provided state-of-the-art performance in many applications by relying on extremely large and heavily overparameterized neural networks. However, such networks have been shown to be very brittle, not generalize well to new uses cases, and are often difficult if not impossible to deploy on resources limited platforms. Model pruning, i.e., reducing the size of the network, is a widely adopted strategy that can lead to more robust and generalizable network -- usually orders of magnitude smaller with the same or even improved performance. While there exist many heuristics for model pruning, our understanding of the pruning process remains limited. Empirical studies show that some heuristics improve performance while others can make models more brittle or have other side effects. This work aims to shed light on how different pruning methods alter the network's internal feature representation, and the corresponding impact on model performance. To provide a meaningful comparison and characterization of model feature space, we use three geometric metrics that are decomposed from the common adopted classification loss. With these metrics, we design a visualization system to highlight the impact of pruning on model prediction as well as the latent feature embedding. The proposed tool provides an environment for exploring and studying differences among pruning methods and between pruned and original model. By leveraging our visualization, the ML researchers can not only identify samples that are fragile to model pruning and data corruption but also obtain insights and explanations on how some pruned …



S. Li, R.M. Kirby, S. Zhe. “Decomposing Temporal High-Order Interactions via Latent ODEs,” In Proceedings of the 39 th International Conference on Machine Learning, 2022.

ABSTRACT

High-order interactions between multiple objects are common in real-world applications. Although tensor decomposition is a popular framework for high-order interaction analysis and prediction, most methods cannot well exploit the valuable timestamp information in data. The existent methods either discard the timestamps or convert them into discrete steps or use over-simplistic decomposition models. As a result, these methods might not be capable enough of capturing complex, finegrained temporal dynamics or making accurate predictions for long-term interaction results. To overcome these limitations, we propose a novel Temporal High-order Interaction decompoSition model based on Ordinary Differential Equations (THIS-ODE). We model the time-varying interaction result with a latent ODE. To capture the complex temporal dynamics, we use a neural network (NN) to learn the time derivative of the ODE state. We use the representation of the interaction objects to model the initial value of the ODE and to constitute a part of the NN input to compute the state. In this way, the temporal relationships of the participant objects can be estimated and encoded into their representations. For tractable and scalable inference, we use forward sensitivity analysis to efficiently compute the gradient of ODE state, based on which we use integral transform to develop a stochastic mini-batch learning algorithm. We demonstrate the advantage of our approach in simulation and four real-world applications.



S. Li, Z Wang, R.M. Kirby, S. Zhe. “Infinite-Fidelity Coregionalization for Physical Simulation,” Subtitled “arXiv:2207.00678,” 2022.

ABSTRACT

Multi-fidelity modeling and learning are important in physical simulation-related applications. It can leverage both low-fidelity and high-fidelity examples for training so as to reduce the cost of data generation while still achieving good performance. While existing approaches only model finite, discrete fidelities, in practice, the fidelity choice is often continuous and infinite, which can correspond to a continuous mesh spacing or finite element length. In this paper, we propose Infinite Fidelity Coregionalization (IFC). Given the data, our method can extract and exploit rich information within continuous, infinite fidelities to bolster the prediction accuracy. Our model can interpolate and/or extrapolate the predictions to novel fidelities, which can be even higher than the fidelities of training data. Specifically, we introduce a low-dimensional latent output as a continuous function of the fidelity and input, and multiple it with a basis matrix to predict high-dimensional solution outputs. We model the latent output as a neural Ordinary Differential Equation (ODE) to capture the complex relationships within and integrate information throughout the continuous fidelities. We then use Gaussian processes or another ODE to estimate the fidelity-varying bases. For efficient inference, we reorganize the bases as a tensor, and use a tensor-Gaussian variational posterior to develop a scalable inference algorithm for massive outputs. We show the advantage of our method in several benchmark tasks in computational physics.



Z. Li, T. Sun, H. Wang, B. Wang. “Adaptive and Implicit Regularization for Matrix Completion,” Subtitled “arXiv preprint arXiv:2208.05640,” 2022.

ABSTRACT

The explicit low-rank regularization, e.g., nuclear norm regularization, has been widely used in imaging sciences. However, it has been found that implicit regularization outperforms explicit ones in various image processing tasks. Another issue is that the fixed explicit regularization limits the applicability to broad images since different images favor different features captured by different explicit regularizations. As such, this paper proposes a new adaptive and implicit low-rank regularization that captures the low-rank prior dynamically from the training data. The core of our new adaptive and implicit low-rank regularization is parameterizing the Laplacian matrix in the Dirichlet energy-based regularization, which we call the regularization AIR. Theoretically, we show that the adaptive regularization of AIR enhances the implicit regularization and vanishes at the end of training. We validate AIR’s effectiveness on various benchmark tasks, indicating that the AIR is particularly favorable for the scenarios when the missing entries are non-uniform. The code can be found at https://github.com/lizhemin15/AIR-Net.



S. Li, J.M. Phillips, X. Yu, R.M. Kirby, S. Zhe. “Batch Multi-Fidelity Active Learning with Budget Constraints,” Subtitled “arXiv:2210.12704v1,” 2022.

ABSTRACT

Learning functions with high-dimensional outputs is critical in many applications, such as physical simulation and engineering design. However, collecting training examples for these applications is often costly, e.g. by running numerical solvers. The recent work (Li et al., 2022) proposes the first multi-fidelity active learning approach for high-dimensional outputs, which can acquire examples at different fidelities to reduce the cost while improving the learning performance. However, this method only queries at one pair of fidelity and input at a time, and hence has a risk to bring in strongly correlated examples to reduce the learning efficiency. In this paper, we propose Batch Multi-Fidelity Active Learning with Budget Constraints (BMFAL-BC), which can promote the diversity of training examples to improve the benefit-cost ratio, while respecting a given budget constraint for batch queries. Hence, our method can be more practically useful. Specifically, we propose a novel batch acquisition function that measures the mutual information between a batch of multi-fidelity queries and the target function, so as to penalize highly correlated queries and encourages diversity. The optimization of the batch acquisition function is challenging in that it involves a combinatorial search over many fidelities while subject to the budget constraint. To address this challenge, we develop a weighted greedy algorithm that can sequentially identify each (fidelity, input) pair, while achieving a near -approximation of the optimum. We show the advantage of our method in several computational physics and engineering applications.



S. Li, M. Penwarden, R.M. Kirby, S. Zhe. “Meta Learning of Interface Conditions for Multi-Domain Physics-Informed Neural Networks,” Subtitled “arXiv preprint arXiv:2210.12669,” 2022.

ABSTRACT

Physics-informed neural networks (PINNs) are emerging as popular mesh-free solvers for partial differential equations (PDEs). Recent extensions decompose the domain, applying different PINNs to solve the equation in each subdomain and aligning the solution at the interface of the subdomains. Hence, they can further alleviate the problem complexity, reduce the computational cost, and allow parallelization. However, the performance of the multi-domain PINNs is sensitive to the choice of the interface conditions for solution alignment. While quite a few conditions have been proposed, there is no suggestion about how to select the conditions according to specific problems. To address this gap, we propose META Learning of Interface Conditions (METALIC), a simple, efficient yet powerful approach to dynamically determine the optimal interface conditions for solving a family of parametric PDEs. Specifically, we develop two contextual multi-arm bandit models. The first one applies to the entire training procedure, and online updates a Gaussian process (GP) reward surrogate that given the PDE parameters and interface conditions predicts the solution error. The second one partitions the training into two stages, one is the stochastic phase and the other deterministic phase; we update a GP surrogate for each phase to enable different condition selections at the two stages so as to further bolster the flexibility and performance. We have shown the advantage of METALIC on four bench-mark PDE families.



Z. Li, H. Menon, K. Mohror, S. Liu, L. Guo, P.T. Bremer, V. Pascucci. “A Visual Comparison of Silent Error Propagation,” In IEEE Transactions on Visualization and Computer Graphics, IEEE, 2022.
DOI: 10.1109/TVCG.2022.3230636

ABSTRACT

High-performance computing (HPC) systems play a critical role in facilitating scientific discoveries. Their scale and complexity (e.g., the number of computational units and software stack) continue to grow as new systems are expected to process increasingly more data and reduce computing time. However, with more processing elements, the probability that these systems will experience a random bit-flip error that corrupts a program's output also increases, which is often recognized as silent data corruption. Analyzing the resiliency of HPC applications in extreme-scale computing to silent data corruption is crucial but difficult. An HPC application often contains a large number of computation units that need to be tested, and error propagation caused by error corruption is complex and difficult to interpret. To accommodate this challenge, we propose an interactive visualization system that helps HPC researchers understand the resiliency of HPC applications and compare their error propagation. Our system models an application's error propagation to study a program's resiliency by constructing and visualizing its fault tolerance boundary. Coordinating with multiple interactive designs, our system enables domain experts to efficiently explore the complicated spatial and temporal correlation between error propagations. At the end, the system integrated a nonmonotonic error propagation analysis with an adjustable graph propagation visualization to help domain experts examine the details of error propagation and answer such questions as why an error is mitigated or amplified by program execution.



Z. Liu, A. Narayan. “A Stieltjes algorithm for generating multivariate orthogonal polynomials,” Subtitled “arXiv preprint arXiv:2202.04843,” 2022.

ABSTRACT

Orthogonal polynomials of several variables have a vector-valued three-term recurrence relation, much like the corresponding one-dimensional relation. This relation requires only knowledge of certain recurrence matrices, and allows simple and stable evaluation of multivariate orthogonal polynomials. In the univariate case, various algorithms can evaluate the recurrence coefficients given the ability to compute polynomial moments, but such a procedure is absent in multiple dimensions. We present a new Multivariate Stieltjes (MS) algorithm that fills this gap in the multivariate case, allowing computation of recurrence matrices assuming moments are available. The algorithm is essentially explicit in two and three dimensions, but requires the numerical solution to a non-convex problem in more than three dimensions. Compared to direct Gram-Schmidt-type orthogonalization, we demonstrate on several examples in up to three dimensions that the MS algorithm is far more stable, and allows accurate computation of orthogonal bases in the multivariate setting, in contrast to direct orthogonalization approaches.



Y. Livnat, D. Maljovec, A. Gyulassy, B. Mouginot, V. Pascucci. “A Novel Tree Visualization to Guide Interactive Exploration of Multi-dimensional Topological Hierarchies,” Subtitled “arXiv preprint arXiv:2208.06952,” 2022.

ABSTRACT

Understanding the response of an output variable to multi-dimensional inputs lies at the heart of many data exploration endeavours. Topology-based methods, in particular Morse theory and persistent homology, provide a useful framework for studying this relationship, as phenomena of interest often appear naturally as fundamental features. The Morse-Smale complex captures a wide range of features by partitioning the domain of a scalar function into piecewise monotonic regions, while persistent homology provides a means to study these features at different scales of simplification. Previous works demonstrated how to compute such a representation and its usefulness to gain insight into multi-dimensional data. However, exploration of the multi-scale nature of the data was limited to selecting a single simplification threshold from a plot of region count. In this paper, we present a novel tree visualization that provides a concise overview of the entire hierarchy of topological features. The structure of the tree provides initial insights in terms of the distribution, size, and stability of all partitions. We use regression analysis to fit linear models in each partition, and develop local and relative measures to further assess uniqueness and the importance of each partition, especially with respect parents/children in the feature hierarchy. The expressiveness of the tree visualization becomes apparent when we encode such measures using colors, and the layout allows an unprecedented level of control over feature selection during exploration. For instance, selecting features from multiple scales of the hierarchy enables a more nuanced exploration. Finally, we …



J. Luettgau, C.R. Kirkpatrick, G. Scorzelli, V. Pascucci, G. Tarcea, M. Taufer. “NSDF-Catalog: Lightweight Indexing Service for Democratizing Data Delivering,” 2022.

ABSTRACT

Across domains massive amounts of scientific data are generated. Because of the large volume of information, data discoverability is often hard if not impossible, especially for scientists who have not generated the data or are from other domains. As part of the NSF-funded National Science Data Fabric (NSDF) initiative, we develop a testbed to demonstrate that these boundaries to data discoverability can be overcome. In support of this effort, we identify the need for indexing large-amounts of scientific data across scientific domains. We propose NSDF-Catalog, a lightweight indexing service with minimal metadata that complements existing domain-specific and rich-metadata collections. NSDF-Catalog is designed to facilitate multiple related objectives within a flexible microservice to: (i) coordinate data movements and replication of data from origin repositories within the NSDF federation; (ii) build an inventory of existing scientific data to inform the design of next-generation cyberinfrastructure; and (iii) provide a suite of tools for discovery of datasets for cross-disciplinary research. Our service indexes scientific data at a fine-granularity at the file or object level to inform data distribution strategies and to improve the experience for users from the consumer perspective, with the goal of allowing end-to-end dataflow optimizations



O.A. Malik, Y. Xu, N. Cheng, S. Becker, A. Doostan, A. Narayan. “Fast Algorithms for Monotone Lower Subsets of Kronecker Least Squares Problems,” Subtitled “arXiv:2209.05662v1,” 2022.

ABSTRACT

Approximate solutions to large least squares problems can be computed efficiently using leverage score-based row-sketches, but directly computing the leverage scores, or sampling according to them with naive methods, still requires an expensive manipulation and processing of the design matrix. In this paper we develop efficient leverage score-based sampling methods for matrices with certain Kronecker product-type structure; in particular we consider matrices that are monotone lower column subsets of Kronecker product matrices. Our discussion is general, encompassing least squares problems on infinite domains, in which case matrices formally have infinitely many rows. We briefly survey leverage score-based sampling guarantees from the numerical linear algebra and approximation theory communities, and follow this with efficient algorithms for sampling when the design matrix has Kronecker-type structure. Our numerical examples confirm that sketches based on exact leverage score sampling for our class of structured matrices achieve superior residual compared to approximate leverage score sampling methods.



N. Morrical, A. Sahistan, U. Güdükbay, I. Wald, V. Pascucci. “Quick Clusters: A GPU-Parallel Partitioning for Efficient Path Tracing of Unstructured Volumetric Grids,” 2022.
DOI: 10.13140/RG.2.2.34351.20648

ABSTRACT

We propose a simple, yet effective method for clustering finite elements in order to improve preprocessing times and rendering performance of unstructured volumetric grids. Rather than building bounding volume hierarchies (BVHs) over individual elements, we sort elements along a Hilbert curve and aggregate neighboring elements together, significantly improving BVH memory consumption. Then to further reduce memory consumption, we cluster the mesh on the fly into sub-meshes with smaller indices using series of efficient parallel mesh re-indexing operations. These clusters are then passed to a highly optimized ray tracing API for both point containment queries and ray-cluster intersection testing. Each cluster is assigned a maximum extinction value for adaptive sampling, which we rasterize into non-overlapping view-aligned bins allocated along the ray. These maximum extinction bins are then used to guide the placement of samples along the ray during visualization, significantly reducing the number of samples required and greatly improving overall visualization interactivity. Using our approach, we improve rendering performance over a competitive baseline on the NASA Mars Lander dataset by 6×(1FPS up to 6FPS including volumetric shadows) while simultaneously reducing memory consumption by 3×(33GB down to 11GB) and avoiding any offline preprocessing steps, enabling high quality interactive visualization on consumer graphics cards. By utilizing the full 48 GB of an RTX 8000, we improve performance of Lander by 17×(1FPS up to 17FPS), enabling new possibilities for large data exploration.



A. Narayan, Z. Liu, J. A. Bergquist, C. Charlebois, S. Rampersad, L. Rupp, D. Brooks, D. White, J. Tate, R. S. MacLeod. “UncertainSCI: Uncertainty quantification for computational models in biomedicine and bioengineering,” In Computers in Biology and Medicine, 2022.
DOI: https://doi.org/10.1016/j.compbiomed.2022.106407

ABSTRACT

Background:

Computational biomedical simulations frequently contain parameters that model physical features, material coefficients, and physiological effects, whose values are typically assumed known a priori. Understanding the effect of variability in those assumed values is currently a topic of great interest. A general-purpose software tool that quantifies how variation in these parameters affects model outputs is not broadly available in biomedicine. For this reason, we developed the ‘UncertainSCI’ uncertainty quantification software suite to facilitate analysis of uncertainty due to parametric variability.

Methods:

We developed and distributed a new open-source Python-based software tool, UncertainSCI, which employs advanced parameter sampling techniques to build polynomial chaos (PC) emulators that can be used to predict model outputs for general parameter values. Uncertainty of model outputs is studied by modeling parameters as random variables, and model output statistics and sensitivities are then easily computed from the emulator. Our approaches utilize modern, near-optimal techniques for sampling and PC construction based on weighted Fekete points constructed by subsampling from a suitably randomized candidate set.
Results:

Concentrating on two test cases—modeling bioelectric potentials in the heart and electric stimulation in the brain—we illustrate the use of UncertainSCI to estimate variability, statistics, and sensitivities associated with multiple parameters in these models.
Conclusion:

UncertainSCI is a powerful yet lightweight tool enabling sophisticated probing of parametric variability and uncertainty in biomedical simulations. Its non-intrusive pipeline allows users to leverage existing software libraries and suites to accurately ascertain parametric uncertainty in a variety of applications.



Q. C. Nguyen, T. Belnap, P. Dwivedi, A. Hossein Nazem Deligani, A. Kumar, D. Li, R. Whitaker, J. Keralis, H. Mane, X. Yue, T. T. Nguyen, T. Tasdizen, K. D. Brunisholz. “Google Street View Images as Predictors of Patient Health Outcomes, 2017–2019,” In Big Data and Cognitive Computing, Vol. 6, No. 1, Multidisciplinary Digital Publishing Institute, 2022.

ABSTRACT

Collecting neighborhood data can both be time- and resource-intensive, especially across broad geographies. In this study, we leveraged 1.4 million publicly available Google Street View (GSV) images from Utah to construct indicators of the neighborhood built environment and evaluate their associations with 2017–2019 health outcomes of approximately one-third of the population living in Utah. The use of electronic medical records allows for the assessment of associations between neighborhood characteristics and individual-level health outcomes while controlling for predisposing factors, which distinguishes this study from previous GSV studies that were ecological in nature. Among 938,085 adult patients, we found that individuals living in communities in the highest tertiles of green streets and non-single-family homes have 10–27% lower diabetes, uncontrolled diabetes, hypertension, and obesity, but higher substance use disorders—controlling for age, White race, Hispanic ethnicity, religion, marital status, health insurance, and area deprivation index. Conversely, the presence of visible utility wires overhead was associated with 5–10% more diabetes, uncontrolled diabetes, hypertension, obesity, and substance use disorders. Our study found that non-single-family and green streets were related to a lower prevalence of chronic conditions, while visible utility wires and single-lane roads were connected with a higher burden of chronic conditions. These contextual characteristics can better help healthcare organizations understand the drivers of their patients’ health by further considering patients’ residential environments, which present both …



T. Nguyen, R.G. Baraniuk, R.M. Kirby, S.J. Osher, B. Wang. “Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization,” Subtitled “arXiv preprint arXiv:2208.00579,” 2022.

ABSTRACT

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear attention and hashing tricks; efficient transformers have been proposed to reduce the quadratic complexity of transformers but significantly degrade the accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the \emphmomentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrate that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.