AI is here, and it brings with it a new landscape of potential innovation.
By: Vineet Tiruvadi (Founder, nForm, LLC), Marek Sirendi (Head of AI, Ascertain Systems, Inc.), Mark Michalski (CEO, Ascertain Systems, Inc.)
(view original article on Towards AI)
Driven largely by the power of the Transformer (Tf) architecture, large language models (LLMs) have brought AI to the forefront of public discussion and technological innovation (Vaswani et al. [2023]).
The effectiveness of Tfs in generating human-sounding, plausible sentences has taken the world by surprise, bringing with it a flood of AI-powered products spanning every sector of service.
Some domains, like healthcare and defense, require capabilities beyond just plausible sentences — they require efficiency and efficacy in problem-solving. While transformers show impressive capability in language tasks, alternative architectures that trade off expressivity for domain-specific efficiency and efficacy will become increasingly relevant.
One promising, and structurally interesting, alternative is the general class of State Space Models. The most popular amongst these in 2024 is Mamba, developed by Albert Gu and Tri Dao (Gu and Dao [2024]). Recent investigations have confirmed improved efficiency in SSMs, even finding explanations for improvements in both training and inference (Zucchet and Orvieto [2024]).
Seeing how states flow in their ambient space can give us a lot of information beyond sequence prediction. This depiction of the Lorenz Attractor shows how states can flow, yielding trajectories and measured sequences. Img source: https://www.linkedin.com/pulse/chaos-can-seductive-sabu-francis/
Uniquely, SSMs like Mamba may also enable improved reasoning — enabling more efficient and effective Agentic AI that executes interventions. Control Theory problems are, by nature, focused on optimizing intervention — and benchmarks that utilize them may provide a more direct assessment of an LLM’s problem-solving capabilities (Kevian et al. [2024]).
In this piece, we outline the differences between Transformer-based models (Tf) and State Space-inspired models (SSM). Starting with a brief historical context, we'll outline key architectural highlights of SSMs. We then benchmark efficiency and efficacy between SSM and Tf-based models, with a novel utilization of AgentBench and ControlBench in our assessment. We'll close with a brief discussion of the technical and product promise of SSMs, and the challenges to achieving that promise.
History and Motivation
While there are a lot of exciting advances in the LLM space today, SSMs have a unique advantage — they’ve got a history of utility that stretches back more than 50 years.
History
The concept of a state space has a rich history in fields outside of LLM or natural language processing. Originally proposed by Kalman in 1960 (Kalman [1960], Yamamoto [2019]), the state space approach brought an elegance to bear on previously challenging problems.
R E Kalman had quite a lifetime impact. https://www.eng.ufl.edu/newengineer/in-memoriam/remembering-rudolf-e-kalman-1930-2016
It did so by focusing less on how measurements evolve and more on how an underlying latent state evolves. Kalman's influence on Control Theory was pervasive, with his state space framework enabling results in both inference and intervention (Antoulas [1991]).
Control Theory, a discipline of engineering, is focused on understanding and designing interventions in systems — with or without a full understanding of their inner workings (Thomas et al. [2019]). The approaches in Control Theory have been developed over decades, with recent progress in ML, AI, and compute galvanizing a field whose progress had stalled. Modern efforts in Control Theory, like Koopman Operator theory and data-driven dynamics learning, are directly informed by SSMs and demonstrate promise for optimizing intervention even in systems with complex dynamics (Bevanda et al. [2021], Kaiser et al. [2018]).
Figure 1: State Spaces vs Sequence Spaces. State Space Models are the foundation of many modern analytical techniques. By separating the state space from the measured/output sequences, we may better capture the dynamics underlying sentences. Image Source: Brunton [2019].
State space modeling has found significant utility outside of traditional Control Theory (Durbin and Koopman [2012]). The emphasis on capturing the state evolutions underlying observed trajectories, and not the trajectories directly, has yielded parsimonious models that bring prediction and interpretability. The idea that the evolution is reducible to a simple “dynamics” map led to more tractable solutions since they dissociated the measurements, and their independent dynamics, from the core task at hand: controlling states (Figure 1). While early, the use of state space models in healthcare has yielded exciting, effective insights (Heidari et al. [2024], Paninski et al. [2010]).
SSM integration into LLMs is more recent and at a shallower level than in other domains (Gu and Dao [2024]). However, if the history of Control Theory is any indication, SSMs may be the key ingredient for bringing parsimony and elegant reasoning to AI (Fahrmeir [undated]), in particular for decision-making problems that map onto those already studied and solved via Control Theory.
Motivation
While Transformer-based models have demonstrated impressive performance in benchmarks, SSMs may directly address several of their main limitations.
Efficiency of Inference: Tf models exhibit the quadratic bottleneck because they require communication between every pair of tokens. As applications demand larger context windows, Tf-based models become rapidly less efficient.
Explainability: Because Tf models learn a deep statistical model of token-to-token relationships, the underlying reasons for correlations remain hidden. This makes instructions arising from LLMs more correlational than causal with respect to reasoning chains.
Summarization: While Tf performance is high, Tf responses suffer from general errors and hallucinations suggestive of inappropriate summarization (think averaging between apples and oranges). Forcing prediction through a single state space may yield a simpler latent map that better captures the problem underlying the text sequences carrying it.
Intervention: A major effort in AI research is focused on providing optimal interventions, beyond most-likely sentence completion. This is a surprisingly difficult challenge, and a central topic in the field of optimal control.
Architecture
Let's examine, at the lowest level, how the components of an SSM differ from those of a Tf.
Transformers (Tf)
Transformers have a now well-studied architecture.
Figure 2: Transformer architecture, where self-attention causes compute requirements to grow quadratically. Figure adapted from Vaswani et al. [2023].
In a transformer, every token looks back at every previous token when making predictions. This requires computationally expensive matrix multiplication to be carried out during the inference of each subsequent token.
This pairwise communication means a forward pass has O(n²) time complexity in training (the dreaded quadratic bottleneck), and generating each new token autoregressively takes O(n) time. In other words, as the context size increases, the model gets slower.
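To make that scaling concrete, here is a minimal NumPy sketch of single-head causal self-attention (our own illustration, not code from the Transformer paper); the n × n score matrix it builds is the source of the quadratic cost.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence of n token embeddings.

    X: (n, d) token embeddings; Wq, Wk, Wv: (d, d) projection matrices.
    The (n, n) score matrix below is the source of the quadratic bottleneck:
    every token attends to every earlier token.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (n, n): n^2 pairwise interactions
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(mask, scores, -np.inf)   # causal mask: no peeking ahead
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # (n, d)

# Example: doubling n quadruples the size of the score matrix.
n, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)  # (8, 16)
```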
Limitations of Transformers:
01. Quadratic Bottleneck: Transformers use a self-attention mechanism that allows every token to look at every other token in the sequence, leading to training time that grows quadratically with sequence length.
02. Memory Bottleneck: Storing information about all previous tokens requires a lot of memory, making it difficult to handle very long sequences.
03. Interpretability: Transformers have myriad parameters that are not readily interpretable. This limits the ability to debug and improve models rationally post-training.
State Space Models (SSMs)
State Space Models (SSMs) employ a Control Theory-inspired framework that focuses on relevant information from the past instead of attending to everything.
Figure 3: The Mamba architecture more efficiently carries the sequence seen so far in a state that is then used to infer the next step of the sequence. Figure adapted from Gu and Dao [2024].
They do so by maintaining a state that captures the essential information about the past that is relevant for predicting the next element in the sequence.
State refers to the variables that govern how input variables are transformed over time, fully determining the future system behavior. In theory, once we have the state, there’s nothing else we need to know about the past to predict the future. Ideally, the state is a fairly small amount of information that captures the essential dynamics of the system. That is, the state in its state space fully compresses the past that occurred before it.
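As a concrete illustration, the sketch below implements the classical discrete-time linear state space recurrence x_{t+1} = A x_t + B u_t, y_t = C x_t (a minimal textbook form, not Mamba's actual parameterization). The fixed-size state x_t is the compressed summary of everything that came before.

```python
import numpy as np

def linear_ssm(A, B, C, inputs):
    """Run a discrete-time linear state space model.

    x_{t+1} = A @ x_t + B @ u_t   (state update: fixed-size summary of the past)
    y_t     = C @ x_t             (output read off the current state)

    The cost per step is constant, so processing a sequence of length n is O(n),
    no matter how long the history grows.
    """
    x = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:
        outputs.append(C @ x)
        x = A @ x + B @ u
    return np.array(outputs)

# Toy example: a 2-dimensional state compressing a scalar input stream.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
u_seq = np.ones((10, 1))
print(linear_ssm(A, B, C, u_seq).shape)  # (10, 1)
```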
Mamba is one of the more popular implementations of SSM, alongside S4 and H3 (Gu and Dao [2024], Gu et al. [2022]). By using selective state spaces, Mamba balances the benefits of SSM with the practical expectations of LLMs today. Its architecture addresses the limitations of the Tf architecture in several ways (a minimal code sketch of the selective update follows this list):
01. Long-term Memory: Mamba can handle very long sequences because it focuses on the most relevant information and forgets less relevant details.
02. Effectiveness/Efficiency Trade-off: Transformers are more effective for short sequences due to their ability to attend to all information, but they become less efficient for longer sequences. Mamba offers a better balance between effectiveness and efficiency for long sequences, enabling longer, more detailed responses from LLMs.
03. Information Flow: Transformers rely on short-term memory (the context window) while Mamba utilizes a "hard drive"-like long-term memory (the state).
04. Optimizes for Optimizers: One recent result explains SSM efficacy and efficiency in terms of diagonalizing the underlying inference (Zucchet and Orvieto [2024]).
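The sketch below is our own toy version of the key idea, loosely inspired by Mamba's selection mechanism rather than a faithful reimplementation: the update gates are computed from the current input, so the model can choose per token what to write into, and what to forget from, its fixed-size state, while keeping the per-token cost constant.

```python
import numpy as np

def selective_ssm_step(x, u, W_forget, W_input):
    """One step of a toy input-dependent ("selective") state update.

    Unlike the fixed A and B matrices of a classical SSM, the gates below are
    computed from the current input u, so what is kept and what is forgotten
    depends on the content of the token being processed. This captures the
    spirit of Mamba's selection mechanism, not its exact parameterization.
    """
    forget = 1.0 / (1.0 + np.exp(-(W_forget @ u)))  # in (0, 1): how much state to keep
    write = np.tanh(W_input @ u)                    # candidate information to store
    return forget * x + (1.0 - forget) * write      # fixed-size state, O(1) per token

# Processing a length-n sequence costs O(n) total, versus O(n^2) for attention.
rng = np.random.default_rng(0)
d_state, d_in, n = 4, 8, 16
W_forget = rng.normal(size=(d_state, d_in))
W_input = rng.normal(size=(d_state, d_in))
x = np.zeros(d_state)
for u in rng.normal(size=(n, d_in)):
    x = selective_ssm_step(x, u, W_forget, W_input)
print(x.shape)  # (4,)
```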
Benchmarking
Efficiency in terms of compute resources becomes critical at scale and for edge applications, while efficacy in instruction-following and problem solving is crucial for AI decision-making and planning.
Preliminary benchmarks in the literature are demonstrating Mamba’s improved performance (Xu [2024], Waleffe et al. [2024]), but more systematic benchmarking is needed to better understand the capabilities and limitations of the architecture.
To compare SSM with Tf architectures, we benchmarked efficiency and efficacy, with efficacy defined in two ways:
01. Agentic capabilities as determined through AgentBench.
02. Reasoning as determined through ControlBench.
Efficiency
The Mamba architecture has a smaller footprint (Figure 3 vs. Figure 2) and lower compute requirements by design. A priori, we expect SSMs to be more efficient.
Preliminary benchmarks confirm this efficiency (Gu and Dao [2024], Xu [2024], Figure 4). Inference-time throughput of Mamba demonstrated significant improvement over Tf (Figure 4).
Figure 4: SSMs are likely to be more performant than transformers. a) Beyond the core architectural differences, SSMs have optimizations that can yield uniformly improved training efficiency. b) The throughput at inference time can achieve significant gains over a Tf, as expected given the streamlined architecture. Figure adapted from Gu and Dao [2024].
This translates to quicker responses in user-facing applications and more natural interactions, provided the underlying hardware libraries are suitably optimized.
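For readers who want to check throughput on their own hardware, a rough timing harness can look like the sketch below; `generate_fn` is a placeholder for whichever model call is under test (e.g., a Mamba or Transformer generation method), not a specific library API.

```python
import time

def tokens_per_second(generate_fn, prompt, n_new_tokens, n_runs=3):
    """Rough inference-throughput estimate for any autoregressive generate_fn.

    generate_fn(prompt, n_new_tokens) is a placeholder: plug in the generation
    call of the model under test. Returns the best-of-n_runs tokens/second.
    """
    best = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn(prompt, n_new_tokens)
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, n_new_tokens / elapsed)
    return best

# Example with a dummy generator standing in for a real model call.
def dummy_generate(prompt, n_new_tokens):
    return prompt + " x" * n_new_tokens

print(f"{tokens_per_second(dummy_generate, 'hello', 256):.1f} tokens/s")
```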
Efficacy: Instruction Following
Efficacy of LLMs is a critical component in any application — does the LLM actually generate something useful towards some goal? This is particularly important in Agent-Based AI, where outputs are chained together and errors can compound (Liu et al. [2023]).
A first test of efficacy is instruction following — how well does Mamba perform in the database section of AgentBench?
Figure 5: DBBench Results. a) Jamba, a Tf/SSM hybrid model, performed at the bottom of the API list. b) Jamba performed better than all tested open-source models.
Looking at Mamba in DBBench, we see poor performance (Figure 5), likely as a result of fine-tuning towards assistant-level performance and away from agent-level uses. This makes it clear that publicly available Mamba LLMs are currently insufficient for Agentic workflows without additional configuration.
We also assessed efficacy of SSM indirectly through a hybrid SSM/Tf model called Jamba (Lieber et al. [2024]). Jamba performed at the bottom of the API list (Table 1a) and towards the top of the open-source LLM list (Table 1b). Given the relative lack of optimizations performed for SSM, we interpret this result as a promising starting point for SSM.
Efficacy: Problem Solving
To compare effectiveness in problem solving, we used ControlBench (Kevian et al. [2024]). This benchmark uses classic problems in Control Theory to assess particular types of problem solving.
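As an example of the flavor of problem ControlBench draws on (our own illustrative instance, not an item taken from the benchmark), a model might be asked whether a discrete-time linear system x_{t+1} = A x_t is stable; this reduces to checking that every eigenvalue of A lies inside the unit circle.

```python
import numpy as np

def is_stable_discrete(A):
    """Stability check for the discrete-time linear system x_{t+1} = A @ x_t.

    The system is asymptotically stable iff every eigenvalue of A has
    magnitude strictly less than 1 (i.e., lies inside the unit circle).
    """
    return bool(np.all(np.abs(np.linalg.eigvals(A)) < 1.0))

print(is_stable_discrete(np.array([[0.5, 0.1], [0.0, 0.9]])))  # True
print(is_stable_discrete(np.array([[1.2, 0.0], [0.0, 0.3]])))  # False
```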
We found that SSMs, while still new and relatively underoptimized compared to Transformers, achieve middle-of-the-pack performance in inference-time tokens per second (Table 1).
One major challenge in directly comparing SSM to Tf is the lack of standardization in AI model benchmarking efforts. Direct comparison of Mamba with Tf architectures is desirable, but requires further standardization of benchmarking tools.
Our goal here was to isolate the contributions of the architecture, but several major challenges were present. Importantly, we actively ignored models that achieve higher performance (Figure 5) due to the presence of other, opaque components beyond the core LLM — this confounded our goal of comparing SSM to Tf. LLMs require careful fine-tuning to achieve the level of precision and robustness needed in agentic pipelines — and this careful fine-tuning is outside the scope of our efforts here.
Ultimately, an empirical understanding of out-of-the-box performance can help determine the best architecture for product R&D, and our efforts aim to push us in that direction.
Contrast and Context
So how do SSMs contrast with Tf in the technical and product domains? What are we to take away from this, and other, experiments with SSMs?
Technical Considerations
SSMs are exciting for various technical reasons, and their streamlined architecture already demonstrates improvements in standard efficiency benchmarks (Gu and Dao [2024]). While transformers excel at sentence completion, they are not very efficient, especially as context grows. Because they store everything from their past, their requirements grow rapidly — the quadratic bottleneck (Keles et al. [2022]).
Unlike Transformers, RNNs forget a lot of information, only recalling a small amount in their hidden state and discarding the rest. They are very efficient in that the map they’re carrying for prediction, in the state space, is small. Yet they are less effective as discarded information cannot be recovered. Mamba lies in the middle and attempts to capture the best of both worlds. More generally, SSMs provide a powerful alternative architecture to the transformer that carries along a summary state of the previous sequences.
Major technical challenges exist with SSMs, as we can see in our benchmarks of Mamba. One of the biggest hurdles is the lack of optimization in the SSM space compared to the Tf space. Another challenge is the potentially critical loss of expressivity — SSMs may be more appropriate for narrow-task situations and not for current efforts to generalize.
Product Implications
Real-world products operate in resource-constrained, time-sensitive environments — Tf-based LLMs may not scale feasibly in the consumer space. SSMs provide a potential alternative for resource-constrained situations that may complement larger, Tf-based models.
Beyond this efficiency improvement, the ability to guide and shape SSM summarization may yield more interpretable and tunable models. In the setting of healthcare, this can translate directly to improved reliability and robustness, especially as a growing effort to build systems models in biology continues (Angarita-Rodríguez et al. [2023]).
As agent-based AI grows, the ability of LLM agents to reason between words and actions will translate to a competitive edge in products, and SSMs that prioritize interpretability of their latent space mappings may enable more rational improvements to models and end-to-end performance.
Final Thoughts
In this post, we highlighted aspects of state space models (SSMs) by comparing them to Transformers (Tf) and by tracing their long history and track record of impact. Their efficiencies may be crucial for rolling out agentic AI at scale and contributing to meaningful gains in productivity.
Our AgentBench testing of Mamba demonstrated poor instruction following, likely because SSM-based models have not yet been fine-tuned broadly enough. We strongly urge the community to prioritize SSMs for instruction following, as agentic AI may be a natural domain for SSMs to excel.
Our ControlBench testing of Mamba demonstrated improved complex problem solving, though comparison against a Tf-based model with more parity remains needed. These preliminary results are promising, and further study of the hypothesis that the selective state space mechanism drives this improved performance is needed.
Beyond the benchmarks, we speculate that Mamba's SSM approach opens up a wide range of interesting possibilities. In combination with other intelligence-focused benchmarks (Chollet [2024]), ControlBench may better assess the higher-order capabilities we expect in AI systems. Perhaps integration of non-language state space models from domain-specific research becomes possible in SSM-powered LLMs, providing an alternative approach to reasoning that leverages the rich prior work already informing real-world decision-making. This must be studied carefully and transparently, as some critical investigations of SSMs find hints of intrinsic weaknesses (Merrill et al. [2024]).
While we're still early in this latest resurgence of SSMs, and indeed in the LLM story more broadly, the architecture of SSMs is showing unique capabilities that will be of great interest to impactful AI products spanning all domains.
References
01. Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024.
02. R E Kalman. A new approach to linear filtering and prediction problems. J. Basic Eng., 82(1):35–45, March 1960.
03. Yutaka Yamamoto. The influence of R. E. Kalman — state space theory, realization, and sampled-data systems. Annu. Rev. Control, 47:1–6, January 2019.
04. Athanasios C Antoulas. Mathematical System Theory. Springer Berlin Heidelberg, 1991.
05. Peter J Thomas, Mette Olufsen, Rodolphe Sepulchre, Pablo A Iglesias, Auke Ijspeert, and Manoj Srinivasan. Control theory in biology and medicine : Introduction to the special issue. Biol. Cybern., 113(1–2):1–6, April 2019.
06. Petar Bevanda, Stefan Sosnowski, and Sandra Hirche. Koopman operator dynamical models: Learning, analysis and control. arXiv, February 2021.
07. E Kaiser, J N Kutz, and S L Brunton. Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proc. Math. Phys. Eng. Sci., 474(2219):20180335, November 2018.
08. Steven L. Brunton. Notes on koopman operator theory. In Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, 2019. URL https://api.semanticscholar.org/CorpusID:229348032
09. James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods. Oxford University Press, 05 2012. ISBN 9780199641178. doi:10.1093/acprof:oso/9780199641178.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199641178.001.0001
10. Moein Heidari, Sina Ghorbani Kolahi, Sanaz Karimijafarbigloo, Bobby Azad, Afshin Bozorgpour, Soheila Hatami, Reza Azad, Ali Diba, Ulas Bagci, Dorit Merhof, and Ilker Hacihaliloglu. Computation-efficient era: A comprehensive survey of state space models in medical image analysis, 2024.
11. Liam Paninski, Yashar Ahmadian, Daniel Gil Ferreira, Shinsuke Koyama, Kamiar Rahnama Rad, Michael Vidne, Joshua Vogelstein, and Wei Wu. A new look at state-space models for neural data. J. Comput. Neurosci., 29(1–2): 107–126, August 2010.
12. Ludwig Fahrmeir. State space models: a brief history and some recent developments. CiteSeerX, undated. URL https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=7393027fc27ed188fe5f0ae54e033f2a1dc8d48f
13. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
14. Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR), 2022.
15. Nicolas Zucchet and Antonio Orvieto. Recurrent neural networks: vanishing and exploding gradients are not the end of the story, 2024.
16. Zhichao Xu. Rankmamba: Benchmarking mamba’s document ranking performance in the era of transformers, 2024.
17. Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, and Bryan Catanzaro. An empirical study of mamba-based language models, 2024.
18. Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents, 2023.
19. Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. Jamba: A hybrid transformer-mamba language model, 2024.
20. Darioush Kevian, Usman Syed, Xingang Guo, Aaron Havens, Geir Dullerud, Peter Seiler, Lianhui Qin, and Bin Hu. Capabilities of large language models in control engineering: A benchmark study on gpt-4, claude 3 opus, and gemini 1.0 ultra, 2024.
21. Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, and Chinmay Hegde. On the computational complexity of self-attention, 2022.
22. Andrea Angarita-Rodríguez, Yeimy González-Giraldo, Juan J Rubio-Mesa, Andrés Felipe Aristizábal, Andrés Pinzón, and Janneth González. Control theory and systems biology: Potential applications in neurodegeneration and search for therapeutic targets. Int. J. Mol. Sci., 25(1), December 2023.
23. François Chollet. ARC-AGI: The abstraction and reasoning corpus. https://github.com/fchollet/ARC-AGI, n.d. Accessed: 2024-06-24.
24. William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models, 2024.