Since 2017, when Vaswani et al. of Google published the paper "Attention Is All You Need", the transformer architecture has been the backbone of large language models (LLMs).
More recently, non-attention architectures for language modelling have emerged, e.g. Mamba, which shows promising results across a range of experiments.
Mamba belongs to the family of state-space models (SSMs), a class of mathematical models used to describe the evolution of a system over time.
The key concepts of an SSM are: the state variables (x), which represent the internal state of the system; the state equation, which describes how the state variables change over time, in continuous or discrete time; the output equation, which relates the observed outputs of the system to its internal state; and the matrices A, B, C and D, which are the parameters of the SSM (A represents the system dynamics, B the input matrix, C the output matrix and D the feed-forward matrix).
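Written out in their standard continuous-time form (with u(t) denoting the input and y(t) the output, following common convention; these symbols are not named above), the two equations read:

```latex
\begin{aligned}
\frac{dx(t)}{dt} &= A\,x(t) + B\,u(t) \qquad &&\text{(state equation)}\\
y(t) &= C\,x(t) + D\,u(t) \qquad &&\text{(output equation)}
\end{aligned}
```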
These SSMs are designed as linear time-invariant (LTI) systems, where A, B, C and D are constant and the system's behaviour is linear.
SSMs are formulated both in continuous time (using differential equations) and discrete time (using difference equations).
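As a rough sketch of how the move from continuous to discrete time can be done in practice (the zero-order-hold rule shown here is one common choice; the function name and toy matrices are purely illustrative, not Mamba's actual implementation):

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

def discretize_zoh(A, B, dt):
    """Convert continuous-time (A, B) into discrete-time (A_bar, B_bar)
    using a zero-order hold over a step of length dt."""
    n = A.shape[0]
    A_bar = expm(dt * A)                                  # state transition over one step
    B_bar = np.linalg.solve(A, (A_bar - np.eye(n)) @ B)   # assumes A is invertible
    return A_bar, B_bar

# Toy 2-state system with a single input channel.
A = np.array([[-1.0,  0.5],
              [ 0.0, -2.0]])
B = np.array([[1.0],
              [0.5]])
A_bar, B_bar = discretize_zoh(A, B, dt=0.1)
```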
Mamba serves as a versatile foundation for sequence models. The Mamba-3B model surpasses transformers of similar size and performs on par with transformers twice its size.
SSMs offer a different lens on sequence modelling. Because an SSM focuses on an internal state that evolves over time (hidden dynamics), it can capture long-range dependencies and context effectively.
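A minimal sketch of this idea, assuming already-discretized matrices (the names and toy values below are illustrative): the hidden state is updated once per step, so information from early inputs can still shape outputs much later in the sequence.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, D, inputs):
    """Run a discrete-time SSM over a sequence of input vectors."""
    x = np.zeros(A_bar.shape[0])      # hidden state, carried across the whole sequence
    outputs = []
    for u in inputs:                  # one step per token / time step
        x = A_bar @ x + B_bar @ u     # state equation (difference form)
        y = C @ x + D @ u             # output equation
        outputs.append(y)
    return np.stack(outputs)

# Toy usage with illustrative 2-state matrices and 16 scalar inputs.
A_bar = np.array([[0.9, 0.05],
                  [0.0, 0.80]])
B_bar = np.array([[0.10],
                  [0.05]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
outputs = ssm_scan(A_bar, B_bar, C, D, np.random.randn(16, 1))   # shape (16, 1)
```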
Attention-based LLMs are known to behave as black boxes. SSMs, by contrast, provide a structured representation of the system, which offers greater interpretability.
As an SSM, Mamba is more computationally efficient than transformers: its recurrent formulation scales linearly with sequence length, whereas self-attention scales quadratically.
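A back-of-the-envelope illustration of why (ignoring constant factors, model width and hardware effects): self-attention compares every token with every other token, while a recurrent SSM performs one fixed-cost state update per token.

```python
# Rough asymptotic comparison; real costs depend on model width and implementation.
def attention_cost(seq_len):
    return seq_len * seq_len      # every token attends to every other token

def ssm_cost(seq_len):
    return seq_len                # one constant-cost state update per token

for L in (1_000, 10_000, 100_000):
    print(f"L={L}: attention/ssm cost ratio ~ {attention_cost(L) // ssm_cost(L)}")
```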
The most striking strength of transformers is their expressive power: they excel at capturing intricate relationships and can generate diverse outputs.
SSMs may require more training data than transformers, since they have to learn both the state-transition and the observation (output) equations.
SSMs are theoretically appealing, but their implementation and optimization may be more complex than for transformers.
Though a promising development, SSMs require further exploration and comparison with transformers. Both architectures may well continue to evolve and co-exist, serving different needs and domains.