A cached transformer incorporates a memory cache to enhance its capacity to handle long-range dependencies in sequences. Traditional transformers struggle to capture relationships between distant elements in long sequences.
The key component of a cached transformer is the Gated Recurrent Cache (GRC). The GRC dynamically stores token embeddings based on their relevance and historical significance, serving as a differentiable memory cache that lets the model attend to both current and previously seen information.
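To make the idea concrete, here is a minimal PyTorch sketch of a gated recurrent cache update under assumed design choices: the old cache is blended with a pooled summary of the current token embeddings through a learned sigmoid gate. The class name, the pooling step, and the gate design are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class GatedRecurrentCache(nn.Module):
    """Minimal sketch of a gated recurrent memory cache (illustrative, not the paper's exact rule).

    The cache holds a fixed number of memory slots. At every step it is blended
    with a summary of the current token embeddings through a learned sigmoid
    gate, so older content decays smoothly instead of being overwritten.
    """

    def __init__(self, d_model: int, cache_len: int = 64):
        super().__init__()
        self.cache_len = cache_len
        # Pools the current tokens down to `cache_len` slots (assumed design choice).
        self.pool = nn.AdaptiveAvgPool1d(cache_len)
        # Produces a per-slot, per-channel gate from the old cache and the new summary.
        self.gate_proj = nn.Linear(2 * d_model, d_model)

    def init_cache(self, batch: int, d_model: int, device=None) -> torch.Tensor:
        # Start from an empty (zero) memory.
        return torch.zeros(batch, self.cache_len, d_model, device=device)

    def forward(self, cache: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model) -> summary: (batch, cache_len, d_model)
        summary = self.pool(tokens.transpose(1, 2)).transpose(1, 2)
        gate = torch.sigmoid(self.gate_proj(torch.cat([cache, summary], dim=-1)))
        # Convex blend: the gate decides how much of the old cache to keep.
        return gate * cache + (1.0 - gate) * summary
```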
Input tokens are first converted into numerical embedding vectors. The GRC processes these vectors and stores the relevant information in its cache. The transformer's self-attention mechanism can then attend both to the current input tokens and to the cached historical information, performing standard self-attention over the combined current and cached representations. The result is further refined through normalization and feed-forward layers. Finally, the GRC updates its cache based on the current input tokens, ensuring that the stored information stays relevant.
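The sketch below illustrates this flow, continuing the GatedRecurrentCache example above (same imports and class). Self-attention reads the concatenation of cached and current representations, the output is refined by normalization and a feed-forward layer, and the cache is then refreshed. The layer wiring and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
class CachedTransformerBlock(nn.Module):
    """Sketch of one cached-transformer layer: attention reads both the current
    tokens and the GRC memory, then the memory is refreshed. Wiring and
    hyperparameters are illustrative assumptions."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, cache_len: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.grc = GatedRecurrentCache(d_model, cache_len)

    def forward(self, x: torch.Tensor, cache: torch.Tensor):
        # Keys/values cover both the cached memory and the current tokens,
        # so each query can attend to historical and present information.
        kv = torch.cat([cache, x], dim=1)
        attn_out, _ = self.attn(query=x, key=kv, value=kv)
        x = self.norm1(x + attn_out)      # residual connection + normalization
        x = self.norm2(x + self.ffn(x))   # feed-forward refinement
        new_cache = self.grc(cache, x)    # refresh memory with current tokens
        return x, new_cache


# Usage: process a long sequence in chunks while carrying the cache forward.
block = CachedTransformerBlock()
cache = block.grc.init_cache(batch=2, d_model=256)
for chunk in torch.randn(4, 2, 128, 256):  # 4 chunks of 128 tokens each
    out, cache = block(chunk, cache)
```

Because the cache is carried across chunks, information from early chunks can still influence attention over much later ones without recomputing attention over the full history.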
Advantages:
Improved handling of long-range dependencies: the model captures relationships between distant elements in long sequences, which improves performance in tasks such as language modelling, machine translation, image classification, and instance segmentation.
Reduced computational cost compared to traditional transformers.

Cached Transformers with a GRC are a promising advance in transformer architecture, and further research is expected to explore their full capabilities and potential impact on language and vision processing. This approach is proposed by researchers from the Chinese University of Hong Kong, the University of Hong Kong, and Tencent Inc.