TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-Softmax

[Figure: Algorithm for calculating the efficient variant of TaylorShift.]

Abstract

The quadratic complexity of the attention mechanism represents one of the biggest hurdles to processing long sequences with Transformers. Current methods, relying on sparse representations or stateful recurrence, sacrifice token-to-token interactions, which ultimately leads to compromises in performance. This paper introduces TaylorShift, a novel reformulation of the Taylor softmax that enables computing full token-to-token interactions in linear time and space. We analytically determine the crossover points where employing TaylorShift becomes more efficient than traditional attention, and these predictions align closely with empirical measurements. Specifically, our findings demonstrate that TaylorShift enhances memory efficiency for sequences as short as 800 tokens and accelerates inference for inputs of approximately 1700 tokens and beyond. For shorter sequences, TaylorShift scales comparably to vanilla attention. Furthermore, a classification benchmark across five tasks involving long sequences reveals no degradation in accuracy when employing Transformers equipped with TaylorShift. For reproducibility, our code is available at https://github.com/tobna/TaylorShift.

Publication
In arXiv

Introduction

Ever since their introduction by Vaswani et al. (2017), Transformers have revolutionized numerous domains of deep learning, from Natural Language Processing to Computer Vision, while also underpinning the emergence of novel applications such as Large Language Models. This success stems largely from their ability to capture intricate dependencies and token-to-token interactions using the attention module.

To extend the utility of Transformers to more complex tasks, they need to be able to process long sequences. However, since the computational complexity of the attention mechanism scales quadratically with the input sequence length, $\mathcal O(N^2)$, processing twice as many sequence elements requires four times the number of computations, which hinders scaling to very long context windows. This bottleneck drives some practitioners toward approaches that compress portions of the input into single states (Dai et al., 2019; Bulatov et al., 2023; Tworkowski et al., 2023), which reduces the amount of information available at each step. Despite this progress, exploiting long context windows to significantly improve performance and to incorporate new information without retraining remains challenging. Current Transformers encounter limitations when processing extensive documents, high-resolution images, or a combination of data from multiple domains and modalities, especially given the limited resources of smaller enterprises or even individual consumers.
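As a quick sanity check of this scaling (our own back-of-the-envelope arithmetic, not a result from the paper), if attention on a sequence of length $N$ with head dimension $d$ costs on the order of $N^2 d$ operations for the $N \times N$ score matrix, then doubling the sequence length gives

$$\operatorname{cost}(2N) \approx (2N)^2\, d = 4 N^2 d \approx 4 \cdot \operatorname{cost}(N),$$

i.e. a fourfold increase in compute, and likewise in the memory needed to store the attention matrix.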

While linearly scaling Transformers have been proposed, these often experience compromised accuracy (Nauen et al., 2023), specialize in a particular domain (Zaheer et al., 2020; Liu et al., 2021), or only convey averaged global information across tokens, neglecting individual token-to-token interactions (El-Nouby et al., 2021; Babiloni et al., 2023). As a result, these models end up being ill-suited for handling longer sequences, leaving the standard Transformer as the preferred choice for many tasks due to its extensive capacity and established performance (Lin et al., 2022).

In this work, instead of focusing on linearizing the model, we approach this bottleneck by reformulating the softmax function in the attention mechanism itself, using the Taylor approximation of $\exp$. While some methods alter this function as well, they do so to split the interactions between queries and keys and compute only global average interactions (Babiloni et al., 2023; Katharopoulos et al., 2020; Choromanski et al., 2021). In contrast, our proposed approach, TaylorShift, preserves individual token-to-token interactions. Combining a novel tensor-product-based operator with the Taylor approximation of the softmax's exponential function allows us to compute full token-to-token interactions in linear time. For short sequences, TaylorShift can default back to quadratic scaling to preserve overall efficiency. Moreover, our formulation has the added benefit of adhering to concrete error bounds when viewed as an approximation of vanilla attention (Keles et al., 2023). We show that a naive implementation of this linearization is unstable and propose a novel normalization scheme that enables its practical implementation.
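To illustrate the linearization idea, the following is a minimal sketch under our own assumptions, not the paper's stabilized implementation: it uses the second-order Taylor expansion $\exp(x) \approx 1 + x + x^2/2$, and the function name and tensor shapes are hypothetical. Because $(q \cdot k)^2 = (q \otimes q) \cdot (k \otimes k)$, the quadratic term factors through $d^2$-dimensional features, so attention can be evaluated without ever materializing an $N \times N$ matrix.

```python
import torch


def taylor_attention_linear(Q, K, V):
    """Sketch of attention with the 2nd-order Taylor-softmax,
    exp(x) ~ 1 + x + x^2/2, evaluated in time linear in N.

    Q, K: (N, d) queries and keys; V: (N, d_v) values.
    """
    N, d = Q.shape
    Q = Q / d**0.5  # usual 1/sqrt(d) scaling of the attention scores

    # Degree-2 feature maps: flattened outer products, shape (N, d^2)
    Q2 = torch.einsum('ni,nj->nij', Q, Q).reshape(N, d * d)
    K2 = torch.einsum('ni,nj->nij', K, K).reshape(N, d * d)

    # Numerator: sum_j (1 + q.k_j + (q.k_j)^2 / 2) v_j, computed right-to-left
    num = V.sum(dim=0, keepdim=True)         # constant term
    num = num + Q @ (K.T @ V)                # linear term
    num = num + 0.5 * (Q2 @ (K2.T @ V))      # quadratic term

    # Normalizer: sum_j (1 + q.k_j + (q.k_j)^2 / 2), positive for all real scores
    den = N + Q @ K.sum(dim=0) + 0.5 * (Q2 @ K2.sum(dim=0))
    return num / den.unsqueeze(-1)


# Example: 4096 tokens, head dimension 32; no 4096 x 4096 matrix is formed
Q, K, V = (torch.randn(4096, 32) for _ in range(3))
out = taylor_attention_linear(Q, K, V)  # shape (4096, 32)
```

Note that this naive normalization is exactly the kind of formulation the paper identifies as numerically unstable; TaylorShift's normalization scheme, described in Section 3 of the paper, is what makes the linear-time variant practical.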

This paper starts with an exploration of related work in Section 2, providing context for our contributions. In Section 3, we introduce two implementations of TaylorShift and our novel normalization scheme. Beyond the $\mathcal O$-notation, we delve into the efficiency analysis of TaylorShift, identifying specific conditions where it excels, both theoretically (Section 4) and empirically (Section 5). Finally, we summarize and contextualize our main findings in Section 6.

For more information, see the full paper (PDF).

References

(Babiloni et al., 2023) F. Babiloni et al. “Linear Complexity Self-Attention with 3rd Order Polynomials”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

(Bulatov et al., 2023) A. Bulatov, Y. Kuratov, M. S. Burtsev “Scaling Transformers to 1M tokens and beyond with RMT”. In: arXiv preprint arXiv:2304.11062, 2023.

(Choromanski et al., 2021) K. M. Choromanski et al. “Rethinking attention with Performers”. In: International Conference on Learning Representations, 2021.

(Dai et al., 2019) Z. Dai et al. “Transformer-XL: Attentive language models beyond a fixed-length context”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

(El-Nouby et al., 2021) A. El-Nouby et al. “XCiT: Cross-Covariance Image Transformers”. In: Advances in Neural Information Processing Systems, 2021.

(Katharopoulos et al., 2020) A. Katharopoulos et al. “Transformers are RNNs: Fast autoregressive transformers with linear attention”. In: International Conference on Machine Learning, 2020.

(Keles et al., 2023) F. D. Keles et al. “On the computational complexity of self-attention”. In: International Conference on Algorithmic Learning Theory, 2023.

(Lin et al., 2022) Z. Lin et al. “A Survey of Transformers”. In: AI Open, 2022.

(Liu et al., 2021) Z. Liu et al. “Swin Transformer: Hierarchical vision transformer using shifted windows”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.

(Nauen et al., 2023) T. C. Nauen, S. Palacio, A. Dengel “Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers”. In: arXiv preprint arXiv:2308.09372, 2023.

(Tworkowski et al., 2023) A. Tworkowski et al. “Focused Transformer: Contrastive training for context scaling”. In: Thirty-seventh Conference on Neural Information Processing Systems, 2023.

(Vaswani et al., 2017) A. Vaswani et al. “Attention is all you need”. In: Advances in Neural Information Processing Systems, 2017.

(Zaheer et al., 2020) M. Zaheer et al. “Big Bird: Transformers for longer sequences”. In: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.

Tobias Christian Nauen
PhD Student

My research interests include efficiency of machine learning models, multimodal learning, and transformer models.