Paper-Conference | Tobias Nauen

When 512×512 is not Enough: Local Degradation-Aware Multi-Diffusion for Extreme Image Super-Resolution

Sun, 14 Sep 2025 13:00:00 +0000

For more information, see the .


Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers

Fri, 28 Feb 2025 00:00:00 +0000

Introduction

The Transformer architecture is one of the most successful models in deep learning, outperforming traditional models in multiple domains from language modeling to computer vision. However, a major challenge in working with Transformer models is their computational complexity of $\mathcal O(N^2)$ in the size of the input $N$. Therefore, researchers have proposed a multitude of modifications to overcome this hurdle and make Transformers more efficient.

However, it is unclear which modifications and overall strategies are the most efficient. That’s why in this paper, we will answer the following questions for the domain of image classification:

Which specific modifications and overall strategies are the most efficient?
Are these modifications even worth considering over the baseline transformer?
What other dimensions influence efficiency, and how can I scale up my setup efficiently?

We tackle these questions by training more than 45 transformer variants from scratch, ensuring fair and comparable evaluation conditions. These transformer variants have been proposed to increase the efficiency for the domains of language or computer vision. Then we measure their speed and memory requirements, both at training and inference time. We additionally compare to the theoretical metrics of parameters and FLOPs. Our analysis is based on the Pareto front, the set of models that provide an optimal tradeoff between model performance and one aspect of efficiency. It lets us analyze the complex multidimensional tradeoffs involved in judging efficiency. In our plots, Pareto optimal models have a black dot, while the others have a white dot. For an example, see .
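To make the notion of a Pareto front concrete, here is a minimal sketch of how one could compute it for a list of models. The function name `pareto_front` and the example numbers are hypothetical and only serve as illustration; they are not measurements from the benchmark.

```python
def pareto_front(models):
    """Return the names of all models that are not dominated by another model.

    Each model is a tuple (name, accuracy, cost), where higher accuracy and
    lower cost (e.g. inference time or memory) are better.
    """
    front = []
    for name, acc, cost in models:
        dominated = any(
            # The other model is at least as good in both dimensions ...
            (o_acc >= acc and o_cost <= cost)
            # ... and strictly better in at least one of them.
            and (o_acc > acc or o_cost < cost)
            for _, o_acc, o_cost in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical values for illustration only.
models = [
    ("A", 80.0, 1.0),
    ("B", 82.0, 2.0),
    ("C", 81.0, 3.0),  # dominated by B: less accurate and more costly
    ("D", 84.0, 4.0),
]
print(pareto_front(models))  # ['A', 'B', 'D']
```

A model on the front cannot be improved in one dimension without getting worse in the other, which is why the analysis uses one front per efficiency metric.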

To see the interactive plots, go down to the .

Efficient Transformers for Computer Vision

Basics of the Transformer Architecture

We briefly describe the key elements of ViT (the Transformer baseline for image classification) that have been studied to make it more efficient, as well as its key bottleneck: the $\mathcal O(N^2)$ computational complexity of self-attention. ViT is an adaptation of the original Transformer that takes an image as input and converts it into a sequence of non-overlapping patches of size $p \times p$ (usually $p = 16$). Each patch is linearly embedded into a token of size $d$, and a positional encoding is added. A classification token [CLS] is appended to the sequence, which is then fed through a Transformer encoder. There, the self-attention mechanism computes the attention weights $A$ from the queries $Q \in \mathbb R^{N \times d}$ and keys $K \in \mathbb R^{N \times d}$ for each token from the sequence:

$$ A = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_\text{head}}} \right) \in \mathbb R^{N \times N} $$

This matrix encodes the global interactions between every possible pair of tokens, but it’s also the reason for the inherent $\mathcal O(N^2)$ computational complexity of the attention mechanism. The output of attention is a sum over the values $V$ weighted by the attention weights: $X_\text{out} = AV$. After self-attention, the sequence elements are passed through a 2-layer MLP. In the end, only the [CLS] token is used for the classification decision.
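As a rough illustration of the formula above, the following NumPy sketch computes single-head self-attention and makes the $N \times N$ attention matrix explicit. The function name, the random projection weights, and the choice of a single head with $d_\text{head} = d = 64$ are assumptions made for this example; $N = 197$ corresponds to $14 \times 14$ patches of a $224 \times 224$ image plus the [CLS] token.

```python
import numpy as np


def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention; returns the output and the weights A."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)            # shape (N, N): the quadratic part
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V, A


rng = np.random.default_rng(0)
N, d = 197, 64  # 196 patches + 1 [CLS] token; d is an illustrative choice
X = rng.standard_normal((N, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

X_out, A = self_attention(X, W_q, W_k, W_v)
print(A.shape, X_out.shape)  # (197, 197) (197, 64)
```

Both the memory for `A` and the cost of the two matrix products grow with $N^2$, which is the bottleneck the efficient variants try to avoid.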

Efficiency-Improving Changes

We systematically classify the efficient models using a two-step approach:

Where does the model change the baseline ViT: At the token-mixing mechanism, the token sequence, or the MLP block?
How and using what strategy does the model change the baseline?

(i) Token Mixing

The first and most popular approach is to change the token mixing mechanism, which directly tackles the $\mathcal O(N^2)$ computational complexity of self-attention. We identify 7 strategies for changing the token mixing mechanism to make it more efficient:

Low-Rank Attention leverages the fact that $QK^\top \in \mathbb R^{N \times N}$ is a matrix of rank $r \leq d \ll N$ and approximates it by using a low-rank representation.
Sparse Attention builds on most of the attention values being very small and only explicitly calculates a subset of values of $A$.
Fixed Attention uses a fixed attention matrix for all samples.
Kernel Attention splits the $\text{softmax}$ into two functions to be applied to $Q$ and $K$ individually, so $A$ does not have to be calculated explicitly: $$ X_\text{out} = \phi(Q) \phi(K)^\top V. $$
Hybrid Attention combines the attention mechanism with convolution layers.
Fourier Attention uses the Fast Fourier Transform (FFT) to calculate the interactions in Fourier space with $\mathcal O(N \log N)$ complexity.
Non-Attention Shuffling refers to other techniques of capturing interactions without using attention.
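To illustrate why Kernel Attention avoids the explicit $N \times N$ matrix, the following sketch evaluates $\phi(Q) \phi(K)^\top V$ in two orders. The feature map $\phi(x) = \text{elu}(x) + 1$ is only one possible choice and is an assumption of this example; individual models use different feature maps and usually add a normalization term that is omitted here.

```python
import numpy as np


def phi(x):
    """Illustrative positive feature map: elu(x) + 1 (an assumption of this sketch)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


rng = np.random.default_rng(0)
N, d = 512, 32
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Left to right: materializes an (N, N) matrix, cost O(N^2 d).
quadratic = (phi(Q) @ phi(K).T) @ V
# Right to left: only a (d, d) intermediate result, cost O(N d^2).
linear = phi(Q) @ (phi(K).T @ V)

print(np.allclose(quadratic, linear))  # True
```

Both orders give the same result because matrix multiplication is associative; only the size of the intermediate result changes.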

(ii) Token Sequence

Models that change the token sequence are more prevalent in CV than in NLP. The idea is to remove redundant information and, in doing so, use the $\mathcal O(N^2)$ complexity to our advantage. Removing 30% of the tokens reduces the computational cost of self-attention by approximately 50%. The strategies we identify are:

Token Removal: Removing unimportant tokens without losing critical information.
Token Merging: Merging tokens to remove redundant information.
Summary Tokens: Condensing the information from the sequence into a small set of new tokens.
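The 30% to 50% figure stated before the list follows directly from the quadratic cost. As a short worked calculation, assuming the cost of self-attention is proportional to $N^2$ and ignoring the parts of the layer that scale linearly in $N$:

$$ \frac{\text{cost}(0.7 N)}{\text{cost}(N)} = \frac{(0.7 N)^2}{N^2} = 0.49, $$

so about half of the attention cost remains after removing 30% of the tokens.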

(iii) MLP Block

The final way of changing the architecture was only taken by two methods. Their idea was to move computations from self-attention into the efficient MLP blocks. This is done by expanding the MLPs or exchanging self-attention layers for more MLPs.

List of Models

Where?	What?	Model Name
Token Mixing	Low-Rank Attention	Linformer
		Nyströmformer
		XCiT
	Sparse Attention	Swin Transformer
		SwinV2
		HaloNet
		Routing Transformer
		Sinkhorn Transformer
		Informer
		Wave-ViT
	Fixed Attention	Synthesizer
	Kernel Attention	Performer
		Poly-SA
		Linear Transformer
		SLAB
		Hydra ViT
	Hybrid Attention	EfficientFormerV2
		EfficientViT
		Next-ViT
		CvT
		ResT
		CoaT
	Fourier Attention	FNet
		GFNet
		AFNO
	Non-Attention Shuffling	MLP-Mixer
		FastViT
		EfficientMod
		FocalNet
		SwiftFormer

Token Sequence	Token Removal	DynamicViT
		A-ViT
		EViT
	Token Merging	ToMe
	Summary Tokens	CaiT
		Token Learner
		STViT

MLP Block	More MLPs	Switch Transformer
		HiViT

Experimental Design

We conduct a series of over 200 experiments on more than 45 models.

Training Pipeline

We compare models on even grounds by training from scratch with a standardized pipeline. This pipeline is based on DeiT III, an updated version of DeiT. To reduce bias, our pipeline is relatively simple and only consists of elements commonly used in CV. In particular, we refrain from using knowledge distillation to prevent introducing bias from the choice of teacher model. Any orthogonal techniques, like quantization, sample selection, and others, are not included as they can be applied to every model and would manifest as a systematic offset in the results. To avoid issues from limited training data, we pre-train all models on ImageNet-21k.

Training Hyperparameters

	Pretrain	Finetune
Dataset	ImageNet-21k	ImageNet-1k
Epochs	90	50
LR	$3 \times 10^{-3}$	$3 \times 10^{-4}$
Schedule	cosine decay	cosine decay
Batch Size	2048	2048
Warmup Schedule	linear	linear
Warmup Epochs	5	5
Weight Decay	0.02	0.02
Gradient Clipping	1.0	1.0
Label Smoothing	0.1	0.1
Drop Path Rate	0.05	0.05
Optimizer	Lamb	Lamb
Dropout Rate	0.0	0.0
Mixed Precision	✅	✅
Augmentation	3-Augment	3-Augment
Image Resolution	$224 \times 224$ or $192 \times 192$	$224 \times 224$ or $384 \times 384$
GPUs	4 NVIDIA A100	4 or 8 NVIDIA A100

Efficiency Metrics

We track the following metrics for evaluating the model efficiency:

Number of Parameters
FLOPs
Training time in GPU-hours at batch size 2048 for the full 50 epochs of finetuning on an A100 GPU
Inference throughput in images per second at the optimal batch size on an A100 GPU
Training memory over all GPUs during finetuning at batch size 2048
Inference memory on a single GPU at batch size 1; the minimum amount of memory needed for inference

For comparability, the empirical metrics are measured using the same setup.

Results

Improved Training Pipeline

To validate the fairness of our training pipeline, we compare our ImageNet-1k accuracy with the accuracy reported in the original papers (whenever available).

Model	Orig. DeiT	Orig. Acc.	Our Acc.	Model	Orig. DeiT	Orig. Acc.	Our Acc.
ViT-S (DeiT)	✅	79.8	82.54	ViT-S (DeiT III)		82.6	82.54
XCiT-S	✅	82.0	83.65	Swin-S	✅	83.0	84.87
Swin-V2-Ti		81.7	83.09	Wave-ViT-S		82.7	83.61
Poly-SA-ViT-S		71.48	78.34	SLAB-S	✅	80.0	78.70
EfficientFormer-V2-S0		75.7${}^D$	71.53	CvT-13		83.3$\uparrow$	82.35
CoaT-Ti	✅	78.37	78.42	EfficientViT-B2		82.7$\uparrow$	81.53
NextViT-S		82.5	83.92	ResT-S	✅	79.6	79.92
FocalNet-S		83.4	84.91	SwiftFormer-S		78.5${}^D$	76.41
FastViT-S12	✅	79.8$\uparrow$	78.77	EfficientMod-S	✅	81.0	80.21
GFNet-S		80.0	81.33	EViT-S	✅	79.4	82.29
DynamicViT-S		83.0${}^D$	81.09	EViT Fuse	✅	79.5	81.96
ToMe-ViT-S	✅	79.42	82.11	TokenLearner-ViT-8		77.87$\downarrow$	80.66
STViT-Swin-Ti	✅	80.8	82.22	CaiT-S24	✅	82.7	84.91

We find that 13 out of 26 papers base their training pipelines on DeiT, making our pipeline a good fit. Additionally, we see that with our pipeline accuracy increases by $0.85$% on average. Most models reporting higher performance using the original pipeline were trained with knowledge distillation (which we avoid to reduce bias) or using a larger image resolution (which we show is inefficient).

Number of Parameters


We find that in general, the accuracy per parameter goes down as models get larger. This is especially the case for the ViT models, which are more parameter efficient than models of similar accuracy at smaller sizes (ViT-Ti) and less parameter efficient at larger sizes (ViT-B). The most parameter efficient models are Hybrid Attention models (EfficientFormerV2-S0, CoaT-Ti) and Non-Attention Shuffling models that incorporate convolutions (SwiftFormer, FastViT).

Speed

Inference Throughput

The models we evaluate often claim a superior throughput vs. accuracy trade-off compared to ViT. However, we find that ViT remains Pareto optimal at all model sizes. Only a few models (Synthesizer-FR, NextViT, and some Token Sequence models) show improvements in the Pareto front when compared to a ViT of comparable size. We find that these observations replicate on other datasets and even when using CPUs instead of GPUs.

Training Speed

This Pareto front is very similar to the one for inference time. Here, some Token Sequence models are highly efficient; in particular TokenLearner.

Generally, ViT is still a solid choice for speed.

Memory

Training Memory

Training memory again exhibits a similar pattern. There is a stark contrast between models using low-resolution and high-resolution images: the ones with high-resolution images need significantly more memory while gaining little accuracy.

Inference Memory

The Pareto front of inference memory differs the most from all the others. It is the only one where ViT is not Pareto optimal. Instead, Hybrid Attention and convolution-based models excel, similar to the . It is also the only setup where a model (EViT) using 384px resolution images is Pareto optimal.

Scaling Behaviors

Our observations reveal that fine-tuning at a higher resolution is inefficient. While it may result in improved accuracy, it entails a significant increase in computational cost, leading to a substantial reduction in throughput. In turn, scaling up the model ends up being more efficient. This can be seen when comparing the corresponding Pareto fronts for , , and .

A few examples for scaling the model vs. scaling the image size:

We see that scaling up the model size is always more efficient than scaling up the image resolution.

Correlation of Metrics

$\text{corr}(x, y)$	Params	Training Time	Training Memory	Inference Time	Inference Memory
FLOPs	0.30	0.72	0.85	0.48	0.42
Params		0.05	0.18	0.02	0.40
Training Time			0.89	0.81	0.17
Training Memory				0.71	0.48
Inference Time					0.13

The highest correlation of 0.89 is between fine-tuning time and training memory. This suggests a common underlying factor or bottleneck, possibly related to the necessity of memory reads during training. Like [ , ] before us, we find that estimating computational costs based only on theoretical metrics is unreliable: the number of parameters, for example, is nearly uncorrelated with training time (0.05) and inference time (0.02). Consequently, assessing model efficiency in practice requires the empirical measurement of throughput and memory requirements.

TL;DR: Which Transformer to Favor?

Our benchmark offers actionable insights for answering the question of which transformer to favor in the form of models and strategies to use. We have compiled an overview of these in the flowchart above. ViT remains the preferred choice overall. However, Token Sequence methods can become viable alternatives when speed and training efficiency are of importance. For scenarios with significant inference memory constraints, considering Hybrid CNN-attention models can prove advantageous.

We additionally find that it is much more efficient to scale up the model size than to scale up the image resolution. This goes against the trend of efficient models being evaluated using higher resolution images, which cancels out possible efficiency gains.

References

For references and links to the efficient transformer models, see the .

Brian R Bartoldson, Bhavya Kailkhura, and Davis Blalock. Compute-efficient deep learning: Algorithmic trends and opportunities. Journal of Machine Learning Research, 24(122):1–77, 2023.
Mostafa Dehghani, Yi Tay, Anurag Arnab, Lucas Beyer, and Ashish Vaswani. The efficiency misnomer. In International Conference on Learning Representations, 2022.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 7 2021.
Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: Revenge of the ViT. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 516–533, Cham, 2022. Springer Nature Switzerland.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.


TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-Softmax

Tue, 03 Dec 2024 13:00:00 +0000

Introduction

Despite their remarkable success, Transformers face a significant challenge when dealing with long sequences due to the quadratic complexity of the attention mechanism. This limitation hinders their application to tasks involving extensive contextual information, such as processing long documents or high-resolution images. While various approaches have been proposed to address this issue, they often sacrifice accuracy, specialize in specific domains, or neglect individual token-to-token interactions. To overcome these limitations, we introduce TaylorShift, a novel method that reformulates the softmax function in the attention mechanism using the Taylor approximation of the exponential. By combining this approximation with a tensor-product-based operator, TaylorShift achieves linear-time complexity while preserving the essential token-to-token interactions. We analyze the efficiency of TaylorShift in depth, both analytically and empirically, and find that it outperforms the standard transformer architecture in 4 out of 5 tasks.

How does TaylorShift work?

Essentially, TaylorShift works by replacing the exponential function in the softmax by its second-order Taylor approximation. For a vector $\mathbf x = [x_1, ..., x_m] = [x_i]_{i = 1}^m$:

$$ \text{softmax}(\mathbf x) = \left[\frac{\exp(x_i)}{\sum_{j} \exp(x_j)}\right]_{i = 1}^m \approx \left[ \frac{\frac{x_i^2}{2} + x_i + 1}{\sum_j \left( \frac{x_j^2}{2} + x_j + 1 \right)} \right]_{i = 1}^m = \text{T-SM}(\mathbf x) $$
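As a small illustration of this approximation, the sketch below implements both functions in plain Python and compares them on a short vector. The function names and the example input are assumptions of this sketch.

```python
import math


def softmax(x):
    """Standard softmax, shifted by the maximum for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def taylor_softmax(x):
    """T-SM: softmax with exp replaced by its second-order Taylor polynomial."""
    # 0.5 * v^2 + v + 1 = 0.5 * (v + 1)^2 + 0.5 >= 0.5, so every term is positive.
    terms = [0.5 * v * v + v + 1.0 for v in x]
    total = sum(terms)
    return [t / total for t in terms]


x = [0.1, -0.2, 0.05]  # illustrative input with small entries
for s, t in zip(softmax(x), taylor_softmax(x)):
    print(f"{s:.4f}  {t:.4f}")
```

Because the polynomial is strictly positive, T-SM always yields a valid probability distribution; the approximation is close for small inputs and degrades as the entries grow in magnitude.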

Direct TaylorShift

We call the direct implementation of attention using the Taylor softmax direct-TaylorShift. For queries $Q$, keys $K$, and values $V$, this becomes:

$$ Y = \text{T-SM}(Q K^\top) V $$

Efficient TaylorShift

Direct-TaylorShift has the same scaling behavior as standard attention. However, we can reduce its computational complexity from $\mathcal O(N^2 d)$ to $\mathcal O(N d^3)$ by reordering the operations internally. This becomes useful for long sequences, where $N \gg d$.

Let me first introduce a tensor-product-based operator:

$$ \boxtimes: \mathbb R^{N \times d} \times \mathbb R^{N \times d} \to \mathbb R^{N \times d^2}. $$

Basically, we take two lists of $d$-dimensional vectors $[a_i \in \mathbb R^d]_i$ and $[b_i \in \mathbb R^d]_i$, and for each index $i$ we multiply each element of $a_i$ with all the elements of $b_i$. The result is $d^2$-dimensional, since that is the number of possible combinations. We also write $A^{\boxtimes 2} := A \boxtimes A$.

Mathematical Details

In mathematical terms, we define $$ [A \boxtimes B]_n = \iota(A_n \otimes B_n) \in \mathbb R^{d^2} \hspace{10pt} \forall n=1, ..., N $$ Here, $A_n$, $B_n$, and $[A \boxtimes B]_n$ are the $n$-th rows of the respective matrices. $\otimes$ is the tensor product (or outer product) of two $d$-dimensional vectors and $\iota: \mathbb R^{d \times d} \to \mathbb R^{d^2}$ is the canonical isomorphism (basically, it just reorders the entries of a matrix into a vector; the exact order does not matter, as long as it's always the same one).
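A minimal NumPy sketch of the operator $\boxtimes$ could look as follows. The function name `boxtimes` and the choice of row-major flattening for $\iota$ are assumptions of this example; as noted above, any fixed ordering works.

```python
import numpy as np


def boxtimes(A, B):
    """Row-wise outer product of A and B, flattened to shape (N, d^2)."""
    N, d = A.shape
    # outer[n, k, l] = A[n, k] * B[n, l]; reshape plays the role of iota.
    outer = np.einsum("nk,nl->nkl", A, B)
    return outer.reshape(N, d * d)


A = np.array([[1.0, 2.0]])
B = np.array([[3.0, 4.0]])
print(boxtimes(A, B))  # [[3. 4. 6. 8.]]
```

For $N = 1$ and $d = 2$, the result contains all $d^2 = 4$ products of one entry of $a_1$ with one entry of $b_1$.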

It turns out that by using this operator, we can calculate TaylorShift more efficiently:

$$ Y = Y_\text{nom} \oslash Y_\text{denom} = \left[ \frac{[Y_\text{nom}]_{i, :}}{[Y_\text{denom}]_i} \right]_{i = 1}^N $$

with

$$ Y_\text{nom} = \frac 1 2 Q^{\boxtimes 2} \left( (K^{\boxtimes 2})^\top V \right) + Q (K^\top V) + \sum_\text{columns} V. $$

$Y_\text{denom}$ is the same, but with $\mathbb 1 = [1, ..., 1]$ instead of $V$.

Mathematical Details

We have

$$ Y_\text{nom} = \frac 1 2 (Q K^\top)^{\odot 2} V + Q K^\top V + \sum_\text{columns} V. $$

Let $ \pi: \{1, ..., d\} \times \{1, ..., d\} \to \{1, ..., d^2\} $ be the map that describes the reordering that $\iota$ (defined in the Mathematical Details section above) does. Then we have

$$ \left[ A^{\boxtimes 2} \right]_{n, \pi(k, \ell)} = (A_n \otimes A_n)_{k, \ell} = A_{n, k} A_{n, \ell}. $$

This allows us to linearize the squared term $(Q K^\top)^{\odot 2} V$ by using $\boxtimes$ to unroll the square of a sum along a sum of $d^2$ elements:

$$ \begin{align*} \left[(QK^\top)^{\odot 2} \right]_{i, j} =& \left( \sum_{k = 1}^d Q_{ik} K_{jk} \right)^2 \\ =& \sum_{k, \ell = 1}^d Q_{ik} Q_{i\ell} K_{jk} K_{j \ell} \\ =& \sum_{k, \ell = 1}^d \left[ Q^{\boxtimes 2} \right]_{i, \pi(k, \ell)} \left[ K^{\boxtimes 2} \right]_{j, \pi(k, \ell)} \\ =& \left[ Q^{\boxtimes 2} \right]_i \left[ K^{\boxtimes 2} \right]_j^\top \end{align*} $$

Therefore

$$ (QK^\top)^{\odot 2} V = Q^{\boxtimes 2} (K^{\boxtimes 2})^\top V, $$

which can be computed in $\mathcal O(N d^3)$ by multiplying from right to left. We can also calculate $Y_\text{nom}$ and $Y_\text{denom}$ at once by setting $V \gets V \circ \mathbb 1$, i.e. by appending a column of ones to $V$.

Normalization

We found that some intermediate results of TaylorShift tended to have very large norms, which ultimately led to training failures. We introduce the following three steps for normalization:

Normalize the queries and keys to unit norm and introduce an additional attention temperature parameter (per attention head) $\tau \in \mathbb R$: $$ q_i \gets \frac{\tau q_i}{||q_i||_2}, \hspace{10pt} k_i \gets \frac{k_i}{||k_i||_2} \hspace{10pt} \forall i=1, ..., N $$
Counteract the scaling behaviors by multiplying $Q$ and $K$ by $\sqrt[4]{d}$ and $V$ by $\frac 1 N$.
Normalize the output by multiplying by $\sqrt{\frac N d}$.

Scaling Behavior Details

Experimentally, we find the following approximate mean sizes for intermediate results with $Q, K,$ and $V$ sampled uniformly from the unit sphere:

Interm. Expr.	$(K^{\boxtimes 2})^\top V$	$(QK^\top)^{\odot 2} V$	$ QK^\top V$	$Y_\text{denom}$	$Y$
Size ($\approx$)	$\frac{N}{\sqrt d}$	$\frac N d$	$\sqrt N (1 + \frac{1}{4d})$	$N (2 + \frac{1}{d})$	$\sqrt{\frac d N}$
Size after Normalization ($\approx$)	$1$	$1$	$\frac{1}{\sqrt{Nd}} (1 + \frac{1}{4d})$	$2 + \frac{1}{d}$	$1$

Efficient-TaylorShift Algorithm

We compile all the information into the pseudocode for efficient-TaylorShift:

Find the PyTorch implementation .
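Independent of the referenced PyTorch implementation, the following is a minimal single-head NumPy sketch of the steps described so far. It is not the authors' code: the function names, the default $\tau = 1$, and the decision to apply the factor $\frac 1 N$ to the concatenated matrix $V \circ \mathbb 1$ (so that it cancels in the final division) are assumptions of this sketch. It checks numerically that the efficient ordering yields the same output as the direct one.

```python
import numpy as np


def boxtimes(A, B):
    """Row-wise outer product, flattened to shape (N, d^2)."""
    N, d = A.shape
    return np.einsum("nk,nl->nkl", A, B).reshape(N, d * d)


def normalize_qk(Q, K, tau=1.0):
    """Unit-normalize rows, apply the temperature tau, and rescale by d^(1/4)."""
    d = Q.shape[-1]
    Q = tau * d**0.25 * Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    K = d**0.25 * K / np.linalg.norm(K, axis=-1, keepdims=True)
    return Q, K


def direct_taylorshift(Q, K, V, tau=1.0):
    """Reference version: builds the (N, N) Taylor-softmax matrix explicitly."""
    N, d = Q.shape
    Q, K = normalize_qk(Q, K, tau)
    S = Q @ K.T
    T = 0.5 * S**2 + S + 1.0
    return np.sqrt(N / d) * (T / T.sum(axis=-1, keepdims=True)) @ V


def efficient_taylorshift(Q, K, V, tau=1.0):
    """Reordered version: the largest matrix product involves (N, d^2) operands."""
    N, d = Q.shape
    Q, K = normalize_qk(Q, K, tau)
    # Append a column of ones to get nominator and denominator in one pass;
    # scaling by 1/N keeps intermediate results small (assumed placement).
    V1 = np.concatenate([V, np.ones((N, 1))], axis=-1) / N
    Q2, K2 = boxtimes(Q, Q), boxtimes(K, K)
    Y = 0.5 * Q2 @ (K2.T @ V1) + Q @ (K.T @ V1) + V1.sum(axis=0, keepdims=True)
    # Last column is the denominator, the remaining columns the nominator.
    return np.sqrt(N / d) * Y[:, :-1] / Y[:, -1:]


rng = np.random.default_rng(0)
N, d = 50, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(np.allclose(direct_taylorshift(Q, K, V), efficient_taylorshift(Q, K, V)))  # True
```

The two functions only differ in the order of operations, so they agree up to floating point error; which one is cheaper depends on $N$ and $d$.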

How efficient is efficient-TaylorShift?

We analyze the circumstances where efficient-TaylorShift is more efficient than direct-TaylorShift in terms of speed, based on the number of floating point operations, and memory, based on the size of intermediate results.

Floating Point Operations

The number of floating point operations for direct-TaylorShift and efficient-TaylorShift is

$$\text{ops}_\text{dir} = 4N^2 d + 6 N^2,$$

$$\text{ops}_\text{eff} = N (4d^3 + 10d^2 + 9d + 4).$$

Therefore, there exists an $N_0 = N_0(d)$, such that efficient-TaylorShift is more efficient for all $N > N_0$. We calculate

$$ N_0 = \frac{4d^3 + 10d^2 + 9d + 4}{4d + 6} \leq d^2 + d + \frac 3 4. $$

Mathematical Details

We need the following operations:

direct-TaylorShift:

$2N^2 d$ for the multiplication of $QK^\top$,
$4N^2$ operations to apply $x \mapsto \frac 1 2 x^2 + x + 1$ element-wise to that matrix,
$2N^2$ operations for normalization,
$2N^2 d$ operations for the final multiplication by $V$ $$ \Rightarrow \text{ops}_\text{dir} = 4 N^2 d + 6 N^2 $$

efficient-TaylorShift:

$2N d^2$ operations for $K^{\boxtimes 2}$ and $Q^{\boxtimes 2}$,
$2 N d^2 (d + 1)$ operations to multiply by $V \in \mathbb R^{N \times (d+1)}$ and get $(K^{\boxtimes 2})^\top V$,
$2 N d^2 (d + 1)$ operations for the final multiplication by $Q^{\boxtimes 2}$,
$4 N d (d + 1)$ operations for computing $Q K^\top V$ from right to left,
$N (d + 1)$ operations for summing up the columns of $V$,
$3 N (d + 1)$ operations for the sums and scalar multiplication, and finally
$N d$ operations for normalization. $$ \Rightarrow \text{ops}_\text{eff} = N (2 d^2 + 4 d^2 (d + 1) + 4 d (d + 1) + 4 (d + 1) + d) $$

We derive $N_0$ by setting $\text{ops}_\text{dir} \stackrel{!}{=} \text{ops}_\text{eff}$, which is equivalent to

$$ N_0 = \frac{4d^3 + 10d^2 + 9d + 4}{4d + 6} \leq \frac{4d^3 + 6d^2}{4d + 6} + \frac{4d^2 + 6d}{4d + 6} + \frac{3d + 4.5}{4d + 6} = d^2 + d + \frac 3 4 $$

Size of Intermediate Results

The size of the largest intermediate results needed at once for direct- and efficient-TaylorShift is

$$\text{entries}_\text{dir} = \underbrace{dN}_{\text{for } V} + \underbrace{2N^2}_{\text{for } QK^\top \text{ and result}},$$

$$\text{entries}_\text{eff} = d^2(d+1) + 2dN + (d+1)N + d^2N.$$

We can again find $N_1 = N_1(d)$, such that efficient-TaylorShift is more memory efficient for all $N > N_1$. We find

$$ N_1 = \frac 1 4 \left[ d^2 + 2 d + 1 + \sqrt{d^4 + 12 d^3 + 14 d^2 + 4d + 1} \right] \leq \frac 1 2 d^2 + 2 d + \frac 1 2. $$

Mathematical Details

We count the number of entries in the largest intermediate results needed at once.

For direct-TaylorShift we need the largest intermediate memory when calculating $\text{T-SM}(QK^\top)$ from $QK^\top$.

$d N$ entries of $V$
$N^2$ entries of $QK^\top$
$N^2$ entries for the result. Note that we can’t simply reuse the memory from $QK^\top$, because we need to save at least one intermediate result when calculating $\frac 1 2 x^2 + x$.

For efficient-TaylorShift we need the most memory when calculating $(K^{\boxtimes 2})^\top V$:

$2 N d$ entries for $Q$ and $K$ (needed again later)
$N (d + 1)$ entries for $V$ (also needed again later)
$N d^2$ entries of $K^{\boxtimes 2}$
$d^2 (d + 1)$ entries for the result

We again derive $N_1$ by setting $\text{entries}_\text{dir} \stackrel{!}{=} \text{entries}_\text{eff}$ for $N_1$. Therefore

$$ N_1^2 - \frac{d^2 + 2d + 1}{2} N_1 - \frac{d^3 + d^2}{2} = 0 $$

The larger of the two solutions is

$$ \begin{align*} N_1 =& \frac 1 4 \left[ d^2 + 2d + 1 + \sqrt{(d^2 + 2d + 1)^2 + 8(d^3 + d^2)} \right] \\ =& \frac 1 4 \left[ d^2 + 2d + 1 + \sqrt{d^4 + 12 d^3 + 14 d^2 + 4d + 1} \right]. \end{align*} $$

Since

$$ (d^2 + 6d + 1)^2 = d^4 + 12d^3 + 38 d^2 + 12 d + 1 \geq d^4 + 12 d^3 + 14 d^2 + 4d + 1 $$

we have

$$ N_1 \leq \frac 1 2 d^2 + 2 d + \frac 1 2. $$

$N_0$ and $N_1$ for typical values of $d$

Table:

d	8	16	32	64	128
$N_0$	73	273	1057	4161	16513
$N_1$	47	159	574	2174	8446
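The transition points can also be computed for other values of $d$ with a few lines of Python. This is a small sketch that evaluates the exact expressions for $N_0$ and $N_1$ from above and rounds up to the next integer; the function names are assumptions of this example.

```python
import math


def n0(d):
    """Sequence length above which efficient-TaylorShift needs fewer operations."""
    return (4 * d**3 + 10 * d**2 + 9 * d + 4) / (4 * d + 6)


def n1(d):
    """Sequence length above which efficient-TaylorShift needs fewer entries."""
    root = math.sqrt(d**4 + 12 * d**3 + 14 * d**2 + 4 * d + 1)
    return (d**2 + 2 * d + 1 + root) / 4


for d in (8, 16, 32, 64, 128):
    print(d, math.ceil(n0(d)), math.ceil(n1(d)))
```

Rounding up gives the smallest integer sequence length at which the efficient variant is ahead, and it reproduces the values in the table.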


How can we increase the efficiency?

In an effort to increase the efficiency while processing the same number of tokens $N$, one might opt to reduce the embedding dimension $d_\text{emb}$. However, this might come at the cost of expressiveness. Given that efficient-TaylorShift scales with $\mathcal O(Nd^3)$, we can instead increase the number of attention heads $h$ with dimension $d = \frac{d_\text{emb}}{h}$ each. We find that the number of operations is

$$ \text{ops}_\text{eff}(\text{MHSA}) = N \left( 4 \frac{d_\text{emb}^3}{h^2} + 10 \frac{d_\text{emb}^2}{h} + 9 d_\text{emb} + 4h \right) $$

and the number of entries is

$$ \text{entries}_\text{eff}(\text{MHSA}) = \frac{d_\text{emb}^3}{h^2} + (N + 1) \frac{d_\text{emb}^2}{h} + 3N d_\text{emb} + N h, $$

which are both strictly decreasing in $h$. Therefore, efficient-TaylorShift becomes faster and needs less memory with more attention heads.
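To get a feeling for these two formulas, the sketch below evaluates them for a growing number of heads. The function names and the values $N = 1024$ and $d_\text{emb} = 256$ are illustrative assumptions, not a configuration taken from the experiments.

```python
def ops_eff_mhsa(N, d_emb, h):
    """Operations of efficient-TaylorShift with h heads of dimension d_emb / h."""
    return N * (4 * d_emb**3 / h**2 + 10 * d_emb**2 / h + 9 * d_emb + 4 * h)


def entries_eff_mhsa(N, d_emb, h):
    """Entries of the largest intermediate results with h heads."""
    return d_emb**3 / h**2 + (N + 1) * d_emb**2 / h + 3 * N * d_emb + N * h


N, d_emb = 1024, 256  # illustrative values
heads = (4, 8, 16, 32, 64)
for h in heads:
    print(h, f"{ops_eff_mhsa(N, d_emb, h):.3e}", f"{entries_eff_mhsa(N, d_emb, h):.3e}")
```

Doubling $h$ divides the dominant cubic term by four, which is why both quantities shrink quickly at first and then flatten out once the terms linear in $d_\text{emb}$ and $h$ dominate.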

Mathematical Details

We identify the extreme points of both (as functions of $h$) by setting their derivatives to zero: $$ \frac{\partial}{\partial h} \text{ops}_\text{eff}(\text{MHSA}) = N \left( -8 \frac{d_\text{emb}^3}{h^3} - 10 \frac{d_\text{emb}^2}{h^2} + 4 \right) $$ By setting $d = \frac{d_\text{emb}}{h}$, this becomes $N (-8 d^3 - 10 d^2 + 4)$, which is zero at $d \approx 0.53$. This would imply $h = \frac{1}{0.53} d_\text{emb}$, but the maximum value for $h$ is $d_\text{emb}$, since the number of dimensions $d$ has to be an integer.

Similarly, for the number of entries, we find:

$$ \frac{\partial}{\partial h} \text{entries}_\text{eff}(\text{MHSA}) = -2 d^3 - (N + 1) d^2 + N \stackrel{!}{=} 0 $$

$$ \Leftrightarrow N = (2d + N + 1) d^2 \stackrel{d > 0}{\geq} (N + 1) d^2 $$

Therefore $1 > \frac{N}{N+1} \geq d^2$ which would imply $1 > d$ and therefore $h > d_\text{emb}$ again, but the maximum value possible is $h = d_\text{emb}$.

Empirical Evaluation

Efficiency of TaylorShift Attention

We first validate our theoretical analysis on the efficiency of TaylorShift by measuring its inference time and memory usage under different configurations of $d$ and $N$:

We observe that the empirical estimate for the memory transition point $\hat N_1$ coincides almost exactly with the theoretical value $N_1$, with an error of at most $0.6\%$. The difference between the empirical speed transition point $\hat N_0$ and the theoretical one $N_0$ is approximately proportional to $d$: $\hat N_0 - N_0 \approx 18 d$. We hypothesize that the more sequential nature of efficient-TaylorShift results in more reads and writes to GPU memory, which are costly. This might indicate possible efficiency gains for efficient-TaylorShift from a low-level IO-efficient implementation.

Performance of a Transformer with TaylorShift

We test the accuracy of multiple (efficient) Transformers on a set of 5 tasks: three from the Long Range Arena benchmark and ImageNet classification at two model sizes. We use the same standard hyperparameters for all models. Models with * had to be trained in full instead of mixed precision. All tasks are classification tasks, and the table shows accuracy in percent.

Model	CIFAR (Pixel)	IMDB (Byte)	ListOps	ImageNet (Ti)	ImageNet (S)	Average
Linformer	29.2	58.1	–	64.3	76.3	(57.0)
RFA	44.9	65.8	–	–	–	(55.4)
Performer	34.2*	65.6*	35.4*	62.0*	67.1*	52.9
Reformer	44.8	63.9	47.6	73.6	76.2*	61.2
Nyströmformer	49.4	65.6	44.5	75.0	78.3*	62.6
EVA	46.1	64.0	45.3	73.4	78.2	61.4
Transformer	44.7	65.8	46.0	75.6	79.1	62.2
TaylorShift (ours)	47.6	66.2	46.1	75.0	79.3	62.8

This shows TaylorShift’s consistent performance across various datasets. It surpasses the standard Transformer on 4 out of 5 datasets and achieves the highest average accuracy, positioning itself as a good all-rounder model. We observe a notable increase of $4.3$ percentage points when transitioning from size Ti to S on ImageNet, in contrast to $3.5$ for the Transformer, which highlights TaylorShift’s scalability.

Number of attention heads

We train TaylorShift models on the pixel-level CIFAR10 task to see how accuracy and efficiency change. All models have the default $d_\text{emb} = 256$ with $d = \frac{d_\text{emb}}{h}$ in the attention mechanism. The default is $h = 4$.

$h$	$d$	Acc [%]	dir-TS TP [ims/s]	dir-TS Mem [MiB@16]	eff-TS TP [ims/s]	eff-TS Mem [MiB@16]
4	64	47.1	12060	596	2975	840
8	32	47.5	7657	1111	5749	585
16	16	47.3	4341	2135	9713	459
32	8	46.9	2528	4187	14087	397
64	4	45.9	1235	8291	13480	125

We see that increasing the number of attention heads $h$ increases the speed and decreases the memory needed by efficient-TaylorShift, as predicted. Additionally, we find that it also increases the performance up to a certain point. Up to that point, we have a win-win-win situation with a faster, more lightweight, and more accurate model. Beyond it, the number of heads can be used to trade off accuracy against the amount of compute needed.

Conclusion & Outlook

We introduced TaylorShift, a novel efficient Transformer model. It offers significant computational advantages without sacrificing performance. By approximating the exponential function, TaylorShift achieves linear time and memory complexity, making it ideal for long sequences. Our experiments demonstrate its superiority over standard Transformers in terms of speed, memory efficiency, and even accuracy.

As we move forward, we envision TaylorShift opening up new possibilities for tackling challenging tasks involving lengthy sequences. From high-resolution image processing to large-scale document analysis, TaylorShift’s efficiency and versatility make it a promising tool for the future of efficient Transformer models.

For more details, see the or the .

References

K.M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J.Q. Davis, A. Mohiuddin, L. Kaiser, D.B. Belanger, L.J. Colwell, and A. Weller. “Rethinking attention with performers”. ICLR 2021.
N. Kitaev, L. Kaiser, and A. Levskaya. “Reformer: The efficient transformer”. ICLR 2020.
H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N.A. Smith, and L. Kong. “Random feature attention”. ICLR 2021.
Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler. “Long range arena: A benchmark for efficient transformers”. ICLR 2021.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin. “Attention is all you need”. NeurIPS 2017.
S. Wang, B.Z. Li, M. Khabsa, H. Fang, and H. Ma. “Linformer: Self-attention with linear complexity”. arXiv preprint 2020.
Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, and V. Singh. “Nyströmformer: A nyström-based algorithm for approximating self-attention”. AAAI 2021.
L. Zheng, J. Yuan, C. Wang, and L. Kong. “Efficient attention via control variates”. ICLR 2023.
