ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation

Abstract

Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation scheme that addresses these challenges and explicitly incorporates inductive biases, which are commonly part of the neural network architecture, into the training data. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds, enabling fine-grained control over image composition during training. It thus increases the data diversity and the effective number of training samples. We demonstrate that training on ForNet, the application of ForAug to ImageNet, significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet and 7.3 p.p. on downstream tasks. Importantly, ForAug enables novel ways of analyzing model behavior and quantifying biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that training on ForNet substantially reduces these biases compared to training on ImageNet. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models.

Publication
arXiv preprint


Introduction

Image classification – teaching computers to label images – is a cornerstone of AI vision, powering everything from medical diagnosis to autonomous driving. Datasets like ImageNet have been crucial, especially with the rise of powerful models like Vision Transformers (ViTs).

However, unlike older Convolutional Neural Networks (CNNs), ViTs don’t inherently understand that an object remains the same regardless of its position in an image (they lack “translation equivariance”). Standard data augmentation techniques (like flipping or cropping) help, but they weren’t specifically designed for this trait of ViTs.

To tackle these problems, we propose ForAug, a novel data augmentation for ViTs. The core idea? Make the spatial relationships explicit in the training data. ForAug achieves this by:

  1. Separating foreground objects from their backgrounds in the dataset.
  2. Recombining these objects with different backgrounds on-the-fly during training.
  3. Controlling the object’s size and position during this recombination.

This process creates ForNet, a dynamic version of ImageNet where models see the same objects against varied backgrounds and in different locations epoch after epoch.

The results? Training ViTs on ForNet instead of standard ImageNet boosts accuracy by up to 4.5 percentage points on ImageNet classification and significantly cuts error rates (up to 39.3% reduction) on downstream tasks.

Furthermore, ForAug provides powerful new ways to analyze model biases. Researchers can now precisely measure:

  • Background Robustness: How much does the background influence the prediction?
  • Foreground Focus: Does the model correctly focus on the main object?
  • Center & Size Bias: Is the model overly reliant on objects being centered or a specific size?

Training on ForNet demonstrably reduces these biases, leading to more robust models.

This post is just a short overview of ForAug and ForNet. For more information, see the paper PDF.

ForAug (Method)

ForAug Flowchart
So, how does ForAug actually build these dynamic training images? The process, visualized above, involves two main stages: an offline Segmentation stage and an online Recombination stage.

Segmentation

The process kicks off with the Segmentation stage, a one-time, offline preparation step performed before model training even begins. Think of it as carefully prepping the visual ingredients. Here, we leverage the state-of-the-art Grounded SAM segmentation model, guiding it with the known class label of each image (e.g., instructing it to specifically find the ‘golden retriever’) to precisely isolate the main subject. Once the foreground object is digitally ‘cut out’, an object removal or ‘inpainting’ model intelligently fills the resulting hole in the original background, ensuring the backdrop looks natural and plausible. Crucially, not all generated assets make the cut; a filtering step employs other pre-trained AI models to assess quality. This ensures only clearly defined foregrounds and clean backgrounds – ones that don’t inadvertently give away the object’s identity or look overly artificial – are selected. This meticulous preparation yields the core assets for ForAug: a collection of ready-to-use foreground objects (with transparency) and a diverse pool of cleaned-up backgrounds.
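To make this pipeline concrete, here is a minimal sketch of the offline Segmentation stage. The Grounded SAM, inpainting, and quality-filtering calls are wrapped in hypothetical placeholder functions, since this post does not spell out the exact models, prompts, or thresholds used; only the cut-out and saving logic is implemented.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def grounded_sam_segment(image: Image.Image, prompt: str) -> np.ndarray:
    """Placeholder: run Grounded SAM with the class label as text prompt
    and return a binary foreground mask of shape (H, W)."""
    raise NotImplementedError


def inpaint_background(image: Image.Image, mask: np.ndarray) -> Image.Image:
    """Placeholder: remove the masked object and fill the hole with an
    inpainting model so the background looks natural on its own."""
    raise NotImplementedError


def passes_quality_filter(foreground: Image.Image, background: Image.Image) -> bool:
    """Placeholder: reject unclear foregrounds and backgrounds that still
    give away the object class or look overly artificial."""
    raise NotImplementedError


def build_assets(image_path: Path, class_label: str, out_dir: Path) -> None:
    """Turn one dataset image into a transparent foreground and a clean background."""
    image = Image.open(image_path).convert("RGB")
    mask = grounded_sam_segment(image, prompt=class_label)  # (H, W), values in {0, 1}

    # Cut out the foreground: the mask becomes the alpha channel of an RGBA image.
    rgba = np.dstack([np.asarray(image), (mask * 255).astype(np.uint8)])
    foreground = Image.fromarray(rgba, mode="RGBA")

    background = inpaint_background(image, mask)

    # Only keep asset pairs that pass the quality filter.
    if passes_quality_filter(foreground, background):
        foreground.save(out_dir / f"{image_path.stem}_fg.png")
        background.save(out_dir / f"{image_path.stem}_bg.jpg")
```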

Recombination

With the assets prepared, the real action unfolds during the Recombination stage, which happens dynamically online while the Vision Transformer is training. This is where the ForNet dataset truly comes alive, creating new training examples on the fly. For every foreground object prepared in the first stage, the system randomly selects a background to pair it with. This background might be the object’s original one, perhaps one from another image belonging to the same object class, or even a completely unrelated background drawn from the entire dataset to maximize contextual variety. The chosen foreground object is then randomly resized (within sensible limits based on its original appearance) and placed at a random position onto this background canvas. To create a more seamless integration, a subtle smoothing effect is applied to the object’s edges where it meets the new background. Only after this dynamic composition is complete does the resulting image undergo the standard data augmentation techniques commonly used in AI training, like random color shifts or minor flips. This constant mixing-and-matching means that each time the AI cycles through the training data, it encounters familiar objects in entirely new visual contexts. This directly forces the ViT to learn robust features that identify the object itself, effectively teaching it the spatial invariance that doesn’t come built-in, by demonstrating repeatedly that appearance, not specific placement or background, is what defines the object.
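A simplified sketch of the online Recombination step is shown below. The background-sampling choice, the resize range, and the edge-smoothing radius are illustrative assumptions rather than the paper's exact settings; standard augmentations would be applied to the returned image afterwards.

```python
import random

from PIL import Image, ImageFilter


def sample_background(fg_idx: int, background_paths: list, labels: list) -> Image.Image:
    """Pick the original background, one from the same class, or any background."""
    mode = random.choice(["original", "same_class", "any"])  # assumed uniform choice
    if mode == "original":
        return Image.open(background_paths[fg_idx])
    if mode == "same_class":
        candidates = [i for i, y in enumerate(labels) if y == labels[fg_idx]]
    else:
        candidates = list(range(len(background_paths)))
    return Image.open(background_paths[random.choice(candidates)])


def recombine(foreground: Image.Image, background: Image.Image) -> Image.Image:
    """Paste an RGBA foreground onto a background at a random size and position."""
    background = background.convert("RGB").copy()
    bg_w, bg_h = background.size

    # Randomly rescale the foreground within (assumed) sensible limits.
    scale = random.uniform(0.5, 1.5)
    fg_w = max(1, min(bg_w, int(foreground.width * scale)))
    fg_h = max(1, min(bg_h, int(foreground.height * scale)))
    fg = foreground.resize((fg_w, fg_h))

    # Slightly blur the alpha mask so the pasted edges blend into the background.
    alpha = fg.getchannel("A").filter(ImageFilter.GaussianBlur(radius=2))

    # Place the object at a random position; standard augmentations follow this step.
    x = random.randint(0, bg_w - fg_w)
    y = random.randint(0, bg_h - fg_h)
    background.paste(fg, (x, y), mask=alpha)
    return background
```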

Experiments

Image Classification Results

We compare training on ForNet to training on ImageNet for 6 different models:

ImageNet results
We find that training on ForNet increases the accuracy of every model, by up to 4.5 percentage points. It also combats the overfitting problem of larger models.

Downstream Results
When fine-tuning these models on 5 fine-grained downstream datasets, we find that the ForNet-pretrained models consistently outperform the ImageNet-pretrained ones, especially the transformer-based models.

Model Robustness

We also evaluate multiple robustness metrics.

Background Robustness

We check the background robustness of models by measuring how the accuracy changes when evaluating on ForNet with backgrounds from all classes compared to backgrounds from the same class: $$ \text{Background Robustness}(M) = \frac{\text{Acc}[M; \textit{ForNet}(\text{all})]}{\text{Acc}[M; \textit{ForNet}(\text{same})]} $$
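As a minimal illustration, the metric is just the ratio of the two measured accuracies (the values below are hypothetical, not results from the paper):

```python
def background_robustness(acc_all: float, acc_same: float) -> float:
    """Accuracy with backgrounds from all classes divided by accuracy with
    backgrounds from the same class; 1.0 (100%) means the background
    distribution does not influence the predictions."""
    return acc_all / acc_same


# Hypothetical accuracy values, purely for illustration:
print(f"{background_robustness(0.80, 0.82):.1%}")  # -> 97.6%
```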

Background Robustness Scores
Training on ForNet increases the Background Robustness to 100%, meaning that the model is independent of the choice of (natural) background distribution.

Foreground Focus

Since we have the foreground segmentation masks, we can also investigate the foreground focus of the trained models. For this, we utilize different input-importance metrics like GradCAM and IntegratedGradients (IG). We define a model's foreground focus by how much more it focuses on the foreground object compared to a uniform importance distribution: $$ \text{FG Focus}(M; \text{img}) = \frac{\text{Area}(\text{img}) \cdot \text{Importance}_M(\text{fg})}{\text{Area}(\text{fg}) \cdot \text{Importance}_M(\text{img})} $$
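Given a per-pixel importance map and the binary foreground mask, the metric reduces to a straightforward ratio. A small sketch, assuming the attribution map has already been computed (e.g. with GradCAM or IG):

```python
import numpy as np


def foreground_focus(importance_map: np.ndarray, fg_mask: np.ndarray) -> float:
    """FG Focus = (Area(img) * Importance(fg)) / (Area(fg) * Importance(img)).

    importance_map: per-pixel attribution scores, shape (H, W).
    fg_mask: binary foreground mask, shape (H, W).
    A score of 1 corresponds to a uniform importance distribution; larger
    scores mean the model concentrates its importance on the foreground.
    """
    area_img = importance_map.size
    area_fg = fg_mask.sum()
    importance_img = importance_map.sum()
    importance_fg = importance_map[fg_mask.astype(bool)].sum()
    return float(area_img * importance_fg) / float(area_fg * importance_img)
```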

Foreground Focus Scores
We find that training on ForNet significantly improves the foreground focus of most models, especially ViT.

Center Bias

Since we can freely change the object’s position and size, we can evaluate the model bias when the position changes. For this, we subdivide the image into $3 \times 3$ sections (nonants) and place each object only in one nonant. We then compare the accuracy of a model when an object is in a specific nonant to when it’s in the center nonant.

Our center-bias score is defined as the mean of (1) the worst accuracy in a corner nonant and (2) the worst accuracy in an edge nonant, relative to the accuracy in the center nonant.
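Assuming the per-nonant accuracies have already been measured, the score is a small computation over a 3 × 3 grid; a sketch:

```python
import numpy as np


def center_bias_score(nonant_acc: np.ndarray) -> float:
    """nonant_acc: 3x3 array of accuracies, one per nonant (image ninth).

    Returns the mean of the worst corner accuracy and the worst edge accuracy,
    relative to the accuracy in the center nonant.
    """
    corners = nonant_acc[[0, 0, 2, 2], [0, 2, 0, 2]]
    edges = nonant_acc[[0, 1, 1, 2], [1, 0, 2, 1]]
    center = nonant_acc[1, 1]
    return float(corners.min() + edges.min()) / (2.0 * center)
```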

Center Bias Table

We visualize the center bias for 3 instantiations of each model. Training on ForNet (instead of ImageNet) significantly reduces the center bias, especially for larger transformers. We also find that when training on ImageNet, models consistently perform better when an object is on the right side of an image than on the left side (even though we use 50% random flipping during training for all models).

Size Bias

We vary the object size by an additional factor of $f_\text{size}$ to see how the model accuracy changes relative to $f_\text{size} = 1$.
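A sketch of this sweep, where `evaluate_on_fornet` is a hypothetical hook (not part of the released code) that rebuilds ForNet validation images with the foreground rescaled by the given factor and returns top-1 accuracy:

```python
from typing import Dict, Sequence


def evaluate_on_fornet(model, size_factor: float) -> float:
    """Hypothetical hook: evaluate `model` on ForNet validation images whose
    foregrounds are rescaled by `size_factor`, returning top-1 accuracy."""
    raise NotImplementedError


def size_bias_sweep(model, factors: Sequence[float] = (0.25, 0.5, 0.75, 1.0, 1.25, 1.5)) -> Dict[float, float]:
    """Accuracy change relative to f_size = 1 for each size factor."""
    baseline = evaluate_on_fornet(model, size_factor=1.0)
    return {f: evaluate_on_fornet(model, size_factor=f) - baseline for f in factors}
```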

Size Bias Plot

Using ForNet significantly reduces the accuracy drop-off when going towards smaller objects. These gains come on top of the overall better accuracy (at $f_\text{size} = 1$).

Conclusion

So, what’s the big takeaway from ForAug and the resulting ForNet dataset? This research introduces a genuinely novel data augmentation scheme designed specifically to enhance how Vision Transformers learn to classify images. By cleverly separating foreground objects from their backgrounds and dynamically recombining them during training, ForAug tackles a key limitation of Transformer models head-on: their lack of built-in translation equivariance.

As the results clearly demonstrate, this dynamic approach pays off significantly. Training models on ForNet leads to substantial performance boosts on the standard ImageNet benchmark and translates to impressive gains on related fine-grained classification tasks downstream.

But the impact of ForAug extends beyond just improving accuracy scores. It also provides a powerful and much-needed framework for analyzing model behavior and uncovering hidden biases. Crucially, the experiments show that training on ForNet doesn’t just highlight these biases – it actively reduces them. This results in models that are not only more accurate but also more robust, reliable, and generalizable to varied real-world conditions.

Associated Projects: SEmbedAI, SustAInML, Albatross

Tobias Christian Nauen
PhD Student

My research interests include efficiency of machine learning models, multimodal learning, and transformer models.