<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Article | Tobias Nauen</title><link>https://nauen-it.de/publication_types/article/</link><atom:link href="https://nauen-it.de/publication_types/article/index.xml" rel="self" type="application/rss+xml"/><description>Article</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 23 Feb 2026 00:00:00 +0000</lastBuildDate><image><url>https://nauen-it.de/media/icon.svg</url><title>Article</title><link>https://nauen-it.de/publication_types/article/</link></image><item><title>When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators</title><link>https://nauen-it.de/publications/when-pretty-isnt-useful/</link><pubDate>Mon, 23 Feb 2026 00:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/when-pretty-isnt-useful/</guid><description>
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>PRISM: Diversifying Dataset Distillation by Decoupling Architectural Priors</title><link>https://nauen-it.de/publications/prism/</link><pubDate>Thu, 13 Nov 2025 13:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/prism/</guid><description>
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>HyperCore: Coreset Selection under Noise via Hypersphere Models</title><link>https://nauen-it.de/publications/hypercore/</link><pubDate>Fri, 26 Sep 2025 13:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/hypercore/</guid><description>
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>SubZeroCore: A Submodular Approach with Zero Training for Coreset Selection</title><link>https://nauen-it.de/publications/subzerocore/</link><pubDate>Fri, 26 Sep 2025 13:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/subzerocore/</guid><description>
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation</title><link>https://nauen-it.de/publications/foraug/</link><pubDate>Wed, 12 Mar 2025 00:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/foraug/</guid><description>&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img alt="gif"
src="https://nauen-it.de/publications/foraug/images/foraug-gif.gif"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h1 id="introduction"&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Image classification – teaching computers to label images – is a cornerstone of AI vision, powering everything from medical diagnosis to autonomous driving.
Datasets like ImageNet have been crucial, especially with the rise of powerful models like Vision Transformers (ViTs).&lt;/p&gt;
&lt;p&gt;However, unlike older Convolutional Neural Networks (CNNs), ViTs don&amp;rsquo;t inherently understand that an object remains the same regardless of its position in an image (they lack &amp;ldquo;translation equivariance&amp;rdquo;).
Standard data augmentation techniques (like flipping or cropping) help, but they weren&amp;rsquo;t specifically designed for this trait of ViTs.&lt;/p&gt;
&lt;p&gt;To tackle these problems, we propose &lt;strong&gt;ForAug&lt;/strong&gt;, a novel data augmentation for ViTs.
The core idea?
Make the spatial relationships explicit in the training data.
ForAug achieves this by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Separating foreground objects from their backgrounds in the dataset.&lt;/li&gt;
&lt;li&gt;Recombining these objects with different backgrounds on-the-fly during training.&lt;/li&gt;
&lt;li&gt;Controlling the object&amp;rsquo;s size and position during this recombination.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The results?
Training ViTs with ForAug instead of standard ImageNet boosts accuracy by up to 4.5 percentage points on ImageNet classification and significantly cuts error rates (up to 39.3% reduction) on downstream tasks.&lt;/p&gt;
&lt;p&gt;Furthermore, ForAug provides powerful new ways to analyze model biases.
Researchers can now precisely measure:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Background Robustness: How much does the background influence the prediction?&lt;/li&gt;
&lt;li&gt;Foreground Focus: Does the model correctly focus on the main object?&lt;/li&gt;
&lt;li&gt;Center &amp;amp; Size Bias: Is the model overly reliant on objects being centered or a specific size?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Training with ForAug demonstrably reduces these biases, leading to more robust models.&lt;/p&gt;
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;This post is just a short overview over ForAug. For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h1 id="foraug-method"&gt;ForAug (Method)&lt;/h1&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="ForAug Flowchart"
srcset="https://nauen-it.de/publications/foraug/images/fig-2_hu_2f0955c903beee0e.webp 320w, https://nauen-it.de/publications/foraug/images/fig-2_hu_5313138d6618e3b9.webp 480w, https://nauen-it.de/publications/foraug/images/fig-2_hu_70ceae1cbeb25b52.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nauen-it.de/publications/foraug/images/fig-2_hu_2f0955c903beee0e.webp"
width="760"
height="291"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
So, how does ForAug actually build these dynamic training images?
The process, visualized above, involves two main stages: an offline &lt;strong&gt;Segmentation&lt;/strong&gt; stage and an online &lt;strong&gt;Recombination&lt;/strong&gt; stage.&lt;/p&gt;
&lt;h2 id="segmentation"&gt;Segmentation&lt;/h2&gt;
&lt;p&gt;The process kicks off with the Segmentation stage, a one-time, offline preparation step performed before model training even begins.
Think of it as carefully prepping the visual ingredients.
Here, we leverage the state-of-the-art Grounded SAM segmentation model, guiding it with the known class label of each image (e.g., instructing it to specifically find the &amp;lsquo;golden retriever&amp;rsquo;) to precisely isolate the main subject.
Once the foreground object is digitally &amp;lsquo;cut out&amp;rsquo;, an object removal or &amp;lsquo;inpainting&amp;rsquo; model intelligently fills the resulting hole in the original background, ensuring the backdrop looks natural and plausible.
Crucially, not all generated assets make the cut; a filtering step employs other pre-trained AI models to assess quality.
This ensures only clearly defined foregrounds and clean backgrounds – ones that don&amp;rsquo;t inadvertently give away the object&amp;rsquo;s identity or look overly artificial – are selected.
This meticulous preparation yields the core assets for ForAug: a collection of ready-to-use foreground objects (with transparency) and a diverse pool of cleaned-up backgrounds.&lt;/p&gt;
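&lt;p&gt;As a rough illustration, a minimal sketch of this offline preparation could look as follows. The three model callables are placeholders for Grounded SAM, the inpainting model, and the quality filter; their names and signatures are assumptions made for this sketch, not the actual ForAug code.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pathlib import Path


def prepare_assets(dataset, segment_foreground, inpaint_background, passes_filter, out_dir):
    """Offline Segmentation stage (sketch).

    dataset yields (PIL image, class name) pairs; segment_foreground,
    inpaint_background and passes_filter wrap the segmentation, inpainting
    and filtering models (hypothetical interfaces for this sketch).
    """
    out = Path(out_dir)
    (out / "foregrounds").mkdir(parents=True, exist_ok=True)
    (out / "backgrounds").mkdir(parents=True, exist_ok=True)
    for idx, (image, class_name) in enumerate(dataset):
        # 1. Prompt the segmentation model with the known class label.
        mask = segment_foreground(image, prompt=class_name)
        # 2. Keep the cut-out object as an RGBA image with transparency.
        fg = image.convert("RGBA")
        fg.putalpha(mask)
        # 3. Fill the hole in the original image to get a clean background.
        bg = inpaint_background(image, mask)
        # 4. Store only pairs that pass the quality filter.
        if passes_filter(fg, bg, class_name):
            fg.save(out / "foregrounds" / f"{idx}_{class_name}.png")
            bg.save(out / "backgrounds" / f"{idx}_{class_name}.png")
&lt;/code&gt;&lt;/pre&gt;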
&lt;h2 id="recombination"&gt;Recombination&lt;/h2&gt;
&lt;p&gt;With the assets prepared, the real action unfolds during the Recombination stage, which happens dynamically online while the Vision Transformer is training.
This is where ForAug truly comes alive, creating new training examples on the fly.
For every foreground object prepared in the first stage, the system randomly selects a background to pair it with.
This background might be the object&amp;rsquo;s original one, perhaps one from another image belonging to the same object class, or even a completely unrelated background drawn from the entire dataset to maximize contextual variety.
The chosen foreground object is then randomly resized (within sensible limits based on its original appearance) and placed at a random position onto this background canvas.
To create a more seamless integration, a subtle smoothing effect is applied to the object&amp;rsquo;s edges where it meets the new background.
Only after this dynamic composition is complete does the resulting image undergo the standard data augmentation techniques commonly used in AI training, like random color shifts or minor flips.
This constant mixing-and-matching means that each time the AI cycles through the training data, it encounters familiar objects in entirely new visual contexts.
This directly forces the ViT to learn robust features that identify the object itself, effectively teaching it the spatial invariance that doesn&amp;rsquo;t come built-in, by demonstrating repeatedly that appearance, not specific placement or background, is what defines the object.&lt;/p&gt;
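&lt;p&gt;To make this concrete, here is a minimal sketch of such an online recombination step using PIL. The scale range, blur radius, and function name are illustrative assumptions, not the exact settings or code used by ForAug.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random

from PIL import ImageFilter


def recombine(foreground, backgrounds, min_scale=0.5, max_scale=1.5, edge_blur=3):
    """Paste an RGBA foreground cut-out onto a randomly chosen background (sketch)."""
    bg = random.choice(backgrounds).convert("RGB").copy()

    # Randomly rescale the object around its original size.
    scale = random.uniform(min_scale, max_scale)
    w, h = foreground.size
    fg = foreground.resize((max(1, int(w * scale)), max(1, int(h * scale))))

    # Pick a random position that keeps the object inside the canvas.
    x = random.randint(0, max(0, bg.width - fg.width))
    y = random.randint(0, max(0, bg.height - fg.height))

    # Slightly blur the alpha mask so the edges blend into the new background.
    alpha = fg.getchannel("A").filter(ImageFilter.GaussianBlur(edge_blur))
    bg.paste(fg, (x, y), mask=alpha)
    return bg  # standard augmentations (flips, color jitter, ...) come afterwards
&lt;/code&gt;&lt;/pre&gt;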
&lt;h1 id="experiments"&gt;Experiments&lt;/h1&gt;
&lt;h2 id="image-classification-results"&gt;Image Classification Results&lt;/h2&gt;
&lt;p&gt;We compare training on ImageNet with and without ForAug for 10 different models:
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="ImageNet results"
srcset="https://nauen-it.de/publications/foraug/images/foraug-imagenet-results_hu_d7381cc18cb8227f.webp 320w, https://nauen-it.de/publications/foraug/images/foraug-imagenet-results_hu_9a346107ce9df01c.webp 480w, https://nauen-it.de/publications/foraug/images/foraug-imagenet-results_hu_851c625e34531f5c.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nauen-it.de/publications/foraug/images/foraug-imagenet-results_hu_d7381cc18cb8227f.webp"
width="760"
height="647"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
We find that training with ForAug increases the accuracy of every model, by up to 4.5 percentage points.
It also combats the overfitting of larger models.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Downstream Results"
srcset="https://nauen-it.de/publications/foraug/images/foraug-downstream-results_hu_bc13a772bd62ba95.webp 320w, https://nauen-it.de/publications/foraug/images/foraug-downstream-results_hu_4093f7149dcda2cb.webp 480w, https://nauen-it.de/publications/foraug/images/foraug-downstream-results_hu_67cdcd2209a42509.webp 671w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nauen-it.de/publications/foraug/images/foraug-downstream-results_hu_bc13a772bd62ba95.webp"
width="671"
height="760"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
When finetuning these models on 5 fine-grained downstream datasets, we find that the ForAug-pretrained models consistently outperform the ImageNet-pretrained ones, especially the transformer-based models.&lt;/p&gt;
&lt;h2 id="model-robustness"&gt;Model Robustness&lt;/h2&gt;
&lt;p&gt;We also evaluate multiple robustness metrics.&lt;/p&gt;
&lt;h3 id="background-robustness"&gt;Background Robustness&lt;/h3&gt;
&lt;p&gt;We check the background robustness of models by inspecting the change in accuracy when evaluating with ForAug using backgrounds from the &lt;em&gt;same&lt;/em&gt; class compared to backgrounds from &lt;em&gt;all&lt;/em&gt; classes:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Background Robustness Scores"
srcset="https://nauen-it.de/publications/foraug/images/foraug-background-robustness_hu_2489c03de26af797.webp 320w, https://nauen-it.de/publications/foraug/images/foraug-background-robustness_hu_dc3b8b16d1a5fc3a.webp 480w, https://nauen-it.de/publications/foraug/images/foraug-background-robustness_hu_289f170ed90e6dd3.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nauen-it.de/publications/foraug/images/foraug-background-robustness_hu_2489c03de26af797.webp"
width="760"
height="224"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
Training with ForAug reduces the &lt;em&gt;Background Gap&lt;/em&gt; for all transformer models.&lt;/p&gt;
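&lt;p&gt;One plausible way to compute such a gap, sketched below: evaluate the same model on ForAug-recombined validation images once with same-class backgrounds and once with backgrounds drawn from all classes, and take the difference in accuracy. The two validation loaders are assumed to exist; only a generic accuracy helper is shown.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch


@torch.no_grad()
def accuracy(model, loader, device="cuda"):
    """Plain top-1 accuracy of a classifier over a dataloader."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds.cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total


# same_class_bg_loader / all_class_bg_loader are hypothetical ForAug validation
# loaders; the background gap is the accuracy difference between the two settings:
# background_gap = accuracy(model, same_class_bg_loader) - accuracy(model, all_class_bg_loader)
&lt;/code&gt;&lt;/pre&gt;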
&lt;h3 id="foreground-focus"&gt;Foreground Focus&lt;/h3&gt;
&lt;p&gt;Since we have the foreground segmentation masks, we can also investigate the foreground focus of the trained models.
For this, we utilize different input-importance metrics like GradCAM and IntegratedGradients (IG).
We define a model&amp;rsquo;s foreground focus as how much more it focuses on the foreground object than a uniform importance distribution over the image would:
&lt;/p&gt;
$$
\text{FG Focus}(M; \text{img}) = \frac{\text{Area}(\text{img}) \cdot \text{Importance}_M(\text{fg})}{\text{Area}(\text{fg}) \cdot \text{Importance}_M(\text{img})}
$$&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Foreground Focus Scores"
srcset="https://nauen-it.de/publications/foraug/images/foraug-foreground-focus_hu_1afeb71ecf915471.webp 320w, https://nauen-it.de/publications/foraug/images/foraug-foreground-focus_hu_4efb61ce3baf1f8b.webp 480w, https://nauen-it.de/publications/foraug/images/foraug-foreground-focus_hu_467c7e1ebb906659.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nauen-it.de/publications/foraug/images/foraug-foreground-focus_hu_1afeb71ecf915471.webp"
width="760"
height="204"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
We find that training with ForAug improves the foreground focus of all models, in most cases significantly.&lt;/p&gt;
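&lt;p&gt;As a small worked example of the formula above, the score can be computed from a per-pixel importance map (e.g. from GradCAM or IG) and the binary foreground mask; this sketch assumes both are already given as arrays of the same shape.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np


def foreground_focus(importance, fg_mask):
    """FG Focus = (Area(img) * Importance(fg)) / (Area(fg) * Importance(img)).

    importance: non-negative per-pixel importance map (H x W).
    fg_mask: boolean foreground mask of the same shape.
    A score of 1 means the model attends to the object exactly as much as a
    uniform importance distribution would; larger values mean stronger focus.
    """
    importance = np.asarray(importance, dtype=np.float64)
    fg_mask = np.asarray(fg_mask, dtype=bool)
    area_img = importance.size
    area_fg = fg_mask.sum()
    return (area_img * importance[fg_mask].sum()) / (area_fg * importance.sum())
&lt;/code&gt;&lt;/pre&gt;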
&lt;h3 id="center-bias"&gt;Center Bias&lt;/h3&gt;
&lt;p&gt;Since we can freely change the object&amp;rsquo;s position and size, we can evaluate the model bias when the position changes.
For this, we subdivide the image into $3 \times 3$ sections (nonants) and place each object only in one nonant.
We then compare the accuracy of a model when an object is in a specific nonant to when it&amp;rsquo;s in the center nonant.&lt;/p&gt;
&lt;p&gt;Our center-bias score is defined as the mean of (1) the worst accuracy in a corner and (2) the worst accuracy on an edge, relative to the accuracy in the center.
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Center Bias Table"
srcset="https://nauen-it.de/publications/foraug/images/foraug-center-bias_hu_4292720d3de7ff2c.webp 320w, https://nauen-it.de/publications/foraug/images/foraug-center-bias_hu_86b74bbd53280813.webp 480w, https://nauen-it.de/publications/foraug/images/foraug-center-bias_hu_1c2231712f30f607.webp 483w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nauen-it.de/publications/foraug/images/foraug-center-bias_hu_4292720d3de7ff2c.webp"
width="483"
height="760"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
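&lt;p&gt;Given the per-nonant accuracies on the 3x3 grid, the score defined above can be computed in a few lines; the grid layout used here is the only assumption.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np


def center_bias(acc_grid):
    """Center-bias score from a 3x3 grid of per-nonant accuracies.

    acc_grid[i][j] is the accuracy when the object is placed in nonant (i, j),
    with (1, 1) being the center. The score is the mean of the worst corner
    accuracy and the worst edge accuracy, relative to the center accuracy.
    """
    acc = np.asarray(acc_grid, dtype=np.float64)
    corners = acc[[0, 0, 2, 2], [0, 2, 0, 2]]
    edges = acc[[0, 1, 1, 2], [1, 0, 2, 1]]
    return 0.5 * (corners.min() + edges.min()) / acc[1, 1]
&lt;/code&gt;&lt;/pre&gt;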
&lt;p&gt;We visualize the center bias for 3 instantiations of each model.
Training with ForAug significantly reduces the center bias, especially for larger transformers.
We also find that when training on ImageNet, models consistently perform better when an object is on the right side of an image than on the left side (even though all models are trained with 50% random horizontal flipping).&lt;/p&gt;
&lt;h3 id="size-bias"&gt;Size Bias&lt;/h3&gt;
&lt;p&gt;We vary the object size by an additional factor of $f_\text{size}$ to see how the model accuracy changes relative to $f_\text{size} = 1$.
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Size Bias Plot"
srcset="https://nauen-it.de/publications/foraug/images/foraug-size-bias_hu_6d10bfb297ada5cb.webp 320w, https://nauen-it.de/publications/foraug/images/foraug-size-bias_hu_7f5b922edf83b92b.webp 480w, https://nauen-it.de/publications/foraug/images/foraug-size-bias_hu_13974a67c0bc523a.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://nauen-it.de/publications/foraug/images/foraug-size-bias_hu_6d10bfb297ada5cb.webp"
width="760"
height="377"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Using ForAug significantly reduces the accuracy drop-off for smaller objects.
These gains come on top of the overall better accuracy (at $f_\text{size} = 1$).&lt;/p&gt;
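&lt;p&gt;The evaluation behind this plot can be sketched as a simple sweep. The loader factory and its interface are hypothetical, the accuracy helper is the one from above, and the listed factors are illustrative values rather than the ones used in the paper.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def size_bias_curve(model, make_loader, accuracy, size_factors=(0.3, 0.5, 0.7, 1.0, 1.3)):
    """Accuracy relative to f_size = 1 for a sweep of extra scaling factors.

    make_loader(f_size) builds a ForAug validation loader in which every object
    is additionally rescaled by f_size (hypothetical interface); accuracy is the
    evaluation helper defined above.
    """
    baseline = accuracy(model, make_loader(f_size=1.0))
    return {f: accuracy(model, make_loader(f_size=f)) / baseline for f in size_factors}
&lt;/code&gt;&lt;/pre&gt;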
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;So, what&amp;rsquo;s the big takeaway from ForAug?
This research introduces a genuinely novel data augmentation scheme designed specifically to enhance how Vision Transformers learn to classify images.
By cleverly separating foreground objects from their backgrounds and dynamically recombining them during training, ForAug tackles a key characteristic of Transformer models head-on.&lt;/p&gt;
&lt;p&gt;As the results clearly demonstrate, this dynamic approach pays off significantly.
Training models with ForAug leads to substantial performance boosts on the standard ImageNet benchmark and translates to impressive gains on related fine-grained classification tasks downstream.&lt;/p&gt;
&lt;p&gt;But the impact of ForAug extends beyond just improving accuracy scores.
It also provides a powerful and much-needed framework for analyzing model behavior and uncovering hidden biases.
Crucially, the experiments show that training with ForAug doesn&amp;rsquo;t just highlight these biases – it actively reduces them.
This results in models that are not only more accurate but also more robust, reliable, and generalizable to varied real-world conditions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Associated Projects:&lt;/strong&gt;
,
,
&lt;/p&gt;</description></item><item><title>A Study in Dataset Distillation for Image Super-Resolution</title><link>https://nauen-it.de/publications/dataset-distillation-sr/</link><pubDate>Wed, 05 Feb 2025 13:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/dataset-distillation-sr/</guid><description>
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Associated Projects:&lt;/strong&gt;
,
&lt;/p&gt;</description></item><item><title>Distill the Best, Ignore the Rest: Improving Dataset Distillation with Loss-Value-Based Pruning</title><link>https://nauen-it.de/publications/distill-best-ignore-rest/</link><pubDate>Mon, 18 Nov 2024 13:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/distill-best-ignore-rest/</guid><description>
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Associated Projects:&lt;/strong&gt;
,
,
&lt;/p&gt;</description></item><item><title>Just Leaf It: Accelerating Diffusion Classifiers with Hierarchical Class Pruning</title><link>https://nauen-it.de/publications/just-leaf-it/</link><pubDate>Mon, 18 Nov 2024 13:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/just-leaf-it/</guid><description>
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Associated Projects:&lt;/strong&gt;
,
,
&lt;/p&gt;</description></item><item><title>A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift</title><link>https://nauen-it.de/publications/taylor-shift-super-resolution/</link><pubDate>Fri, 15 Nov 2024 13:00:00 +0000</pubDate><guid>https://nauen-it.de/publications/taylor-shift-super-resolution/</guid><description>
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;This work builds on the
attention mechanism.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="callout flex items-baseline gap-2 px-3 py-2 mb-4 rounded-md border-l-4 bg-primary-50 dark:bg-primary-900/30 border-primary-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon shrink-0 translate-y-0.5 text-primary-600 dark:text-primary-400"&gt;
&lt;svg height="20" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content text-base dark:text-neutral-300"&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;For more information, see the
.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Associated Projects:&lt;/strong&gt;
,
,
&lt;/p&gt;</description></item></channel></rss>