Sub-Quadratic Vision Transformers Cut Self-Attention Cost for Faster Image Captioning

A new arXiv preprint from Ghosh et al. proposes a sub-quadratic vision transformer architecture for image captioning. By replacing standard self-attention with a Gaussian Mixture Model (GMM) clustering mechanism, the model reduces computational complexity from quadratic O(n²) to linear O(nK). The approach uses an autoregressive GPT-based decoder and achieves competitive results on the Flickr30K dataset.

iGEN Editorial

June 16, 2026


Image captioning requires a deep understanding of visual content and the ability to generate natural language descriptions. Transformer-based vision architectures have advanced the field, but their standard self-attention mechanism suffers from quadratic computational complexity O(n²) with respect to the number of image patches, limiting scalability and speed.

According to a new arXiv preprint by Chiradeep Ghosh and Dakshina Ranjan Kisku, the proposed model addresses this bottleneck by restructuring the vision transformer architecture. Instead of computing pairwise attention among all image patches, the model applies a probabilistic approach based on a Gaussian Mixture Model (GMM), a soft-clustering technique.

GMM-Based Clustering Reduces Complexity

The core innovation replaces standard self-attention with soft clustering fitted by an Expectation-Maximization (EM) algorithm, which groups similar image patches into a fixed number of clusters. The computational complexity drops from quadratic O(n²) to O(nK), where n is the number of image patches, K is the number of clusters, and K << n, so the cost grows linearly in n for a fixed K. This makes the architecture a candidate for high-resolution images or real-time applications where traditional vision transformers may be too slow.
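To make the clustering idea concrete, here is a minimal sketch of EM soft clustering over patch feature vectors. It is not the authors' implementation: the function name `em_soft_cluster`, the isotropic (single-variance) Gaussians, the random initialisation, and the iteration count are all assumptions chosen for brevity. The point it illustrates is that each EM iteration evaluates n × K patch-to-cluster responsibilities rather than n × n patch-to-patch scores.

```python
import math
import random


def em_soft_cluster(patches, k, iters=10, seed=0):
    """Soft-cluster patch vectors with an isotropic GMM fitted by EM.

    Hypothetical helper for illustration only. Returns (resp, means), where
    resp[i][j] is the responsibility of cluster j for patch i.
    """
    rng = random.Random(seed)
    n, d = len(patches), len(patches[0])
    means = [list(p) for p in rng.sample(patches, k)]  # assumed init: random patches
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: n * k log-density evaluations (no patch-to-patch comparisons).
        resp = []
        for x in patches:
            logp = []
            for j in range(k):
                sq = sum((xi - mi) ** 2 for xi, mi in zip(x, means[j]))
                logp.append(
                    math.log(weights[j])
                    - 0.5 * d * math.log(2 * math.pi * variances[j])
                    - 0.5 * sq / variances[j]
                )
            top = max(logp)  # subtract the max for numerical stability
            exps = [math.exp(v - top) for v in logp]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate each cluster from its softly assigned patches.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            means[j] = [
                sum(r[j] * x[i] for r, x in zip(resp, patches)) / nj
                for i in range(d)
            ]
            sq = sum(
                r[j] * sum((xi - mi) ** 2 for xi, mi in zip(x, means[j]))
                for r, x in zip(resp, patches)
            )
            variances[j] = max(sq / (nj * d), 1e-6)  # floor avoids collapse
            weights[j] = nj / n
    return resp, means
```

Calling `em_soft_cluster(patch_vectors, k=8)` on a list of equal-length feature vectors returns one row of K responsibilities per patch, each row summing to 1. How the paper feeds these cluster summaries into later layers is not described in the source, so that step is left out here.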

Aspect	Standard Self-Attention	Proposed GMM Clustering
Complexity	O(n²)	O(nK) with K << n
Mechanism	Pairwise attention among all patches	Soft-clustering via EM algorithm
Scalability	Poor for large n	Linear, more scalable
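A quick back-of-the-envelope count shows how the two growth rates in the table diverge. The image size, patch size, and cluster count below are illustrative assumptions, not figures from the paper, and the helper names are hypothetical. The count covers only the dominant term (scores per attention layer versus responsibilities per EM iteration) and ignores the feature dimension and the number of EM iterations.

```python
def patch_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


def pairwise_scores(n: int) -> int:
    """Attention scores for full self-attention: every patch against every patch."""
    return n * n


def cluster_responsibilities(n: int, k: int) -> int:
    """Patch-to-cluster responsibilities for one EM pass: every patch against K clusters."""
    return n * k


if __name__ == "__main__":
    k = 16  # assumed cluster count, not taken from the paper
    for image_size in (224, 448):  # assumed resolutions
        n = patch_count(image_size, patch_size=16)
        print(image_size, n, pairwise_scores(n), cluster_responsibilities(n, k))
```

Under these assumptions, a 224×224 image split into 16×16 patches yields 196 patches: 38,416 pairwise scores against 3,136 responsibilities. Doubling the resolution to 448×448 yields 784 patches, which quadruples the clustering count to 12,544 but multiplies the pairwise count by sixteen to 614,656.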

The model uses an autoregressive GPT-based decoder for caption generation, leveraging the language modelling strengths of the GPT architecture.
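For readers unfamiliar with the term, "autoregressive" means the decoder emits one token at a time, each conditioned on the image features and on the tokens generated so far. The sketch below shows that loop with greedy selection. The `next_token_scores` callable stands in for a GPT-style decoder and is purely hypothetical, as are the token markers and the length limit; the source does not say which decoding strategy the paper uses.

```python
from typing import Callable, Dict, List, Sequence


def greedy_decode(
    next_token_scores: Callable[[Sequence[float], List[str]], Dict[str, float]],
    image_features: Sequence[float],
    bos: str = "<bos>",
    eos: str = "<eos>",
    max_len: int = 20,
) -> List[str]:
    """Generate a caption token by token, always taking the top-scoring token.

    next_token_scores is a hypothetical stand-in for the decoder: given the
    image features and the tokens so far, it returns a score per candidate token.
    """
    tokens = [bos]
    for _ in range(max_len):
        scores = next_token_scores(image_features, tokens)
        best = max(scores, key=scores.get)
        if best == eos:  # the decoder signals that the caption is complete
            break
        tokens.append(best)
    return tokens[1:]  # drop the start marker
```

Any function with the same signature can be plugged in for experimentation, for example a stub that replays a fixed caption.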

Evaluation on Flickr30K Dataset

The model was evaluated on the Flickr30K dataset, which contains over 31,000 images with five captions each. The paper reports competitive results and a significant improvement over existing work, though specific numerical metrics are not detailed in the source.

The authors state that existing transformer-based approaches often suffer from a lack of rich local feature representations and the high computational cost of quadratic self-attention. Their clustering approach addresses both limitations by grouping patches into clusters, which can capture local patterns more effectively while reducing compute.

Implications for Efficient Vision AI

For enterprise technology leaders exploring vision AI in logistics or supply chain, such as automated inspection, document processing, or warehouse monitoring, the reduced complexity could translate to lower inference latency and hardware requirements. Processing more image patches without quadratic blowup may make deployment on edge devices or in high-throughput pipelines more practical. The paper's reported results suggest that sub-quadratic attention is achievable without sacrificing captioning quality, based on its competitive performance on the Flickr30K benchmark.

