Image captioning requires a deep understanding of visual content and the ability to generate natural language descriptions. Transformer-based vision architectures have advanced the field, but their standard self-attention mechanism suffers from quadratic computational complexity O(n²) with respect to the number of image patches, limiting scalability and speed.
A new arXiv preprint by Chiradeep Ghosh and Dakshina Ranjan Kisku proposes a model that addresses this bottleneck by restructuring the vision transformer architecture. Instead of computing pairwise attention among all image patches, the model takes a probabilistic transformer approach based on a Gaussian Mixture Model (GMM), a soft-clustering technique in which each patch belongs to every cluster with some probability.
GMM-Based Clustering Reduces Complexity
The core innovation replaces standard self-attention with an Expectation-Maximization (EM) algorithm that softly assigns similar image patches to a fixed number of clusters. Because each of the n patches is compared against K cluster components rather than against every other patch, the computational complexity drops from quadratic O(n²) to linear O(nK), where K is the number of clusters and K << n. This makes the architecture suitable for high-resolution images or real-time applications where traditional vision transformers may be too slow.
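The soft-clustering step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the spherical covariance, the random initialization, the fixed iteration count, and the function name `gmm_em_soft_cluster` are all assumptions made for illustration. The point is that the E-step and M-step only ever build an n-by-K table, never an n-by-n one.

```python
import numpy as np

def gmm_em_soft_cluster(patches, k, n_iters=10, seed=0):
    """Soft-cluster patch embeddings with a spherical GMM fitted by EM.

    patches: (n, d) array of patch embeddings.
    Returns responsibilities of shape (n, k) and cluster means of shape (k, d).
    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = patches.shape
    # Initialise means from k distinct patches, with shared variance and uniform weights.
    means = patches[rng.choice(n, size=k, replace=False)].copy()
    variances = np.full(k, patches.var() + 1e-6)
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iters):
        # E-step: log of (mixture weight * Gaussian density) for each patch/cluster pair.
        sq_dist = ((patches[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_prob = (np.log(weights)
                    - 0.5 * d * np.log(2 * np.pi * variances)
                    - 0.5 * sq_dist / variances)
        log_prob -= log_prob.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_prob)
        resp /= resp.sum(axis=1, keepdims=True)  # each row sums to 1

        # M-step: re-estimate weights, means, and variances from the soft assignments.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ patches) / nk[:, None]
        sq_dist = ((patches[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq_dist).sum(axis=0) / (d * nk) + 1e-6

    return resp, means

# Example: 196 random "patch embeddings" of dimension 16, grouped into 8 clusters.
example_patches = np.random.default_rng(1).normal(size=(196, 16))
resp, means = gmm_em_soft_cluster(example_patches, k=8)
```

Each iteration costs on the order of n·K·d operations for embedding dimension d, which is linear in the number of patches when K and d are fixed.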
| Aspect | Standard Self-Attention | Proposed GMM Clustering |
|---|---|---|
| Complexity | O(n²) | O(nK) with K << n |
| Mechanism | Pairwise attention among all patches | Soft-clustering via EM algorithm |
| Scalability | Poor for large n | Linear, more scalable |
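To make the gap between the two complexity columns concrete, the short calculation below counts the scores each approach would compute. The patch size of 16 and the choice of K = 8 are illustrative assumptions; the source does not state the values used in the paper.

```python
def attention_pair_count(n: int) -> int:
    """Patch-to-patch scores computed by full self-attention: n * n."""
    return n * n

def cluster_assignment_count(n: int, k: int) -> int:
    """Patch-to-cluster responsibilities computed in one EM pass: n * k."""
    return n * k

K_CLUSTERS = 8   # hypothetical; K is not given in the source
PATCH_SIZE = 16  # hypothetical; a common ViT patch size

for side in (224, 448):
    n_patches = (side // PATCH_SIZE) ** 2
    pairs = attention_pair_count(n_patches)
    assignments = cluster_assignment_count(n_patches, K_CLUSTERS)
    print(f"{side}x{side}: n={n_patches}, n^2={pairs}, nK={assignments}, "
          f"ratio={pairs / assignments:.1f}")
```

Under these assumptions, doubling the image side quadruples n, so the n² term grows 16-fold while the nK term grows only 4-fold.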
The model uses an autoregressive GPT-based decoder for caption generation, leveraging the language modelling strengths of the GPT architecture.
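Autoregressive generation means the caption is produced one token at a time, with each new token chosen given the tokens emitted so far. The sketch below shows that loop with greedy selection. The source does not say which decoding strategy the paper uses, and `toy_scores` is a hypothetical stand-in for a GPT decoder that would, in a real system, also condition on the visual features.

```python
from typing import Callable, List, Sequence

def greedy_decode(next_token_scores: Callable[[Sequence[int]], Sequence[float]],
                  bos_id: int, eos_id: int, max_len: int = 20) -> List[int]:
    """Generate token ids one at a time, always picking the highest-scoring next token."""
    tokens = [bos_id]
    for _ in range(max_len):
        scores = next_token_scores(tokens)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:  # stop once the end-of-sequence token is produced
            break
    return tokens

# Toy vocabulary and scorer (assumptions for illustration, not a real language model).
VOCAB = ["<bos>", "a", "dog", "runs", "<eos>"]

def toy_scores(prefix: Sequence[int]) -> List[float]:
    """Favour the vocabulary entry that follows the last emitted token."""
    target = min(prefix[-1] + 1, len(VOCAB) - 1)
    return [1.0 if i == target else 0.0 for i in range(len(VOCAB))]

caption_ids = greedy_decode(toy_scores, bos_id=0, eos_id=4)
caption = " ".join(VOCAB[i] for i in caption_ids[1:-1])  # drop <bos> and <eos>
```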
Evaluation on Flickr30K Dataset
The model was evaluated on the Flickr30K dataset, which contains over 31,000 images with five captions each. The paper reports competitive results and a significant improvement over existing work, though the source does not give specific numerical metrics.
The authors state that existing transformer-based approaches often suffer from a lack of rich local feature representations and the high computational cost of quadratic self-attention. Their clustering approach addresses both limitations by grouping patches into clusters, which can capture local patterns more effectively while reducing compute.
Implications for Efficient Vision AI
For enterprise technology leaders exploring vision AI in logistics or supply chain (for example automated inspection, document processing, or warehouse monitoring), the reduced complexity could translate to lower inference latency and lower hardware requirements. Processing more image patches without quadratic blowup makes deployment on edge devices or in high-throughput pipelines more practical. The paper's reported competitive performance on the Flickr30K benchmark suggests that sub-quadratic attention is achievable without sacrificing captioning quality, although the source provides no figures to quantify this.