Artificial Intelligence #vision transformers #image captioning
Sub-Quadratic Vision Transformers Cut Self-Attention Cost for Faster Image Captioning
A new arXiv preprint from Ghosh et al. proposes a sub-quadratic vision transformer architecture for image captioning. By replacing standard self-attention with a Gaussian Mixture Model (GMM) clustering mechanism, the model reduces computational complexity from O(n²), which is quadratic in the number of input tokens n, to O(nK), which is linear in n for a fixed number of mixture components K. The approach uses an autoregressive GPT-based decoder and achieves competitive results on the Flickr30K dataset.
Jun 16, 2026 1 source
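
To make the O(n²) versus O(nK) contrast concrete, here is a minimal NumPy sketch of cluster-based token mixing. It is an illustration under stated assumptions, not the preprint's implementation: the function name `gmm_cluster_mixing`, the isotropic Gaussians with a shared `sigma`, the equal component weights, and the fixed component means are all hypothetical choices made for this example.

```python
import numpy as np

def gmm_cluster_mixing(x, means, sigma=1.0):
    """Mix n tokens through K Gaussian components instead of n x n attention.

    x:     (n, d) token embeddings
    means: (K, d) component means (assumed given; a real model would learn or fit them)
    sigma: shared isotropic standard deviation (an assumption of this sketch)
    """
    # Squared distance from every token to every component mean: (n, K)
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    # Soft assignments (responsibilities) under equal-weight isotropic Gaussians
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)      # each row sums to 1
    # Cluster summaries: responsibility-weighted mean of the tokens, (K, d)
    weights = resp / (resp.sum(axis=0, keepdims=True) + 1e-9)
    summaries = weights.T @ x
    # Each token reads back a mixture of the K summaries: (n, d)
    return resp @ summaries, resp

rng = np.random.default_rng(0)
n, d, K = 196, 64, 8                      # e.g. a 14 x 14 patch grid, 8 components
tokens = rng.normal(size=(n, d))
means = tokens[rng.choice(n, size=K, replace=False)]
out, resp = gmm_cluster_mixing(tokens, means)
```

With these illustrative sizes, the token-to-component table has 196 × 8 = 1,568 entries, whereas a full pairwise attention matrix would have 196 × 196 = 38,416. The gap widens as n grows while K stays fixed.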