Topic
vision transformers
Sub-Quadratic Vision Transformers Cut Self-Attention Cost for Faster Image Captioning
A new arXiv preprint from Ghosh et al. proposes a sub-quadratic vision transformer architecture for image captioning. By replacing standard self-attention with a Gaussian Mixture Model (GMM) clustering mechanism, the model reduces computational complexity from quadratic O(n²) to linear O(nK). The approach uses an autoregressive GPT-based decoder and achieves competitive results on the Flickr30K dataset.
New Automated Quantization Framework AQ4SViT Compresses Spiking Vision Transformers for Embedded AI
Researchers propose AQ4SViT, an automated quantization framework for Spiking Vision Transformers that uses a search gating policy to find optimal compression settings. It offers two variants: Greedy search for speed and Beam search for deeper compression. Experimental results on ImageNet show up to 6.6x faster search time and up to 90% memory savings while maintaining accuracy within 1.5% of the original model.