model compression

4 stories

Artificial Intelligence #kl divergence #attention distillation

StreamKL Delivers up to 43× Speedup in Memory-Efficient Attention Distillation

Researchers propose StreamKL, a fused GPU primitive for computing Kullback-Leibler divergence in attention distillation. It avoids materializing quadratic-size intermediates in memory, yielding speedups of up to 43× in the forward pass and up to 14× in the backward pass, and reduces the extra HBM footprint to O(1).

Jun 21, 2026 1 source
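
To make the memory claim in the StreamKL summary concrete, here is a minimal NumPy sketch of one way a KL divergence between teacher and student attention rows can be accumulated block by block, without ever holding the full query-by-key matrices. It relies on the standard online-softmax rescaling trick and is an illustration only: the function names, the `block` parameter, and the choice of KL(teacher || student) are assumptions, not the paper's kernel, which is a fused GPU primitive.

```python
import numpy as np

def log_softmax(x):
    # Row-wise log-softmax, shifted by the row max for numerical stability.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def dense_kl(q_t, k_t, q_s, k_s):
    # Reference: materializes both (n_q, n_k) attention matrices.
    t = q_t @ k_t.T / np.sqrt(q_t.shape[1])
    s = q_s @ k_s.T / np.sqrt(q_s.shape[1])
    log_p, log_q = log_softmax(t), log_softmax(s)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=1)

def streaming_kl(q_t, k_t, q_s, k_s, block=64):
    # Hypothetical streaming variant: only (n_q, block) logits exist at a time.
    # Uses KL = sum_j p_j (t_j - s_j) - logsumexp(t) + logsumexp(s).
    n_q, n_k = q_t.shape[0], k_t.shape[0]
    scale_t = 1.0 / np.sqrt(q_t.shape[1])
    scale_s = 1.0 / np.sqrt(q_s.shape[1])
    m_t = np.full(n_q, -np.inf)  # running max of teacher logits
    m_s = np.full(n_q, -np.inf)  # running max of student logits
    z_t = np.zeros(n_q)          # running sum of exp(t - m_t)
    z_s = np.zeros(n_q)          # running sum of exp(s - m_s)
    acc = np.zeros(n_q)          # running sum of exp(t - m_t) * (t - s)
    for start in range(0, n_k, block):
        t = (q_t @ k_t[start:start + block].T) * scale_t
        s = (q_s @ k_s[start:start + block].T) * scale_s
        new_m_t = np.maximum(m_t, t.max(axis=1))
        new_m_s = np.maximum(m_s, s.max(axis=1))
        # Rescale old accumulators to the new max (factor is 0 on the first block).
        corr_t = np.exp(m_t - new_m_t)
        corr_s = np.exp(m_s - new_m_s)
        w = np.exp(t - new_m_t[:, None])
        z_t = z_t * corr_t + w.sum(axis=1)
        acc = acc * corr_t + (w * (t - s)).sum(axis=1)
        z_s = z_s * corr_s + np.exp(s - new_m_s[:, None]).sum(axis=1)
        m_t, m_s = new_m_t, new_m_s
    return acc / z_t - (m_t + np.log(z_t)) + (m_s + np.log(z_s))
```

Per query row, the state carried between blocks is five scalars regardless of the number of keys. In this sketch the per-block logits are still allocated as arrays; a fused kernel would presumably keep them in on-chip memory.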

New Drift-RAE Method Distills Transformers Efficiently Using Representation Autoencoders

Technology

Artificial Intelligence #transformers #representation autoencoders

A new research paper proposes Drift-RAE, a method for distilling pretrained flow models in representation autoencoder (RAE) latent spaces. It overcomes anisotropy and large-curvature challenges, reaching an FID of 1.77 on ImageNet 256 with only 10,000 distillation steps and outperforming existing RAE distillation methods.

Jun 16, 2026 1 source

Lightweight Hardware-Aware Neural Architecture Search Enables CNNs on Ultra-Low-Power Microcontrollers

Technology

Artificial Intelligence #neural architecture search #hardware-aware

A new hardware-aware neural architecture search (HW-NAS) method generates tiny convolutional neural networks (CNNs) suitable for ultra-low-power microcontrollers, using a lightweight search procedure that can execute on embedded devices. Empirical results on three tiny computer vision benchmarks show it preserves state-of-the-art classification accuracy, addressing the power limitations of sensing nodes.

Jun 16, 2026 1 source

New Automated Quantization Framework AQ4SViT Compresses Spiking Vision Transformers for Embedded AI

Technology

Artificial Intelligence #ai #quantization

Researchers propose AQ4SViT, an automated quantization framework for Spiking Vision Transformers that uses a search gating policy to find optimal compression settings. It comes in two variants: greedy search for speed and beam search for deeper compression. Experimental results on ImageNet show up to 6.6× faster search and up to 90% memory savings while keeping accuracy within 1.5% of the original model.

Jun 16, 2026 1 source