Topic
transformer
Parallel Hybrid Architecture Combines GSS and Attention for Efficient Long-Context Language Modeling
Researchers propose the Parallel Hybrid Architecture (PHA), combining Gated State Spaces, Grouped Query Attention, and Feed-Forward Networks in parallel branches fused by a learnable mixing mechanism. On WikiText-103, PHA achieves 16.51 PPL at 125M parameters, outperforming comparable models, and scales to 180M parameters with 16.42 PPL while delivering 24% higher throughput and up to 40% lower memory usage.
PolyKV: Layer-Wise KV Cache Compression Boosts LLM Inference Efficiency by Up to 54.5%
PolyKV is a new framework for compressing the key-value cache in large language model inference. It selects a compression policy per transformer layer and allocates non-uniform cache budgets, outperforming uniform approaches. On LongBench tasks, PolyKV recovers 40%-54.5% of the performance gap between the strongest single-policy baseline and full KV cache.
VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference
A new AI framework, VigilFormer, uses deformable attention and causal inference to detect anomalies in surveillance video at 41.5 FPS, outperforming prior methods on three benchmarks.
NVIDIA Open-Sources Nemotron 3 Ultra: 550B-Parameter Hybrid Mamba-Transformer Model for Agentic AI
NVIDIA introduced Nemotron 3 Ultra, a 550 billion total parameter Mixture-of-Experts language model with a hybrid Mamba-Attention architecture. Only 55 billion parameters are active per inference. Pre-trained on 20 trillion tokens and supporting a 1 million token context length, the model achieves up to 6x higher inference throughput versus state-of-the-art public LLMs while matching accuracy. All checkpoints, training data, and recipes are open-sourced on HuggingFace.