transformer

9 stories

Artificial Intelligence #transformer #feed-forward

Transformer Feed-Forward Block Linearity: Learned, Not Architectural, According to New Research

A new study introduces R^2_lin, a measure of how linear a transformer feed-forward block is. Across models such as GPT-2 and Pythia-160m, R^2_lin varies widely and is not determined by the activation function. The findings point to specific blocks as compression targets and reveal pitfalls in training linear baselines.

Jun 20, 2026 1 source

NeuronFabric Architecture Cuts Memory for On-Chip Transformer Training, Promises Efficient Edge AI

Technology

Artificial Intelligence #neuronfabric #software

A new software reference architecture called NeuronFabric, detailed in an arXiv paper by Evgeny Ukladchikov, demonstrates on-chip transformer training with local Adam updates. The BF16W variant reduces memory requirements by approximately 16.5% compared to FP32, from 4.0 MB to 3.34 MB for a 334K-parameter model, enabling deployment on Xilinx ZCU102 devices. The C# prototype produces coherent text with loss comparable to an FP32 GPU reference.

Jun 16, 2026 1 source

Why Low-Precision Transformer Training Fails: Research Explains Flash Attention Instability

Technology

Artificial Intelligence #low-precision #transformer

A new paper from researchers Qiu and Yao provides the first mechanistic explanation of why low-precision training with flash attention fails catastrophically. The authors identify two intertwined phenomena—emergent low-rank representations and biased rounding errors—and introduce a minimal modification that stabilizes training.

Jun 16, 2026 1 source

Reservoir Attention Network: Cross-Pass State in Pretrained Transformers via Content-Addressable Reservoir Injection

Technology

Artificial Intelligence #reservoir attention network #cross-pass state

The Reservoir Attention Network (RAN) injects a fixed, randomly initialized reservoir into the mid-layer attention of pretrained transformers to carry state across forward passes. Experiments with GPT-2 and Qwen2.5 on a single consumer GPU show that cross-pass state is feasible; the broader vision of an always-alive agent is left as future work.

Jun 16, 2026 1 source

Controlled Dynamics Attractor Transformer: New Model Targets Graph Anomaly Detection with Biologically Plausible Attention

Technology

Artificial Intelligence #transformer #deep learning

Researchers propose the Controlled Dynamics Attractor Transformer (CDAT), which integrates a mixture von Mises-Fisher attention energy with Hopfield refinement and excitation-inhibition modulation from neural attractor models. The model achieves state-of-the-art results on graph anomaly detection and classification benchmarks, offering potential for detecting fraud, cyber threats, and operational anomalies in supply chain networks.

Jun 16, 2026 1 source

Parallel Hybrid Architecture Combines GSS and Attention for Efficient Long-Context Language Modeling

Technology

Artificial Intelligence #long-context #transformer

Researchers propose the Parallel Hybrid Architecture (PHA), which runs Gated State Spaces (GSS), Grouped Query Attention, and Feed-Forward Networks as parallel branches fused by a learnable mixing mechanism. On WikiText-103, PHA reaches a perplexity (PPL) of 16.51 at 125M parameters, outperforming comparable models, and scales to 180M parameters with a PPL of 16.42 while delivering 24% higher throughput and up to 40% lower memory usage.

Jun 16, 2026 1 source

PolyKV: Layer-Wise KV Cache Compression Boosts LLM Inference Efficiency by Up to 54.5%

Technology

Artificial Intelligence #kv cache #compression

PolyKV is a new framework for compressing the key-value (KV) cache in large language model inference. It selects a compression policy for each transformer layer and allocates non-uniform cache budgets across layers, outperforming uniform approaches. On LongBench tasks, PolyKV recovers 40% to 54.5% of the performance gap between the strongest single-policy baseline and the full KV cache.

Jun 16, 2026 1 source

VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

Technology

Artificial Intelligence #video anomaly detection #deformable attention

A new AI framework, VigilFormer, uses deformable attention and causal inference to detect anomalies in surveillance video at 41.5 FPS, outperforming prior methods on three benchmarks.

Jun 16, 2026 1 source

NVIDIA Open-Sources Nemotron 3 Ultra: 550B-Parameter Hybrid Mamba-Transformer Model for Agentic AI

Technology

Artificial Intelligence #ai #machine learning

NVIDIA introduced Nemotron 3 Ultra, a Mixture-of-Experts language model with 550 billion total parameters and a hybrid Mamba-Attention architecture. Only 55 billion parameters are active per inference. Pre-trained on 20 trillion tokens and supporting a context length of 1 million tokens, the model achieves up to 6x higher inference throughput than state-of-the-art public LLMs while matching their accuracy. All checkpoints, training data, and recipes are open-sourced on Hugging Face.

Jun 16, 2026 2 sources