VibeThinker-3B: Small Language Model Matches Giants in Verifiable Reasoning, According to arXiv Paper

A new technical report on arXiv introduces VibeThinker-3B, a compact 3B-parameter language model that reportedly achieves verifiable reasoning scores comparable to models orders of magnitude larger, including DeepSeek V3.2, GLM-5, and Gemini 3 Pro. The model uses a Spectrum-to-Signal post-training paradigm and scores 94.3 on AIME26 and 80.2% Pass@1 on LiveCodeBench v6.

iGEN Editorial

June 16, 2026

A compact language model with only 3 billion parameters is matching the verifiable reasoning performance of models many times its size, according to a technical report published on arXiv. The model, named VibeThinker-3B, was developed to explore how far reasoning capabilities can be pushed within a strictly small-model regime.

Model Architecture and Training Pipeline

VibeThinker-3B is a dense model with 3 billion parameters, built upon the Spectrum-to-Signal post-training paradigm. The paper describes an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. According to the authors, this combination systematically enhances the model's reasoning ability while keeping the parameter count low.

Benchmark Performance and Comparative Analysis

Experimental evaluations reported in the paper show frontier-level performance on demanding verifiable tasks. The table below summarizes key scores:

Benchmark	Score	Notes
AIME26	94.3	Improves to 97.1 with claim-level test-time scaling
LiveCodeBench v6	80.2% Pass@1	-
LeetCode unseen contests	96.1% acceptance rate	Out-of-distribution generalization
IFEval	93.4	Measures instruction controllability

The paper states that these results place VibeThinker-3B "in the performance band of first-tier reasoning systems," matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. The authors present the IFEval score of 93.4 as evidence that the reasoning-focused training does not compromise strict instruction controllability.
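
For readers unfamiliar with the Pass@1 metric reported for LiveCodeBench v6, the sketch below shows the unbiased pass@k estimator commonly used in code-generation evaluation. This article does not say how the paper computed its score, so this is an illustration of the general metric and not the authors' evaluation code. The function names and the sample counts in the usage lines are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated for the problem
    c: samples that passed all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each entry is (n, c) for one problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (samples drawn, samples correct).
example_results = [(10, 3), (10, 10), (10, 0)]
example_score = benchmark_pass_at_k(example_results, k=1)
```

With k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over all problems, so a Pass@1 of 80.2% means that a single sampled solution is correct about four times out of five on average.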

The Parametric Compression-Coverage Hypothesis

Extending the authors' previous work on a 1.5B model, the report introduces the Parametric Compression-Coverage Hypothesis. This view posits that verifiable reasoning can be compressed into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. The paper suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

The authors of the report are Xu, Sen; Liu, Shixi; Wang, Wei; Min, Jixin; Dai, Yingwei; Zhibin; Chen, Yirong; Zhou, Xin; and Zhang, Junlin. The full paper is available under a CC Zero license on arXiv with identifier 2606.16140.
