Topic
model evaluation
VibeThinker-3B: Small Language Model Matches Giants in Verifiable Reasoning, According to arXiv Paper
A new technical report on arXiv introduces VibeThinker-3B, a compact 3B-parameter language model whose scores on verifiable reasoning benchmarks are comparable to those of models orders of magnitude larger, including DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Post-trained with a Spectrum-to-Signal paradigm, the model scores 94.3 on AIME26 and 80.2% Pass@1 on LiveCodeBench v6.
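For readers unfamiliar with the Pass@1 metric quoted above, the sketch below shows the widely used unbiased pass@k estimator for code benchmarks. The summary does not say how many samples per problem the report drew or how it aggregated them, so the function name `pass_at_k` and the counts in the usage note are illustrative assumptions, not the report's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated for the problem
    c: samples that passed all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are failures), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of passing samples, so a hypothetical problem with 3 passes out of 10 samples contributes 0.3 to the benchmark average.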
New Method Resolves Drift Attribution Ambiguity in LLM Evaluation Pipelines
A research paper introduces an anytime-valid attribution method for LLM evaluation pipelines that distinguishes score shifts caused by product drift from those caused by changes to the judge model. Using a fixed human-labeled anchor set and betting e-processes, the method produced no misattributions on silent version bumps and correctly attributed prompt changes in 110 of 120 runs, while the industry-default rolling z-test raised false alarms on 75% of drift-free streams.
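To make the idea of a betting e-process concrete, here is a minimal generic sketch, not the paper's algorithm. It assumes each observation is a value in [0, 1], for example 1 when the judge agrees with the human label on an anchor item and 0 otherwise, and tests whether the mean still equals a baseline rate `m0`. The function name, the fixed bet size `lam`, and the two-sided averaging are illustrative choices; the summary does not describe the paper's actual betting strategy or attribution logic.

```python
def betting_e_process(xs, m0, lam=0.5, alpha=0.05):
    """Two-sided betting test of H0: E[x] == m0 for x in [0, 1].

    Returns (stop, wealth): stop is the 1-based index at which wealth
    first reaches 1/alpha, or None if it never does.
    """
    if not 0.0 < m0 < 1.0:
        raise ValueError("m0 must lie strictly between 0 and 1")
    # Cap the bets so every multiplicative factor stays nonnegative.
    lam_up = min(lam, 1.0 / m0)            # bets that the mean rose
    lam_down = min(lam, 1.0 / (1.0 - m0))  # bets that the mean fell
    w_up = w_down = 1.0
    wealth = 1.0
    threshold = 1.0 / alpha
    for t, x in enumerate(xs, start=1):
        if not 0.0 <= x <= 1.0:
            raise ValueError("observations must lie in [0, 1]")
        w_up *= 1.0 + lam_up * (x - m0)
        w_down *= 1.0 - lam_down * (x - m0)
        # The average of two e-processes is itself an e-process.
        wealth = 0.5 * (w_up + w_down)
        if wealth >= threshold:
            return t, wealth
    return None, wealth
```

Under the null hypothesis the wealth is a nonnegative martingale starting at 1, so by Ville's inequality the chance that it ever reaches 1/alpha is at most alpha, no matter when the stream is inspected. That is the sense in which such a test is anytime-valid, whereas a fixed-sample test such as a z-test carries no such guarantee when it is re-run on a rolling window.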