Topic
reasoning
Limited Marginal Benefit of Reasoning-Heavy LLMs in ESG Scoring: Study on Japanese Firms
A 4-model consensus study on 10 Japanese listed firms found that reasoning-heavy LLMs add little value over cheaper alternatives in ESG narrative scoring: the mean absolute deviation was only 0.38 on a 5-point scale, while the cost was 5.6x higher.
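To make the two headline figures concrete, here is a minimal sketch of how a mean absolute deviation and a cost ratio could be computed. It assumes the deviation is measured between one model's scores and a consensus score per firm, which is only one plausible reading of the summary. The function names and the sample scores are hypothetical and are not taken from the study.

```python
from statistics import mean


def mean_absolute_deviation(scores_a, scores_b):
    """Average absolute gap between two equally long score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    return mean(abs(a - b) for a, b in zip(scores_a, scores_b))


def cost_ratio(cost_expensive, cost_cheap):
    """How many times more the expensive model costs than the cheap one."""
    return cost_expensive / cost_cheap


# Hypothetical 5-point ESG scores for five firms (illustrative only).
reasoning_scores = [4, 3, 5, 2, 4]
consensus_scores = [4.0, 3.5, 4.5, 2.5, 4.0]

mad = mean_absolute_deviation(reasoning_scores, consensus_scores)
ratio = cost_ratio(5.6, 1.0)
print(f"MAD = {mad:.2f} points, cost ratio = {ratio:.1f}x")
```

A small deviation paired with a large cost ratio is the pattern the study reports: the expensive model rarely changes the score by much.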
AdaMame: New Training Recipe Solves Language Collapse in Multilingual Reasoning Models
AdaMame, a two-stage training recipe for multilingual mathematical reasoning, addresses language collapse in large reasoning models. It adaptively aligns reasoning language to the query language without compromising accuracy, achieving Pareto-optimal performance across 12 languages.
New Self-Enhanced Fine-Tuning Method Boosts Text-to-SQL Reasoning and Generalization
Researchers propose CoTE-SQL, a self-enhanced fine-tuning method that improves text-to-SQL generation by integrating reasoning traces, structured chain-of-thought prompting, and execution error correction. The approach achieves state-of-the-art results on Bird and Spider benchmarks, particularly on complex queries.
New Benchmark IRTS-ToolBench Tests LLMs on Irregular Time Series Question Answering
A research paper introduces IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains to evaluate large language models (LLMs) and AI agents on irregular time series question answering (TSQA). The benchmark addresses a gap in existing TSQA benchmarks that assume regular sampling, providing standardized inputs and a reproducible evaluation protocol for verifiable agentic data science.
Latent Thought Flow: Efficient Reasoning in LLMs Cuts Cost and Boosts Accuracy
Researchers propose Latent Thought Flow (LTF), a method that models LLM reasoning as continuous trajectories in latent space, using GFlowNet and entropy-weighted objectives. LTF outperforms explicit Chain-of-Thought and latent reasoning baselines, achieving 9.5% higher accuracy while cutting reasoning length by 27.2%. The method targets the linguistic bottleneck that inflates inference costs.
Think-at-Hard: Selective Latent Iterations Boost LLM Reasoning Accuracy by Up to 6.8%
A new research paper proposes Think-at-Hard (TaH), a looped transformer that selectively performs latent iterations only on tokens likely to be incorrect. By skipping iterations on 93% of tokens, TaH outperforms always-iterate models by 3.8-4.4% and single-iteration baselines by up to 6.8%, while requiring negligible extra parameters.
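The selective idea can be illustrated with a toy sketch: spend a second pass only on tokens flagged as hard and leave the rest untouched. This is not the TaH architecture, which iterates on latent states inside a looped transformer. The confidence scores, the `refine` callback, and the threshold below are all hypothetical stand-ins.

```python
def selective_iterate(tokens, confidences, refine, threshold=0.5):
    """Apply `refine` only to tokens whose confidence is below `threshold`.

    Returns the processed tokens and the fraction of tokens that were skipped.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    output = []
    refined_count = 0
    for token, confidence in zip(tokens, confidences):
        if confidence < threshold:
            # Hard token: spend the extra iteration here.
            output.append(refine(token))
            refined_count += 1
        else:
            # Easy token: keep the first-pass result as is.
            output.append(token)
    skipped_fraction = 1 - refined_count / len(tokens) if tokens else 0.0
    return output, skipped_fraction


# Hypothetical example: upper-casing stands in for a second iteration.
result, skipped = selective_iterate(
    ["a", "b", "c", "d"], [0.9, 0.2, 0.8, 0.95], str.upper
)
print(result, skipped)
```

The skipped fraction is the quantity the summary reports as 93% for TaH; in this toy run it is simply whatever share of tokens clears the threshold.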
CycliST Benchmark Reveals Video Language Models Struggle with Cyclical State Transitions
The CycliST benchmark, introduced by a team of researchers, evaluates Video Language Models on cyclical state transitions. Results show current VLMs struggle to detect and reason about periodic patterns, with no single model performing consistently across all tasks.
PrologMCP: A Standardized Prolog Tool Interface That Boosts LLM Agents’ Deductive Accuracy
A team of researchers introduced PrologMCP, an open-source server that exposes Prolog as a stateful tool through the Model Context Protocol, allowing LLM agents to delegate deductive reasoning tasks. In evaluations on the PARARULE-Plus benchmark, an agent powered by PrologMCP achieved accuracy of 1.00 on a general sample, matching or exceeding reasoning LLMs, and 1.00/0.99 on a challenging subset where reasoning models dropped to 0.95/0.94.
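To show why delegating deduction to a symbolic engine gives exact answers, here is a minimal forward-chaining sketch over propositional rules. It is a swapped-in illustration in Python: it does not use PrologMCP or the Model Context Protocol, and Prolog itself resolves queries by backward chaining rather than this loop. The facts and rules are hypothetical.

```python
def forward_chain(facts, rules):
    """Derive every conclusion reachable from `facts` under `rules`.

    Each rule is a (premises, conclusion) pair; a rule fires once all of its
    premises are known. Repeats until no new fact can be added.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rule base, loosely in the style of rule-reasoning benchmarks.
facts = {"bob_is_big", "bob_is_strong"}
rules = [
    (("bob_is_big", "bob_is_strong"), "bob_is_heavy"),
    (("bob_is_heavy",), "bob_is_slow"),
    (("bob_is_smart",), "bob_is_quick"),
]

derived = forward_chain(facts, rules)
print(sorted(derived))
```

Because the engine applies rules mechanically, a multi-step chain such as big and strong, then heavy, then slow cannot be skipped or misremembered, which is the failure mode that lowers accuracy for free-form reasoning models on deeper problems.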
The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning
A research paper identifies a 'Quality-Utility Paradox' in mathematical reasoning distillation: data refined by a stronger (Oracle) model receives high reward scores, yet small models trained on it perform worse than those trained on their own self-generated traces. The authors propose Style-Aligned Refinement to preserve native reasoning patterns while incorporating logical corrections.
Semi-Supervised Framework Scales LLM Reasoning Using 10-15x Fewer Labels Than Traditional Methods
A new semi-supervised framework for training LLM reasoning uses a lightweight verifier to judge reasoning quality, requiring only a few labeled samples. Experiments on math problems and visual question answering show accuracy comparable to that of training with 10-15x more labeled data. The method could reduce the cost of building large-scale reasoning datasets.
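A verifier-gated loop of this kind can be sketched as a filter over unlabeled reasoning traces: only traces the verifier rates highly are kept as pseudo-labels. This is a generic pseudo-labeling sketch, not the framework from the paper. The `toy_verifier` heuristic, the threshold, and the sample traces are hypothetical.

```python
def select_pseudo_labels(candidates, verifier, threshold=0.8):
    """Keep (question, trace, score) triples the verifier scores >= threshold."""
    kept = []
    for question, trace in candidates:
        score = verifier(question, trace)
        if score >= threshold:
            kept.append((question, trace, score))
    return kept


def toy_verifier(question, trace):
    """Hypothetical stand-in: favor traces that state an explicit answer."""
    return 0.9 if "answer:" in trace.lower() else 0.1


# Hypothetical unlabeled traces produced by a model.
candidates = [
    ("What is 2 + 2?", "2 plus 2 equals 4. Answer: 4"),
    ("What is 3 * 3?", "Not sure about this one."),
]

pseudo_labeled = select_pseudo_labels(candidates, toy_verifier)
print(pseudo_labeled)
```

In a real system the verifier would be a small trained model fitted on the few labeled samples, and the accepted traces would be fed back into training.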