The promise of large language models (LLMs) as personalized tutors has generated excitement, but a new study argues that strong problem-solving ability does not guarantee effective teaching. In a paper posted on arXiv, researchers Junyi Yao, Zihao Zheng, and Baichuan Li propose a lightweight diagnostic for checking whether LLM tutors actually support learning or merely produce answers.
The Problem with LLM Tutors
As organizations increasingly deploy LLMs for training and education, the paper cautions that educational-impact evaluation should not treat task success as a sufficient proxy for learning support: stronger task-solving ability does not necessarily imply stronger learning support. This distinction matters for enterprise technology buyers evaluating LLM-based tutoring or training systems, where the goal is knowledge transfer, not just answer generation.
A Lightweight Diagnostic
The diagnostic is based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using the public MathTutorBench leaderboard, the authors analyzed eight publicly reported models. They computed composites for solving and pedagogy and found that these dimensions are only partially aligned. The correlation between the two composites is 0.421, meaning that a model's ability to solve problems correctly does not strongly predict its ability to teach effectively.
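A minimal sketch of this kind of diagnostic, under stated assumptions, looks like the following. The scores are hypothetical placeholders and not the MathTutorBench leaderboard values. The composite is assumed to be an unweighted mean of per-task scores, and the correlation is assumed to be Pearson's r; the article does not specify how the authors built their composites or which coefficient they report.

```python
from math import sqrt
from statistics import mean

# Hypothetical per-task scores in [0, 1]; NOT the real leaderboard values.
scores = {
    "model_a": {"solving": [0.90, 0.80], "pedagogy": [0.40, 0.50, 0.60]},
    "model_b": {"solving": [0.70, 0.75], "pedagogy": [0.70, 0.65, 0.60]},
    "model_c": {"solving": [0.60, 0.55], "pedagogy": [0.55, 0.60, 0.65]},
    "model_d": {"solving": [0.50, 0.45], "pedagogy": [0.30, 0.35, 0.40]},
}


def composite(values):
    """Assumed composite: the unweighted mean of the per-task scores."""
    return mean(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


models = sorted(scores)
solving = {m: composite(scores[m]["solving"]) for m in models}
pedagogy = {m: composite(scores[m]["pedagogy"]) for m in models}

# How closely the two composites move together across models.
r = pearson([solving[m] for m in models], [pedagogy[m] for m in models])

# Per-model gap: positive means the model solves better than it teaches.
gap = {m: solving[m] - pedagogy[m] for m in models}

print(f"correlation = {r:.3f}")
for m in models:
    print(f"{m}: solving={solving[m]:.3f} pedagogy={pedagogy[m]:.3f} gap={gap[m]:+.3f}")
```

With only a handful of models, any such coefficient is a rough indicator, so the per-model gaps are worth reading alongside it.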
Rankings change as well. The paper states that "several models shift meaningfully in rank when evaluation moves from solving to pedagogy," which means a ranking based solely on task accuracy can be misleading for educational applications.
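The rank comparison can be illustrated with a short sketch. The composite scores below are hypothetical, the helper names `ranks` and `rank_shifts` are introduced only for this example, and ties are broken by insertion order, which is an arbitrary choice rather than anything the paper prescribes.

```python
def ranks(score_by_model):
    """Map each model to its rank (1 = best) by descending score."""
    ordered = sorted(score_by_model, key=score_by_model.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_shifts(solving, pedagogy):
    """Solving rank minus pedagogy rank; positive = moves up under pedagogy."""
    solving_rank = ranks(solving)
    pedagogy_rank = ranks(pedagogy)
    return {m: solving_rank[m] - pedagogy_rank[m] for m in solving}


# Hypothetical composite scores; NOT the real leaderboard values.
solving = {"model_a": 0.90, "model_b": 0.75, "model_c": 0.60, "model_d": 0.55}
pedagogy = {"model_a": 0.50, "model_b": 0.70, "model_c": 0.65, "model_d": 0.40}

shifts = rank_shifts(solving, pedagogy)
for model, shift in shifts.items():
    print(f"{model}: {shift:+d}")
```

In this toy data the top solver drops two places once pedagogy is the criterion, which is the kind of movement a single accuracy-based ranking would hide.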
TutorBench and Agency-Relevant Behaviors
The researchers also analyzed the public TutorBench sample and found that agency-relevant behaviors are explicitly encoded in benchmark rubrics. These behaviors are especially prominent in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. The paper argues that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately. It also recommends making disclosure-sensitive and student-agency-preserving criteria more explicit.
| Metric | Value |
|---|---|
| Models analyzed | 8 |
| Leaderboard analyzed | MathTutorBench |
| Correlation (solving vs. pedagogy composites) | 0.421 |
| Rank shift from solving to pedagogy | Meaningful for several models |
| Agency-relevant behaviors in TutorBench rubrics | Guiding questions, calibrated hints, non-disclosive scaffolding |
Implications for Enterprise Technology Evaluation
For CTOs and technology procurement leaders evaluating LLM-based training or knowledge management systems, these findings underscore the importance of looking beyond task accuracy alone. An LLM that excels at generating correct answers may still fail to foster understanding when used as a tutor. The proposed diagnostic offers a lightweight way to check for this. By separating solving and pedagogy scores, organizations can identify systems that support learning and reduce the risk of investing in models that perform well on conventional benchmarks but underdeliver in educational or training contexts. The paper's recommendation to report the two dimensions separately gives both vendors and buyers a concrete way to assess LLM suitability for learning applications.