A new study published on arXiv challenges the fundamental assumption behind how AI-powered tutoring chatbots are evaluated. The paper, by Alexandra Neagu, Jeffrey T H Wong, Marcus Messer, Rhodri Nelson, and Peter B Johnson, argues that current benchmarks for LLM tutors implicitly assume students will engage with the chatbot's scaffolding, that is, the graduated steps toward a solution. But in real-world deployments, students frequently ignore or bypass that scaffolding, steering the interaction toward their own learning goals.
The researchers introduce an evaluation pipeline built on two metrics: Chatbot Scaffolding and Student Uptake. They apply these metrics across nine datasets comprising 9,490 chats, spanning both standard AI tutor benchmarks and real-world educational chatbot logs. The analysis reveals a stark contrast: while benchmark environments produce high-scaffolding, high-uptake interactions, real-world students exhibit significantly lower uptake, often sidestepping the chatbot's pedagogical prompts at little interpersonal cost.
The study suggests that bypassing scaffolding is not necessarily detrimental. Instead, it frequently highlights a mismatch between the chatbot's pedagogical framing and the student's actual learning goals. The authors argue that to meaningfully evaluate a chatbot's effectiveness, future benchmarks must move beyond assuming student compliance and instead assess how chatbots navigate diverse learning contexts and student-driven interaction patterns.
The Benchmark Assumption
Current alignment and evaluation methods for embedding scaffolding behaviour into chatbots rest on an implicit assumption: that students will take up the scaffolding and engage in the conversation. This assumption drives the design of benchmark tasks and reward models. According to the paper, when chatbots are tested in controlled settings, they are typically rewarded for providing structured hints and step-by-step guidance, under the expectation that students will follow along.
However, real-world tutoring logs tell a different story. Students often ignore the chatbot's prompts, ask off-topic questions, or demand direct answers. The study quantifies this gap through the Student Uptake metric, which measures how frequently students accept and engage with the chatbot's scaffolding moves.
Two New Metrics for Evaluation
The paper proposes a two-metric evaluation framework:
| Metric | Description | Benchmark Result | Real-World Result |
|---|---|---|---|
| Chatbot Scaffolding | The degree to which a chatbot offers graduated, pedagogically structured assistance | High: Models are trained to maximise this metric | Varies: Some high, but often misaligned with student needs |
| Student Uptake | The frequency with which students accept and follow the chatbot's scaffolding | Assumed high: Benchmarks do not penalise low uptake | Significantly lower: Students frequently bypass or override scaffolding |
According to the authors, the combination of these metrics exposes the mismatch. Benchmarks that only measure chatbot scaffolding miss the critical dimension of student engagement. The paper argues that a chatbot that provides perfect scaffolding but is ignored by students is ultimately ineffective.
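To make the two metrics concrete, the sketch below computes them for a single chat. The article does not say how the paper operationalises either metric, so everything here is an assumption: the `Exchange` schema, the binary per-exchange labels, and the choice to compute uptake only over exchanges where scaffolding was offered are hypothetical, and the two example chats are invented for illustration rather than taken from the study's data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exchange:
    """One chatbot turn plus the student's reply (hypothetical schema)."""
    chatbot_scaffolds: bool  # assumed label: the chatbot turn offered scaffolding
    student_takes_up: bool   # assumed label: the student's reply engaged with it


def chatbot_scaffolding(chat: List[Exchange]) -> float:
    """Fraction of exchanges in which the chatbot offered scaffolding."""
    if not chat:
        return 0.0
    return sum(e.chatbot_scaffolds for e in chat) / len(chat)


def student_uptake(chat: List[Exchange]) -> Optional[float]:
    """Fraction of scaffolded exchanges the student took up.

    Returns None when the chatbot never scaffolded, because uptake is
    undefined in that case (an assumption of this sketch).
    """
    scaffolded = [e for e in chat if e.chatbot_scaffolds]
    if not scaffolded:
        return None
    return sum(e.student_takes_up for e in scaffolded) / len(scaffolded)


# Invented example: a benchmark-style chat where the student mostly follows along.
benchmark_like = [
    Exchange(True, True),
    Exchange(True, True),
    Exchange(True, False),
    Exchange(True, True),
]

# Invented example: a real-world-style chat where the student mostly bypasses hints.
real_world_like = [
    Exchange(True, False),
    Exchange(True, True),
    Exchange(False, False),
    Exchange(True, False),
]

for name, chat in [("benchmark-like", benchmark_like), ("real-world-like", real_world_like)]:
    print(name, chatbot_scaffolding(chat), student_uptake(chat))
```

Reporting the two numbers side by side, rather than folding them into one score, keeps visible the case the article highlights: a chat with high scaffolding but low uptake.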
Implications for AI Tutor Deployments
For enterprise technology leaders evaluating AI tutors or conversational agents in education, the findings underscore a need to test in real-world conditions rather than relying solely on benchmark scores. The paper's authors emphasise that future benchmarks must incorporate student-driven interaction patterns and evaluate how chatbots handle diverse learning contexts.
The study also suggests that bypassing behaviour may be a sign of students exercising agency to meet their own goals, a potentially productive behaviour that evaluation frameworks should not penalise. Chatbots should therefore be designed to adapt to student initiative, not just to deliver pre-scripted scaffolding sequences.
Moving Beyond Assumption
The paper concludes that current alignment and evaluation methods are insufficient on their own. The call to action for the research community is clear: develop benchmarks that measure how chatbots navigate real-world, student-led interactions, not just how well they produce scaffolding moves. This shift would better reflect the complexities of human learning and tutor–student dynamics.