Large language models (LLMs) are increasingly deployed to translate natural language questions into SQL queries through in-context learning (ICL), a technique that places example pairs of questions and SQL queries in the prompt to guide the model. However, according to a new study by Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, et al. (arXiv, 2025), these systems suffer from widespread correctness problems. The study, which the authors describe as the first comprehensive examination of ICL-based text-to-SQL errors, systematically analyzed four representative ICL techniques, five basic repairing methods, two benchmarks, and two LLM settings.
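To make the ICL setup concrete, here is a minimal sketch of how a few-shot text-to-SQL prompt might be assembled. The schema, the example pairs, and the `build_prompt` helper are hypothetical illustrations rather than anything taken from the study, and the call to an actual model is left out.

```python
# Hypothetical sketch: assembling a few-shot (in-context learning) prompt
# for text-to-SQL. The schema, examples, and function name are illustrative
# assumptions, not the techniques evaluated in the study.

SCHEMA = "CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, placed_on TEXT);"

# Each demonstration pairs a natural language question with its SQL query.
EXAMPLES = [
    ("How many orders are there?", "SELECT COUNT(*) FROM orders;"),
    ("What is the largest order total?", "SELECT MAX(total) FROM orders;"),
]


def build_prompt(schema, examples, question):
    """Return a prompt holding the schema, the demonstrations, and the new question."""
    parts = ["-- Database schema", schema, ""]
    for example_question, example_sql in examples:
        parts.append(f"-- Question: {example_question}")
        parts.append(example_sql)
        parts.append("")
    # The prompt ends with the unanswered question; the model is expected
    # to continue the text with a SQL query.
    parts.append(f"-- Question: {question}")
    return "\n".join(parts)


prompt = build_prompt(SCHEMA, EXAMPLES, "Which customers spent more than 100?")
print(prompt)
```

Nothing in this sketch checks the SQL that a model would return, which is the gap the study examines.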
Scope of the Study
The research covered a broad range of configurations to capture real-world error patterns. The four ICL techniques are described as representative approaches from the literature, though they are not named here. The five basic repairing methods presumably span common strategies such as re-prompting or syntax correction. Two standard benchmarks were used along with two LLM settings, which may differ in model size or sampling temperature. This design allowed the team to identify errors that persist across methods and contexts.
Error Categories and Types
The analysis uncovered 27 distinct error types grouped into 7 major categories. The individual types are not enumerated here, but the categories cover semantic, syntactic, and logical mistakes that commonly arise when LLMs misinterpret database schemas or user intent. The authors note that errors are widespread, which indicates that even advanced ICL-based text-to-SQL systems are not yet reliable enough for production use.
Limitations of Existing Repairs
According to the study, existing repair attempts yield only limited improvements in correctness. The researchers found that current methods incur high computational overhead and produce many mis-repairs, that is, fixes that introduce new errors or turn a correct query into an incorrect one. This makes them impractical for enterprise environments where accuracy and speed are critical.
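One way to make the notion of a mis-repair concrete is to compare execution results before and after a repair against a reference query. The sketch below does this with Python's built-in `sqlite3` module. The table, the queries, and the `classify_repair` helper are hypothetical, and the study's own evaluation procedure may differ. Matching results on a single small database only approximate query equivalence.

```python
# Hypothetical sketch: flagging a mis-repair by comparing execution results.
# The table, queries, and helper names are illustrative assumptions; the
# study's own evaluation procedure may differ.
import sqlite3


def run(conn, sql):
    """Execute a query and return its rows in a stable order, or None if it fails."""
    try:
        # key=repr keeps sorting safe when rows contain NULLs or mixed types.
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def classify_repair(conn, original_sql, repaired_sql, gold_sql):
    """Label a repair attempt; gold_sql is assumed to be a valid reference query."""
    gold = run(conn, gold_sql)
    was_correct = run(conn, original_sql) == gold
    is_correct = run(conn, repaired_sql) == gold
    if was_correct and not is_correct:
        return "mis-repair"  # a correct query was broken by the repair
    if not was_correct and is_correct:
        return "repaired"  # an incorrect query was fixed
    return "unchanged-correct" if was_correct else "still-wrong"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Ada", 120.0), (2, "Bo", 80.0), (3, "Ada", 40.0)],
)

gold = "SELECT customer FROM orders WHERE total > 100"
# The original query is already correct; the "repair" changes its meaning.
label_bad = classify_repair(
    conn, gold, "SELECT customer FROM orders WHERE total >= 40", gold
)
# The original references a column that does not exist; the repair fixes it.
label_good = classify_repair(
    conn, "SELECT name FROM orders WHERE total > 100", gold, gold
)
print(label_bad, label_good)
```

This sketch only covers the case where a correct query is made incorrect. Detecting a repair that swaps one error for another would need a finer-grained comparison.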
MapleDoctor: A New Detection and Repair Framework
To address these shortcomings, the team developed MapleDoctor, a novel framework for detecting and repairing text-to-SQL errors. MapleDoctor combines error detection with targeted repair strategies. The evaluation demonstrates:
| Metric | Existing solutions | MapleDoctor (relative to existing solutions) | Effect |
|---|---|---|---|
| Queries repaired | Baseline | 13.8% more | More queries fixed |
| Mis-repairs | Common | Negligible | Fewer newly introduced errors |
| Repair latency | High | 67.4% lower | Faster repairs |
According to the paper, MapleDoctor repairs 13.8% more queries than existing solutions, introduces a negligible number of mis-repairs, and reduces repair latency by 67.4%. The artifact is publicly available on GitHub, enabling replication and extension.
Implications for Enterprise Database Systems
For enterprises that rely on natural language interfaces to databases, which are common in supply chain analytics, inventory management, and logistics, the findings highlight the gap between LLM capabilities and production reliability. Text-to-SQL errors can lead to incorrect data retrieval, flawed reporting, and costly decisions. Tools like MapleDoctor offer a path to automated error correction, but the study underscores that manual validation remains essential. The systematic error taxonomy provides a foundation for building more robust systems, and the open-source release invites further work from the community.
As LLMs continue to be integrated into enterprise software, understanding and mitigating their failure modes will be critical for achieving trusted automation. This study takes a step toward that goal by quantifying the problem and proposing a practical remedy.