Software testing is critical for verifying that systems meet specified requirements, yet remains among the most time-consuming and expensive activities in development. Requirements-based test generation allows test cases to be derived early from requirements artifacts, but generating them directly from natural language is challenging due to inherent ambiguity and imprecision. According to a systematic survey by Folorunsho, Orimoloye, Reza, and Hassan (arXiv, 2026), recent advances in AI, natural language processing (NLP), and large language models (LLMs) have made automating this pipeline increasingly feasible, while introducing new risks including hallucination, reduced traceability, and inconsistent evaluation.
The Research Approach
Following Kitchenham and Charters' systematic review guidelines, the researchers searched major scholarly databases for work published between 2000 and 2025 and, after applying strict inclusion criteria, identified 21 primary studies. They organized the literature into three evolutionary eras, which allows a structured analysis of how techniques have progressed.
Three Eras of AI Test Generation
The survey maps the evolution across three eras:
- Early rule-based and template-driven approaches (pre-2010)
- Machine learning and statistical NLP methods (2010–2020)
- Deep learning and LLM-based generation (2020–2025)
Each era increased the level of automation but also introduced new challenges. In the LLM era, the most prominent of these are hallucination and weakened traceability.
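To make the earliest era concrete, here is a minimal sketch of a template-driven generator. It is an illustration, not a method taken from the survey. It assumes requirements are written in a hypothetical fixed pattern, "When <trigger>, the system shall <response>", and the names `GeneratedTest`, `REQUIREMENT_PATTERN`, and `generate_test` are invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneratedTest:
    """A test case skeleton derived from one requirement (illustrative type)."""
    requirement_id: str
    precondition: str
    expected: str


# Assumed requirement template: "When <trigger>, the system shall <response>."
REQUIREMENT_PATTERN = re.compile(
    r"^When (?P<trigger>.+?), the system shall (?P<response>.+?)\.?$",
    re.IGNORECASE,
)


def generate_test(requirement_id: str, text: str) -> Optional[GeneratedTest]:
    """Return a test skeleton, or None when the text does not fit the template."""
    match = REQUIREMENT_PATTERN.match(text.strip())
    if match is None:
        return None  # free-form language falls outside the rule
    return GeneratedTest(
        requirement_id=requirement_id,
        precondition=match.group("trigger"),
        expected=match.group("response"),
    )
```

A requirement that does not fit the template yields `None`, which illustrates why purely rule-based approaches struggle with ambiguous or free-form natural language.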
Six Quality Dimensions: No Complete Solution
A central finding of the survey concerns six key quality dimensions identified by the authors. The dimensions and their coverage status across the 21 studies are:
| Quality Dimension | Description | Status Across 21 Studies |
|---|---|---|
| Automation | Fully automated test generation from NL | Partially achieved by several LLM-based tools |
| Ambiguity handling | Resolving imprecision in natural language | Most studies lack robust ambiguity resolution |
| Domain applicability | Adaptability to different domains (e.g., finance, healthcare) | Limited; most techniques are domain-specific |
| Traceability | Linking generated tests back to original requirements | Weak in many works; a key research gap |
| Evaluation thoroughness | Rigorous metrics and benchmarks for test quality | Inconsistent metrics across studies |
| Hallucination control | Preventing LLMs from inventing unsupported behaviors | Rarely addressed; emerging concern |
As the survey states, "no existing approach simultaneously satisfies six key quality dimensions."
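One way to apply the six dimensions in practice is as a simple checklist. The sketch below is an assumption-laden illustration rather than anything the survey prescribes. The dimension identifiers mirror the table above, while the function names and the example assessment of a fictional tool are hypothetical.

```python
from typing import Dict, List

# The six quality dimensions from the table, as identifiers.
DIMENSIONS = (
    "automation",
    "ambiguity_handling",
    "domain_applicability",
    "traceability",
    "evaluation_thoroughness",
    "hallucination_control",
)


def unmet_dimensions(assessment: Dict[str, bool]) -> List[str]:
    """List dimensions that are missing from the assessment or marked False."""
    return [d for d in DIMENSIONS if not assessment.get(d, False)]


def satisfies_all(assessment: Dict[str, bool]) -> bool:
    """True only when every one of the six dimensions is marked satisfied."""
    return not unmet_dimensions(assessment)


# Hypothetical assessment of a fictional tool (not data from the survey).
example_assessment = {
    "automation": True,
    "ambiguity_handling": False,
    "domain_applicability": True,
    "traceability": False,
    "evaluation_thoroughness": True,
    # "hallucination_control" not assessed, so it counts as unmet
}

print(unmet_dimensions(example_assessment))
print(satisfies_all(example_assessment))
```

Treating an unassessed dimension as unmet is a deliberately conservative design choice: a tool passes only when there is positive evidence for every dimension.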
Actionable Research Guidelines
The survey contributes four actionable research guidelines aimed at closing the identified gaps:
- Hallucination: Develop methods to detect and mitigate hallucinations in generated test cases.
- Traceability: Ensure clear links between natural language requirements and each test case.
- Complexity sensitivity: Design techniques that scale with requirement complexity.
- Compliance: Align generated tests with regulatory standards and domain-specific constraints.
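The traceability and hallucination guidelines can be illustrated with a small bookkeeping check. This is a minimal sketch under stated assumptions, not a technique from the survey. It assumes each generated test declares the requirement IDs it claims to verify, and it treats a test with no link to any known requirement as a candidate for review, which is only a crude proxy for hallucination. The function name and the IDs are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def traceability_gaps(
    requirement_ids: Iterable[str],
    test_links: Dict[str, Set[str]],
) -> Tuple[List[str], List[str]]:
    """Return (untraced tests, uncovered requirements).

    test_links maps a test ID to the requirement IDs it claims to verify.
    """
    known = set(requirement_ids)

    # Tests that link to no known requirement: possible unsupported behavior.
    untraced_tests = sorted(
        test_id
        for test_id, linked in test_links.items()
        if not (set(linked) & known)
    )

    # Requirements that no test links back to.
    covered = set().union(*test_links.values()) & known
    uncovered_requirements = sorted(known - covered)

    return untraced_tests, uncovered_requirements


# Hypothetical data for illustration only.
requirements = ["REQ-1", "REQ-2", "REQ-3"]
links = {
    "T1": {"REQ-1"},
    "T2": {"REQ-1", "REQ-2"},
    "T3": set(),        # no declared requirement
    "T4": {"REQ-9"},    # refers to a requirement that does not exist
}

untraced, uncovered = traceability_gaps(requirements, links)
print(untraced)   # ['T3', 'T4']
print(uncovered)  # ['REQ-3']
```

A check like this only verifies that links exist. It does not confirm that a linked test actually exercises the behavior the requirement describes, which would need semantic comparison or human review.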
Implications for Enterprise Technology Leaders
For CTOs and technology procurement leaders, the findings indicate that AI-driven test generation from natural language is promising, but no available approach yet satisfies all six quality dimensions. Organizations investing in LLM-based testing tools should evaluate products against the six criteria, especially traceability and hallucination control, which directly affect reliability in regulated environments. The survey provides a framework for assessing vendor claims and setting realistic expectations for automation timelines.
The study is available as a preprint on arXiv (arXiv:2606.06563) and offers a comprehensive reference for researchers and practitioners navigating this rapidly evolving field.