New Legal QA Benchmark Exposes Hallucination Risks in Statute-Centric AI Retrieval

Researchers have introduced SearchFireSafety, a benchmark for statute-centric legal QA that evaluates hierarchical retrieval and safety. The study found that while graph-guided retrieval improves performance, domain-adapted large language models are more likely to hallucinate when key statutory evidence is missing, highlighting a critical safety trade-off.

iGEN Editorial

June 17, 2026

Legal question-answering systems have long been benchmarked against case law, but a new study argues that statutory domains, where relevant evidence is distributed across hierarchically linked documents, pose distinct challenges. According to the paper "Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA," posted on arXiv, conventional retrievers fail in this setting, and models often hallucinate when given incomplete context. The authors (listed as Chae, Kyubyung; Yeom, Jewon; Park, Jeongjae; Bae, Seunghyun; Jang, Ijun; Hyunbin, Jinkwan; and Kim, Taesup) introduce SearchFireSafety, a structure- and safety-aware benchmark instantiated on fire-safety regulations.

The SearchFireSafety Benchmark

SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. The benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient.
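To make the partial-context idea concrete, here is a minimal Python sketch of one way such scenarios could be generated: take the provisions a question depends on and withhold one at a time, expecting the model to abstain. The leave-one-out construction, the `Scenario` type, and the `make_scenarios` helper are illustrative assumptions, not the procedure described in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    question: str
    context_ids: List[str]  # provisions shown to the model
    withheld_id: str        # provision deliberately removed ("" if none)
    expected: str           # "answer" or "abstain"


def make_scenarios(question: str, gold_ids: List[str]) -> List[Scenario]:
    """Build one full-context case plus one leave-one-out case per gold provision.

    Assumption: every gold provision is necessary, so removing any single one
    makes the context insufficient and the safe behavior is to abstain.
    """
    scenarios = [Scenario(question, list(gold_ids), "", "answer")]
    for withheld in gold_ids:
        remaining = [pid for pid in gold_ids if pid != withheld]
        scenarios.append(Scenario(question, remaining, withheld, "abstain"))
    return scenarios
```

A question backed by three provisions would yield four scenarios under this scheme: one answerable case and three in which the expected behavior is refusal.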

Key Findings: Graph-Guided Retrieval and Safety Trade-off

Experiments across multiple large language models (LLMs) showed that graph-guided retrieval substantially improves the recovery of hierarchically fragmented statutory evidence. The study also revealed a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. The authors conclude that hierarchical retrieval and model safety need to be evaluated jointly.
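The article does not describe the retrieval algorithm in detail, but the general idea of graph-guided retrieval can be sketched as follows: start from the provisions a conventional retriever returns, then follow reference edges to pull in linked provisions. The function name `expand_with_graph`, the adjacency-map representation, and the hop limit are assumptions made for illustration, not the authors' implementation.

```python
from collections import deque
from typing import Dict, List, Set


def expand_with_graph(
    seeds: List[str],
    edges: Dict[str, List[str]],
    max_hops: int = 1,
) -> List[str]:
    """Breadth-first expansion of seed hits along reference edges.

    seeds: provision IDs returned by a flat retriever (hypothetical IDs).
    edges: provision ID -> IDs of provisions it refers to.
    """
    ordered: List[str] = list(dict.fromkeys(seeds))  # keep order, drop duplicates
    seen: Set[str] = set(ordered)
    queue = deque((pid, 0) for pid in ordered)
    while queue:
        pid, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in edges.get(pid, []):
            if neighbor not in seen:
                seen.add(neighbor)
                ordered.append(neighbor)
                queue.append((neighbor, hops + 1))
    return ordered


# Hypothetical three-level chain: an act points to a decree, which points to a rule.
edges = {"Act-12": ["Decree-5"], "Decree-5": ["Rule-3"]}
print(expand_with_graph(["Act-12"], edges, max_hops=2))
# ['Act-12', 'Decree-5', 'Rule-3']
```

A flat retriever that matched only "Act-12" would miss the two linked provisions; the expansion step is what recovers them.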

"Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings."

Implications for Enterprise Legal Technology

For enterprises deploying AI in regulatory compliance, statute-heavy domains such as fire safety, environmental regulations, and tax law demand retrieval systems that understand document hierarchies. The SearchFireSafety benchmark provides a methodology to test both retrieval accuracy and safe refusal, a capability critical to avoiding costly errors. Systems that rely on flat retrieval or simple semantic search may miss distributed evidence, while domain-adapted fine-tuning can increase hallucination risk when evidence is incomplete.

The trade-off identified in the study suggests that enterprise adopters should not only measure retrieval accuracy but also implement safety mechanisms that force the model to abstain when statutory context is insufficient. This is particularly relevant for any AI system used in regulatory compliance, where incorrect answers can lead to legal liability.
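One simple form such a safety mechanism could take is a gate that checks the retrieved context before the model is allowed to answer. The sketch below abstains whenever a retrieved provision refers to another provision that is absent from the context. This is a heuristic chosen for illustration; the names `missing_references`, `answer_or_abstain`, and the `generate` callback are hypothetical and do not come from the study.

```python
from typing import Callable, Dict, List, Set

REFUSAL = "Insufficient statutory context to answer."


def missing_references(retrieved: Set[str], edges: Dict[str, List[str]]) -> Set[str]:
    """Provisions referenced by the retrieved set but not present in it."""
    return {
        ref
        for pid in retrieved
        for ref in edges.get(pid, [])
        if ref not in retrieved
    }


def answer_or_abstain(
    question: str,
    retrieved: Set[str],
    edges: Dict[str, List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Call the model only when every referenced provision is in context."""
    missing = missing_references(retrieved, edges)
    if missing:
        return REFUSAL + " Missing: " + ", ".join(sorted(missing))
    return generate(question, sorted(retrieved))
```

A gate like this trades coverage for safety: it will refuse some questions the model could have answered correctly, so the threshold for refusal is a design choice to be tuned against measured hallucination and refusal rates.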

Benchmark Component	Description
Real-world questions	Require citation-aware retrieval from hierarchically linked statutes
Synthetic partial-context scenarios	Test model ability to abstain when statutory evidence is missing
Evaluation metrics	Retrieval accuracy, hallucination rate, refusal rate
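
The article names these metrics without defining them. As a rough sketch under assumed definitions, retrieval accuracy can be read as recall over the gold provisions, hallucination rate as the share of insufficient-context cases in which the model answered anyway, and refusal rate as the share of all cases in which the model abstained. The paper's exact definitions may differ.

```python
from typing import Dict, List, Set


def retrieval_recall(gold: Set[str], retrieved: Set[str]) -> float:
    """Fraction of gold provisions that were retrieved (1.0 if there are none)."""
    return len(gold & retrieved) / len(gold) if gold else 1.0


def safety_rates(outcomes: List[Dict[str, str]]) -> Dict[str, float]:
    """Compute assumed hallucination and refusal rates.

    Each outcome has 'expected' and 'actual', both "answer" or "abstain".
    """
    insufficient = [o for o in outcomes if o["expected"] == "abstain"]
    hallucinated = [o for o in insufficient if o["actual"] == "answer"]
    refused = [o for o in outcomes if o["actual"] == "abstain"]
    return {
        "hallucination_rate": len(hallucinated) / len(insufficient) if insufficient else 0.0,
        "refusal_rate": len(refused) / len(outcomes) if outcomes else 0.0,
    }
```

Reporting the two safety rates side by side matters because a model can drive hallucination to zero simply by refusing everything.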

As legal AI systems move beyond case law into statute-heavy practice areas, tools like SearchFireSafety offer a template for responsible deployment. The study is publicly available on arXiv, with associated code and data.

