Legal question-answering systems have long been benchmarked against case law, but a new study finds that statutory domains, where relevant evidence is distributed across hierarchically linked documents, pose distinct challenges. According to the research paper "Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA" on arXiv, conventional retrievers fail in this setting, and models often hallucinate when given incomplete context. The authors (Kyubyung Chae, Jewon Yeom, Jeongjae Park, Seunghyun Bae, Ijun Jang, Hyunbin, Jinkwan, and Taesup Kim) introduce SearchFireSafety, a structure- and safety-aware benchmark instantiated on fire-safety regulations.
The SearchFireSafety Benchmark
SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. The benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient.
Key Findings: Graph-Guided Retrieval and Safety Trade-off
Experiments across multiple large language models (LLMs) showed that graph-guided retrieval substantially improves the retrieval of hierarchically fragmented statutory evidence. The study also revealed a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. The authors conclude:
"Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings."
Implications for Enterprise Legal Technology
For enterprises deploying AI in regulatory compliance, statute-heavy domains such as fire safety, environmental regulations, and tax law demand retrieval systems that understand document hierarchies. The SearchFireSafety benchmark provides a methodology to test both retrieval accuracy and safe refusal—a feature critical to avoiding costly errors. Current systems that rely on flat retrieval or simple semantic search may miss distributed evidence, while domain-adapted fine-tuning can increase hallucination risk when evidence is incomplete.
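
To make the contrast between flat and hierarchy-aware retrieval concrete, the following is a minimal sketch, not the paper's method. It assumes a hypothetical four-provision corpus (`PROVISIONS`) with explicit cross-references, uses simple word overlap as a stand-in for a real retriever, and expands the top hit along reference links with a breadth-first search. All identifiers and texts are invented for illustration.

```python
import re
from collections import deque

# Hypothetical toy corpus: each provision has text and outgoing cross-references.
PROVISIONS = {
    "Act-Art10": {
        "text": "Buildings must install sprinkler systems as prescribed by decree.",
        "refs": ["Decree-Art5"],
    },
    "Decree-Art5": {
        "text": "Installation criteria follow the technical standard in Annex 2.",
        "refs": ["Annex-2"],
    },
    "Annex-2": {
        "text": "Floors above 30 meters require automatic heads on every level.",
        "refs": [],
    },
    "Act-Art20": {
        "text": "Penalties apply for failing to file annual reports.",
        "refs": [],
    },
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def flat_retrieve(query, top_k=1):
    """Rank provisions by word overlap with the query (stand-in for a real retriever)."""
    q = tokenize(query)
    scores = {pid: len(q & tokenize(p["text"])) for pid, p in PROVISIONS.items()}
    ranked = sorted(
        (pid for pid in scores if scores[pid] > 0),
        key=lambda pid: (-scores[pid], pid),
    )
    return ranked[:top_k]


def graph_retrieve(query, top_k=1, max_hops=2):
    """Start from the flat hits, then follow cross-references up to max_hops away."""
    seeds = flat_retrieve(query, top_k)
    found = list(seeds)
    queue = deque((pid, 0) for pid in seeds)
    while queue:
        pid, hops = queue.popleft()
        if hops == max_hops:
            continue
        for ref in PROVISIONS[pid]["refs"]:
            if ref in PROVISIONS and ref not in found:
                found.append(ref)
                queue.append((ref, hops + 1))
    return found


query = "Which buildings must install sprinkler systems?"
print(flat_retrieve(query))   # ['Act-Art10']
print(graph_retrieve(query))  # ['Act-Art10', 'Decree-Art5', 'Annex-2']
```

In this toy case the decree and annex share no words with the question, so the flat retriever never surfaces them; they are reached only through the reference links.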
The trade-off identified in the study suggests that enterprise adopters should not only measure retrieval accuracy but also implement safety mechanisms that force the model to abstain when statutory context is insufficient. This is particularly relevant for any AI system used in regulatory compliance, where incorrect answers can lead to legal liability.
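
One way such an abstention mechanism might look is sketched below. This is an illustrative assumption rather than anything described in the study: the gate treats the context as insufficient whenever a retrieved provision cites another provision that is not itself in the context, and refuses before the model is ever called. The `generate` callable, the `REFUSAL` message, and the provision IDs are all hypothetical.

```python
REFUSAL = "Insufficient statutory context to answer."


def missing_references(context):
    """Return IDs cited by provisions in the context but absent from it."""
    present = set(context)
    missing = set()
    for provision in context.values():
        missing.update(ref for ref in provision["refs"] if ref not in present)
    return sorted(missing)


def answer_or_abstain(question, context, generate):
    """Call generate() only when the context is non-empty and closed under citations."""
    missing = missing_references(context)
    if not context or missing:
        return {"answer": REFUSAL, "abstained": True, "missing": missing}
    return {"answer": generate(question, context), "abstained": False, "missing": []}


def fake_generate(question, context):
    """Hypothetical stand-in for an LLM call."""
    return "Answer grounded in: " + ", ".join(sorted(context))


# Partial context: the act cites a decree that was not retrieved.
partial = {"Act-Art10": {"text": "...as prescribed by decree.", "refs": ["Decree-Art5"]}}
print(answer_or_abstain("Is a sprinkler required?", partial, fake_generate))

# Complete context: every cited provision is present.
complete = dict(partial, **{"Decree-Art5": {"text": "Criteria...", "refs": []}})
print(answer_or_abstain("Is a sprinkler required?", complete, fake_generate))
```

A structural check like this is deliberately conservative: it can refuse questions that were answerable from the act alone, which is the kind of over-refusal a benchmark with both answerable and partial-context items would expose.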
| Benchmark Component | Description |
|---|---|
| Real-world questions | Require citation-aware retrieval from hierarchically linked statutes |
| Synthetic partial-context scenarios | Test model ability to abstain when statutory evidence is missing |
| Evaluation metrics | Retrieval accuracy, hallucination rate, refusal rate |
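
The three metrics in the table can be computed from per-question logs along the following lines. The article does not give the paper's exact definitions, so the ones here are assumptions: retrieval accuracy is taken as mean recall over gold citations, hallucination rate as the share of unanswerable (partial-context) items the model answered anyway, and refusal rate as the share of all items on which it abstained. The record fields and sample data are hypothetical.

```python
# Hypothetical evaluation log: one record per benchmark question.
RECORDS = [
    {"gold": {"A", "B"}, "retrieved": {"A", "B"}, "answerable": True, "abstained": False},
    {"gold": {"A", "C"}, "retrieved": {"A"}, "answerable": True, "abstained": False},
    {"gold": {"D"}, "retrieved": set(), "answerable": False, "abstained": True},
    {"gold": {"E", "F"}, "retrieved": {"E"}, "answerable": False, "abstained": False},
]


def compute_metrics(records):
    """Compute the three metrics under the assumed definitions described above."""
    # Mean per-question recall of gold citations.
    recalls = [len(r["retrieved"] & r["gold"]) / len(r["gold"]) for r in records]
    # Items whose context was deliberately incomplete.
    unanswerable = [r for r in records if not r["answerable"]]
    # Answering an unanswerable item counts as a hallucination here.
    hallucinated = [r for r in unanswerable if not r["abstained"]]
    refused = [r for r in records if r["abstained"]]
    return {
        "retrieval_recall": sum(recalls) / len(recalls),
        "hallucination_rate": len(hallucinated) / len(unanswerable) if unanswerable else 0.0,
        "refusal_rate": len(refused) / len(records),
    }


print(compute_metrics(RECORDS))
# {'retrieval_recall': 0.5, 'hallucination_rate': 0.5, 'refusal_rate': 0.25}
```

Reporting refusal rate over all items alongside hallucination rate over unanswerable items keeps a model from looking safe simply by refusing everything.
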
As legal AI systems move beyond case law into statute-heavy practice areas, benchmarks like SearchFireSafety offer a template for responsible deployment. The study is publicly available on arXiv, and its code and data are associated with the paper.