Historical medical archives and traditional medicines hold immense potential for drug discovery, according to a paper on arXiv. However, pre-ontological prose and idiosyncratic taxonomies make the data hard to standardize and modernize for use in current biomedical pipelines. The paper reports that no existing LLM agent system (whether tool-calling, retrieval-augmented, or agentic deep-research) can convert such text into verifiable drug-discovery leads at scale. DeepRoot, a multi-agent LLM system introduced in the paper, closes this gap by jointly building and using a verified knowledge graph.
The Problem: Unstructured Historical Medical Data
The paper observes that historical medical texts contain valuable knowledge but are not machine-readable, because they rely on non-standard terminology and narrative structure. This prevents modern biomedical pipelines from being applied to them directly for drug discovery. Existing LLM approaches, including those with tool-calling capabilities, struggle with hallucination and lack systematic reasoning. The authors note that grounding and reasoning, though often conflated, are separable axes that a system can compose for therapeutic reasoning.
DeepRoot's Architecture: Knowledge Graph and Multi-Agent Coordination
DeepRoot is a multi-agent system built on large language models (LLMs). Its agents both construct and query a verified knowledge graph (KG). According to the paper, the system separates the task of building the knowledge graph from the task of reasoning over it. This lets the KG serve as a factual grounding layer while the LLMs provide flexible reasoning. The multi-agent setup combines structured knowledge from the graph with natural language inference, with the aim of producing verifiable drug-discovery leads.
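The separation between building the graph and reasoning over it can be illustrated with a minimal sketch. This is not DeepRoot's implementation: the `Triple`, `KnowledgeGraph`, `build_graph`, and `grounded_context` names, the `source` provenance field, and the placeholder entities are all assumptions made for illustration. The idea is only that facts enter the graph through a verification step, and a reasoning stage later receives nothing but facts that passed it.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Triple:
    """One (subject, relation, object) fact with provenance (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    source: str  # where the fact was verified; empty string means unverified


@dataclass
class KnowledgeGraph:
    triples: set = field(default_factory=set)

    def neighbors(self, entity: str) -> List[Triple]:
        # Return facts about `entity` in a deterministic order.
        matches = [t for t in self.triples if t.subject == entity]
        return sorted(matches, key=lambda t: (t.relation, t.obj))


def build_graph(candidates: Iterable[Triple],
                verify: Callable[[Triple], bool]) -> KnowledgeGraph:
    """Construction stage: only triples that pass `verify` enter the graph."""
    kg = KnowledgeGraph()
    for triple in candidates:
        if verify(triple):
            kg.triples.add(triple)
    return kg


def grounded_context(kg: KnowledgeGraph, entity: str) -> List[str]:
    """Query stage: render verified facts as text a reasoning model could cite."""
    return [
        f"{t.subject} -[{t.relation}]-> {t.obj} (source: {t.source})"
        for t in kg.neighbors(entity)
    ]


# Placeholder entities, not real pharmacological data.
candidates = [
    Triple("herb_A", "contains", "compound_X", source="db:record-1"),
    Triple("herb_A", "treats", "condition_Y", source=""),  # no provenance, dropped
]
kg = build_graph(candidates, verify=lambda t: bool(t.source))
context = grounded_context(kg, "herb_A")
print(context)
```

In this sketch the verifier is a trivial provenance check. A real system would presumably verify candidates against external biomedical resources, which is outside the scope of the example.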
Performance Results: Accuracy and Hallucination Rates
Applied to the Shen Nong Ben Cao Jing, a classic Chinese medical text, DeepRoot substantially outperformed the baselines. The paper reports that DeepRoot recovers 10 of 21 held-out compound-disease treatment pairs within its top 20 ranked candidates (R@20), a recall of 47.6%. This compares to 4.8% for a raw corpus LLM and approximately 2.4% for random chance. In an LLM-as-judge audit of reasoning quality, DeepRoot dominated both baseline LLMs and LLMs with direct tool-call access to the same APIs that DeepRoot itself queries.
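The recall figure follows directly from the counts: 10 recovered pairs out of 21 held out. A minimal sketch of the R@k arithmetic is below, assuming a single ranked candidate list scored against the full held-out set; the paper may aggregate differently (for example, per disease). The `recall_at_k` function and the placeholder pair names are illustrative, not taken from the paper.

```python
def recall_at_k(ranked, held_out, k):
    """Fraction of held-out items that appear in the top-k of a ranked list."""
    targets = set(held_out)
    if not targets:
        raise ValueError("held_out must not be empty")
    top_k = set(ranked[:k])
    hits = sum(1 for item in targets if item in top_k)
    return hits / len(targets)


# 21 placeholder held-out pairs; the ranking puts 10 of them in the top 20.
held_out = [(f"compound_{i}", f"disease_{i}") for i in range(21)]
decoys = [(f"decoy_{i}", f"disease_{i}") for i in range(10)]
ranked = held_out[:10] + decoys + held_out[10:]

score = recall_at_k(ranked, held_out, k=20)
print(f"R@20 = {score:.1%}")  # R@20 = 47.6%
```

By the same arithmetic, 1 of 21 rounds to 4.8%, which would be consistent with the raw corpus LLM recovering a single pair, although the count is not stated here.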
A critical finding concerns hallucination rates. Tool-using LLMs hallucinated evidence on 87% of claims, according to the paper. DeepRoot, by contrast, hallucinated on only 7 to 10% of claims. Graph-only inference hallucinated on 0% of claims but ranked lowest on reasoning coherence. DeepRoot's combined KG+LLM approach was the only condition to win on both axes: low hallucination and high reasoning quality.
| System Condition | Recovery Rate (R@20) | Hallucination Rate | Reasoning Coherence (Rank) |
|---|---|---|---|
| Raw corpus LLM | 4.8% | (not reported separately) | Lower |
| Random baseline | ~2.4% | - | - |
| Tool-using LLMs | (not reported) | 87% | Lower than DeepRoot |
| Graph-only inference | (not reported) | 0% | Lowest |
| DeepRoot (KG+LLM) | 47.6% | 7-10% | Highest |
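The hallucination rates above are per-claim fractions. How the paper's audit decides that a claim's evidence is hallucinated is not described here, so the following is only a sketch of one plausible bookkeeping: a claim counts as hallucinated if it cites no evidence or cites an identifier that is absent from a store of verified evidence. The `Claim` type, the `evidence_id` field, and the example identifiers are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_id: Optional[str]  # identifier the model cites, if any (hypothetical)


def hallucination_rate(claims: Iterable[Claim], verified_ids: Set[str]) -> float:
    """Share of claims whose cited evidence is missing from the verified store."""
    claims = list(claims)
    if not claims:
        return 0.0
    unsupported = sum(
        1 for c in claims
        if c.evidence_id is None or c.evidence_id not in verified_ids
    )
    return unsupported / len(claims)


# Placeholder claims and evidence identifiers, not data from the paper.
verified_ids = {"ev-1", "ev-2", "ev-3"}
claims = [
    Claim("compound_X binds target_T", "ev-1"),
    Claim("compound_X is found in herb_A", "ev-2"),
    Claim("target_T is linked to condition_Y", "ev-3"),
    Claim("compound_X was trialled for condition_Y", "ev-99"),  # not in the store
]
rate = hallucination_rate(claims, verified_ids)
print(f"hallucination rate = {rate:.0%}")  # hallucination rate = 25%
```

Under this definition, a graph-only system that can emit nothing but stored facts scores 0% by construction, which matches the pattern in the table.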
Implications for Drug Discovery and Medical AI
The paper argues that DeepRoot points toward a systematic route for mining and repurposing historical medical knowledge. By treating grounding and reasoning as separable axes, the system demonstrates that combining a verified knowledge graph with LLM-based reasoning can simultaneously reduce hallucination and improve reasoning quality. This approach could enable scalable conversion of pre-ontological medical texts into structured, actionable knowledge for drug development pipelines. The results on the Shen Nong Ben Cao Jing suggest that similar methods could be applied to other historical medical archives, potentially uncovering treatment leads that have been overlooked in modern research.