DeepRoot Multi-Agent System Enables Therapeutic Reasoning Over Historical Medical Texts with 47.6% Accuracy

DeepRoot is a multi-agent LLM system that jointly builds and uses a verified knowledge graph for therapeutic reasoning over historical medical texts. Applied to the Shen Nong Ben Cao Jing, it recovers 10 of 21 held-out compound-disease treatment pairs at R@20 (47.6%), well above a raw corpus LLM (4.8%) and a random baseline (about 2.4%). The system also hallucinates evidence on 7-10% of claims, compared with 87% for tool-using LLMs, which the paper presents as a scalable route to mining historical medical knowledge.

iGEN Editorial

June 16, 2026

Historical medical archives and traditional medicines hold immense potential for drug discovery, according to a paper on arXiv. However, pre-ontological prose and idiosyncratic taxonomies keep this material from being standardized for use in current biomedical pipelines. The paper reports that no existing LLM agent system (tool-calling, retrieval-augmented, or agentic deep-research) can convert such text into verifiable drug-discovery leads at scale. DeepRoot, a multi-agent LLM system introduced in the paper, is presented as closing this gap by jointly building and using a verified knowledge graph.

The Problem: Unstructured Historical Medical Data

The paper argues that historical medical texts contain valuable knowledge but are not machine-readable, because they rely on non-standard terminology and narrative structure. This blocks direct use of modern biomedical pipelines for drug discovery. Existing LLM approaches, including those with tool-calling capabilities, struggle with hallucination and lack systematic reasoning. The authors note that grounding and reasoning are often conflated but are in fact separable axes, and that a system can compose the two for therapeutic reasoning.

DeepRoot's Architecture: Knowledge Graph and Multi-Agent Coordination

DeepRoot is built on large language models (LLMs) and coordinates multiple agents to both construct and query a verified knowledge graph (KG). According to the paper, the system treats building the knowledge graph and reasoning over it as separate tasks. The KG serves as a factual grounding layer, while the LLMs provide flexible reasoning. This setup lets the system combine structured knowledge from the graph with natural language inference, with the aim of producing verifiable drug-discovery leads.
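To make the build/query separation concrete, here is a minimal Python sketch. It is not DeepRoot's implementation: the class `VerifiedKG`, the `toy_verifier` function, and the placeholder entities (`herb_A`, `condition_X`, `condition_Z`) are all assumptions made for illustration. The only idea it demonstrates is that facts pass a verification step before entering the graph, and that the query side reads from the graph without adding to it.

```python
from dataclasses import dataclass, field

# A fact is a (subject, relation, object) triple of strings.
Triple = tuple[str, str, str]


@dataclass
class VerifiedKG:
    """Hypothetical stand-in for a verified knowledge graph."""
    triples: set[Triple] = field(default_factory=set)

    def admit(self, triple: Triple, verifier) -> bool:
        """Build phase: store a triple only if the verifier accepts it."""
        accepted = bool(verifier(triple))
        if accepted:
            self.triples.add(triple)
        return accepted

    def neighbors(self, subject: str, relation: str) -> set[str]:
        """Query phase: read-only lookup of objects linked to a subject."""
        return {o for s, r, o in self.triples if s == subject and r == relation}


def toy_verifier(triple: Triple) -> bool:
    """Placeholder check: accept only triples found in a tiny source set."""
    source_backed = {("herb_A", "treats", "condition_X")}
    return triple in source_backed


kg = VerifiedKG()
kg.admit(("herb_A", "treats", "condition_X"), toy_verifier)  # accepted
kg.admit(("herb_A", "treats", "condition_Z"), toy_verifier)  # rejected

# A reasoning step would only see what survived verification.
candidates = kg.neighbors("herb_A", "treats")
print(candidates)
```

In a real system the verifier would presumably be an agent checking a claim against the source text or an external database; here it is reduced to a set lookup so the example stays self-contained.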

Performance Results: Accuracy and Hallucination Rates

The system was evaluated on the Shen Nong Ben Cao Jing, a classic Chinese medical text. The paper reports that DeepRoot recovers 10 of 21 held-out compound-disease treatment pairs at R@20 (recall within the top 20 ranked candidates), a recall of 47.6%. This compares to 4.8% for a raw corpus LLM and approximately 2.4% for random chance. In an LLM-as-judge audit of reasoning quality, DeepRoot dominated both baseline LLMs and LLMs with direct tool-call access to the same APIs that DeepRoot itself queries.
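For readers unfamiliar with the metric, the following Python sketch shows how a recall-at-k figure such as 10 of 21 is computed. The function name `recall_at_k`, the synthetic compound and disease labels, and the assumption that each compound gets a ranked list of candidate diseases are illustrative only; the paper's actual evaluation code and ranking direction are not described here.

```python
def recall_at_k(rankings, held_out_pairs, k=20):
    """Fraction of held-out (compound, disease) pairs whose disease
    appears among the first k candidates ranked for that compound."""
    hits = 0
    for compound, disease in held_out_pairs:
        top_k = rankings.get(compound, [])[:k]
        if disease in top_k:
            hits += 1
    return hits / len(held_out_pairs)


# Synthetic data: 21 held-out pairs, 10 of which rank inside the top 20.
held_out_pairs = [(f"compound_{i}", f"disease_{i}") for i in range(21)]
rankings = {}
for i, (compound, disease) in enumerate(held_out_pairs):
    candidates = [f"distractor_{j}" for j in range(30)]
    position = i if i < 10 else 25  # index 25 falls outside the top 20
    candidates.insert(position, disease)
    rankings[compound] = candidates

score = recall_at_k(rankings, held_out_pairs, k=20)
print(f"R@20 = {score:.1%}")  # R@20 = 47.6%
```

With only 21 held-out pairs, each additional hit moves the score by roughly 4.8 percentage points, which is worth keeping in mind when comparing conditions.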

A critical finding concerns hallucination rates. Tool-using LLMs hallucinated evidence on 87% of claims, according to the paper. DeepRoot, by contrast, hallucinated on only 7-10% of claims. Graph-only inference hallucinated on 0% of claims but ranked lowest on reasoning coherence. DeepRoot's combined KG+LLM approach was the only condition to do well on both axes: low hallucination and high reasoning quality.

System Condition	Recovery Rate (R@20)	Hallucination Rate	Reasoning Coherence (Relative)
Raw corpus LLM	4.8%	(not reported)	Lower than DeepRoot
Random baseline	~2.4%	-	-
Tool-using LLMs	(not reported)	87%	Lower than DeepRoot
Graph-only inference	(not reported)	0%	Lowest
DeepRoot (KG+LLM)	47.6%	7-10%	Highest
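A hallucination rate of this kind can be read as the share of claims whose cited evidence cannot be found. The Python sketch below illustrates that arithmetic under an assumption: a claim counts as hallucinated when its `evidence_id` is missing from a known evidence store. The paper's own audit uses an LLM as judge, and its exact criteria are not given here, so the function `hallucination_rate` and the data layout are hypothetical.

```python
def hallucination_rate(claims, evidence_store):
    """Share of claims citing evidence that is absent from the store."""
    if not claims:
        raise ValueError("no claims to audit")
    unsupported = sum(
        1 for claim in claims if claim["evidence_id"] not in evidence_store
    )
    return unsupported / len(claims)


# Synthetic audit: 10 claims, 9 of which cite a passage that exists.
evidence_store = {f"passage_{i}" for i in range(9)}
claims = [
    {"text": f"claim_{i}", "evidence_id": f"passage_{i}"} for i in range(10)
]

rate = hallucination_rate(claims, evidence_store)
print(f"hallucination rate = {rate:.0%}")  # hallucination rate = 10%
```

Under this reading, a graph-only condition scores 0% by construction if every claim it emits is drawn directly from the store, which matches the pattern in the reported results.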

Implications for Drug Discovery and Medical AI

The paper argues that DeepRoot points toward a systematic route for mining and repurposing historical medical knowledge. By treating grounding and reasoning as separable axes, the system demonstrates that combining a verified knowledge graph with LLM-based reasoning can simultaneously reduce hallucination and improve reasoning quality. This approach could enable scalable conversion of pre-ontological medical texts into structured, actionable knowledge for drug development pipelines. The results on the Shen Nong Ben Cao Jing suggest that similar methods could be applied to other historical medical archives, potentially uncovering treatment leads that have been overlooked in modern research.
