Multi-Agent Peer-Reviewed Reasoning Boosts LLM Accuracy in Medical Question Answering

Researchers designed a multi-agent peer-reviewed reasoning method for medical question answering, in which multiple LLMs generate chain-of-thought reasoning and evaluate each other's chains. In experiments with five models on three benchmarks, the approach consistently outperformed single-model reasoning and majority voting, reaching a best average accuracy of 0.820. The researchers report that the method scales effectively as more models participate and improves interpretability.

iGEN Editorial

June 16, 2026

Large language models (LLMs) are increasingly used in medical question answering (MedQA), but ensuring accuracy and interpretability remains a challenge. A new preprint on arXiv proposes a multi-agent peer-reviewed reasoning method that enables LLMs to act as both solvers and evaluators, achieving superior performance on three benchmark datasets.

The Peer-Review Approach

According to the paper titled "Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering" by Zaifu, Zhou, Shuang, Zhang, and Rui, the method involves multiple LLM agents independently generating chain-of-thought reasoning along with candidate answers. These agents then act as peer reviewers, evaluating each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer.
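The generate, review, and select loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` class, the `Solver` and `Reviewer` callables, the numeric scores, the averaging rule, and the exclusion of self-review are all assumptions made here, since the article does not specify the scoring scheme. The toy agents stand in for real LLM calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict


@dataclass
class Candidate:
    agent: str      # name of the agent that produced this chain
    reasoning: str  # chain-of-thought text
    answer: str     # candidate answer, e.g. an option letter


# Hypothetical interfaces: a solver maps a question to a candidate,
# a reviewer maps (question, candidate) to a numeric quality score.
Solver = Callable[[str], Candidate]
Reviewer = Callable[[str, Candidate], float]


def peer_reviewed_answer(
    question: str,
    solvers: Dict[str, Solver],
    reviewers: Dict[str, Reviewer],
) -> Candidate:
    """Return the candidate whose reasoning receives the highest mean peer score."""
    if len(solvers) < 2:
        raise ValueError("peer review needs at least two agents")

    # Step 1: every agent independently produces reasoning plus an answer.
    candidates = [solve(question) for solve in solvers.values()]

    # Step 2: each candidate is scored by every agent except its own author
    # (excluding self-review is an assumption of this sketch).
    def peer_score(candidate: Candidate) -> float:
        scores = [
            review(question, candidate)
            for name, review in reviewers.items()
            if name != candidate.agent
        ]
        if not scores:
            raise ValueError(f"no peer reviewers available for {candidate.agent}")
        return mean(scores)

    # Step 3: the highest-rated chain supplies the final answer.
    return max(candidates, key=peer_score)


# Toy stand-ins for LLM calls; the scores are made up for illustration only.
TOY_SCORES = {"agent_a": 0.4, "agent_b": 0.9, "agent_c": 0.5}


def make_solver(name: str, reasoning: str, answer: str) -> Solver:
    return lambda question: Candidate(name, reasoning, answer)


def toy_reviewer(question: str, candidate: Candidate) -> float:
    return TOY_SCORES[candidate.agent]


solvers = {
    "agent_a": make_solver("agent_a", "chain from agent_a", "B"),
    "agent_b": make_solver("agent_b", "chain from agent_b", "C"),
    "agent_c": make_solver("agent_c", "chain from agent_c", "B"),
}
reviewers = {name: toy_reviewer for name in solvers}

best = peer_reviewed_answer("example question", solvers, reviewers)
print(best.agent, best.answer)
```

In this toy setup the selected answer is "C", even though two of the three agents answered "B". Selection follows the rated quality of the reasoning rather than answer agreement.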

The experiments employed five state-of-the-art LLMs: Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, and GPT-oss-20B. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting.
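For comparison, the majority-voting baseline can be sketched in a few lines. This is an illustrative version only: the `majority_vote` helper and its tie-breaking behavior are assumptions, since the article does not describe how the paper resolves ties.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among the agents' final answers.

    Ties go to the answer seen first, which is how Counter.most_common
    orders equal counts; the paper's tie-breaking rule is not given.
    """
    if not answers:
        raise ValueError("majority voting needs at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers from three agents' chain-of-thought runs.
vote = majority_vote(["B", "C", "B"])
print(vote)
```

Unlike peer review, this baseline looks only at the final answers and ignores the reasoning that produced them.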

Experiment Results

The peer-reviewed reasoning method consistently outperformed both baselines across the three datasets: HeadQA, MedQA-USMLE, and PubMedQA. The best model combination achieved an average accuracy of 0.820, exceeding the strongest single model (0.777) and the best majority voting ensemble (up to 0.789).

Method	Best Average Accuracy
Strongest single model (chain-of-thought)	0.777
Majority voting (chain-of-thought)	up to 0.789
Multi-agent peer-reviewed reasoning	0.820
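To put the reported figures in perspective, the gains over each baseline can be worked out directly from the table. The helper name `gains` is illustrative. Because the majority-voting figure is reported as "up to 0.789", the gain computed against it is a lower bound.

```python
# Best average accuracies as reported in the table above.
single_best = 0.777
majority_best = 0.789  # reported as "up to 0.789"
peer_review = 0.820


def gains(baseline: float, new: float) -> tuple:
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = new - baseline
    relative = absolute / baseline
    return round(absolute, 3), round(100 * relative, 1)


gain_vs_single = gains(single_best, peer_review)
gain_vs_majority = gains(majority_best, peer_review)
print(gain_vs_single)    # (0.043, 5.5)
print(gain_vs_majority)  # (0.031, 3.9)
```

That is, peer review improves on the strongest single model by about 4.3 accuracy points (roughly 5.5% relative) and on the best majority-voting ensemble by at least 3.1 points (roughly 3.9% relative).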

Scaling and Evaluation

According to the researchers, the method scales effectively with more participating models. Additionally, peer assessments reliably distinguished high-quality reasoning chains from low-quality ones, underscoring the approach's robustness.

Implications for Trustworthy AI

The authors note that by emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness. They state it "offers a promising direction for trustworthy biomedical AI systems." For enterprise technology leaders, the method suggests a possible framework for improving LLM reliability in other high-stakes domains, such as law, finance, or technical support, where verifiable reasoning is critical.

