Large language models (LLMs) are increasingly used in medical question answering (MedQA), but ensuring accuracy and interpretability remains a challenge. A new preprint on arXiv proposes a multi-agent peer-reviewed reasoning method in which LLMs act as both solvers and evaluators. On three benchmark datasets, it outperforms both single-model chain-of-thought reasoning and majority voting.
The Peer-Review Approach
According to the paper titled "Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering" by Zaifu, Zhou, Shuang, Zhang, and Rui, the method involves multiple LLM agents independently generating chain-of-thought reasoning along with candidate answers. These agents then act as peer reviewers, evaluating each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer.
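The generate, review, and select loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` type, the `Solver` and `Reviewer` callables, the numeric score scale, the use of a mean to aggregate scores, and the rule that an agent does not review its own chain are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict


@dataclass
class Candidate:
    author: str      # name of the agent that produced this chain
    reasoning: str   # chain-of-thought text
    answer: str      # candidate answer, e.g. "B"


# Hypothetical interfaces: in practice each would wrap a call to an LLM.
Solver = Callable[[str], Candidate]
Reviewer = Callable[[str, Candidate], float]


def peer_reviewed_answer(
    question: str,
    solvers: Dict[str, Solver],
    reviewers: Dict[str, Reviewer],
) -> Candidate:
    """Return the candidate whose reasoning gets the highest mean peer score."""
    # Step 1: every agent independently produces reasoning plus an answer.
    candidates = [solve(question) for solve in solvers.values()]

    def mean_peer_score(candidate: Candidate) -> float:
        # Step 2: the other agents rate the chain.
        # Assumption: an agent does not score its own reasoning chain.
        scores = [
            review(question, candidate)
            for name, review in reviewers.items()
            if name != candidate.author
        ]
        return mean(scores) if scores else 0.0

    # Step 3: the highest-rated chain supplies the final answer.
    return max(candidates, key=mean_peer_score)
```

Because selection depends on the rated quality of the reasoning rather than on how many agents agree, a well-argued minority answer can win under this scheme, which is the property that separates it from majority voting.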
The experiments employed five state-of-the-art LLMs: Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, and GPT-oss-20B. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting.
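For contrast, the majority-voting baseline can be sketched as follows. This is an illustrative version only: the function name `majority_vote` is hypothetical, and the tie-breaking rule (the answer seen first wins) is an assumption, since the article does not say how ties were handled.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the agents' final answers."""
    if not answers:
        raise ValueError("at least one answer is required")
    # Assumption: ties go to the answer that appears first in the list,
    # which is how Counter.most_common orders equal counts.
    return Counter(answers).most_common(1)[0][0]
```

Unlike peer review, this baseline looks only at the final answers and ignores the reasoning behind them.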
Experiment Results
The peer-reviewed reasoning method consistently outperformed both baselines across the three datasets: HeadQA, MedQA-USMLE, and PubMedQA. The best model combination achieved an average accuracy of 0.820, exceeding the strongest single model (0.777) and the best majority voting ensemble (up to 0.789).
| Method | Best Average Accuracy |
|---|---|
| Strongest single model (chain-of-thought) | 0.777 |
| Majority voting (chain-of-thought) | up to 0.789 |
| Multi-agent peer-reviewed reasoning | 0.820 |
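To put the reported averages in perspective, the short calculation below derives the absolute accuracy gain and the relative reduction in error rate from the three figures in the table. The helper names are hypothetical, and because the majority-voting figure is reported as "up to 0.789", the gains computed against it should be read as lower bounds.

```python
peer_review = 0.820   # multi-agent peer-reviewed reasoning
single_best = 0.777   # strongest single model (chain-of-thought)
vote_best = 0.789     # majority voting, reported as "up to" this value


def absolute_gain(new: float, old: float) -> float:
    """Difference in accuracy."""
    return new - old


def relative_error_reduction(new: float, old: float) -> float:
    """Fraction of the baseline's errors that the new method removes."""
    return (new - old) / (1.0 - old)


for label, baseline in [("single model", single_best), ("majority vote", vote_best)]:
    gain = absolute_gain(peer_review, baseline)
    reduction = relative_error_reduction(peer_review, baseline)
    print(f"vs {label}: +{gain:.3f} accuracy, {reduction:.1%} fewer errors")

# vs single model: +0.043 accuracy, 19.3% fewer errors
# vs majority vote: +0.031 accuracy, 14.7% fewer errors
```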
Scaling and Evaluation
According to the researchers, the method scales effectively with more participating models. Additionally, peer assessments reliably distinguished high-quality reasoning chains from low-quality ones, underscoring the approach's robustness.
Implications for Trustworthy AI
The authors note that by emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness. They state it "offers a promising direction for trustworthy biomedical AI systems." For enterprise technology leaders, the method suggests a possible framework for improving LLM reliability in other high-stakes domains, such as legal, financial, or technical support work, where verifiable reasoning is critical.