Enterprise adoption of multi-agent LLM systems introduces privacy risks that conventional output-only assessments cannot detect, according to a new benchmark from academic researchers. The benchmark, called AgentLeak, instruments seven privacy-relevant communication pathways and provides a large-scale empirical evaluation focused on final outputs, inter-agent messages, and shared memory. The findings have direct implications for any organization deploying multi-agent AI in sensitive domains such as healthcare, finance, supply chain management, and logistics.
Across 1,000 scenarios spanning healthcare, finance, legal, and corporate domains, the researchers tested five production LLMs: GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B. They collected 4,979 validated execution traces to measure leakage rates.
Key Findings: Internal Channels Are the Weak Point
Multi-agent configurations reduce leakage in final outputs (C1: 27.2%) compared to single-agent mode (43.2%), but they introduce internal channels that dramatically increase total system exposure. Inter-agent messages (C2) leak at 68.8% — meaning that output-only audits miss 41.7% of violations. The aggregated leakage across final outputs, inter-agent messages, and shared memory (C1, C2, C5) reaches 68.9%.
| Channel | Leakage Rate |
|---|---|
| Single-agent final output (baseline) | 43.2% |
| Multi-agent final output (C1) | 27.2% |
| Inter-agent messages (C2) | 68.8% |
| Aggregated multi-agent (C1+C2+C5) | 68.9% |
The pattern C2 ≥ C1 held consistently across all five models and all four domains.
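To make the relationship between per-channel and aggregated rates concrete, here is a minimal Python sketch. It assumes a hypothetical `Trace` record that lists which channels leaked in a given run, and it treats the aggregated rate as a union over channels (a trace counts once, however many channels leak). The schema, the function names, and the toy data are illustrative assumptions; the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass

# Channel labels follow the article: final output, inter-agent messages, shared memory.
CHANNELS = ("C1", "C2", "C5")


@dataclass(frozen=True)
class Trace:
    """One execution trace and the channels that leaked in it (hypothetical schema)."""
    trace_id: str
    leaked: frozenset  # subset of CHANNELS


def channel_rate(traces, channel):
    """Fraction of traces with at least one leak on the given channel."""
    return sum(channel in t.leaked for t in traces) / len(traces)


def aggregate_rate(traces, channels=CHANNELS):
    """Fraction of traces leaking on any listed channel (a union, not a sum)."""
    wanted = set(channels)
    return sum(bool(t.leaked & wanted) for t in traces) / len(traces)


def missed_by_output_audit(traces):
    """Fraction of traces that leak internally while the final output (C1) is clean."""
    return sum(bool(t.leaked) and "C1" not in t.leaked for t in traces) / len(traces)


# Toy data invented for illustration only; not drawn from the benchmark.
traces = [
    Trace("t1", frozenset({"C1", "C2"})),
    Trace("t2", frozenset({"C2"})),
    Trace("t3", frozenset({"C2", "C5"})),
    Trace("t4", frozenset()),
    Trace("t5", frozenset({"C5"})),
]

for ch in CHANNELS:
    print(ch, channel_rate(traces, ch))
print("aggregate", aggregate_rate(traces))
print("missed by output-only audit", missed_by_output_audit(traces))
```

Because the aggregate is a union, it can never fall below the highest single-channel rate, and an output-only audit sees only the C1 row of this calculation.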
Why This Matters for Enterprise AI
For enterprise technology leaders deploying multi-agent LLM systems — for example, in automated supply chain coordination, trade finance document processing, or logistics optimization — the research highlights that architectural coordination channels can become the primary vector for data leakage. As the researchers note, "privacy risk in multi-agent systems is strongly shaped by architectural coordination channels rather than final-output behavior alone."
"Inter-agent messages (C2) leak at 68.8%, compared with 27.2% for final outputs (C1), meaning that output-only audits miss 41.7% of violations."
This suggests that standard security practices — such as scanning only the final LLM output for sensitive data — are insufficient. Enterprises must inspect and sanitize inter-agent communication paths, shared memory, and tool arguments to mitigate total exposure.
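A channel-aware scan might look like the following sketch. It assumes events have already been captured as `(channel, text)` pairs and uses two simple regular expressions as stand-in detectors. The channel names, the `PATTERNS` table, and the sample events are hypothetical and are not part of AgentLeak or any orchestration framework.

```python
import re
from collections import defaultdict

# Stand-in detectors; a real deployment would plug in its own PII classifiers.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_events(events):
    """Map each channel to the set of detector names that fired on its text."""
    findings = defaultdict(set)
    for channel, text in events:
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                findings[channel].add(name)
    return dict(findings)


# Invented sample run: the final output is clean, the internal channels are not.
events = [
    ("inter_agent_message", "Worker: patient SSN is 123-45-6789, contact jane@example.com"),
    ("shared_memory", "cached_contact=jane@example.com"),
    ("tool_argument", "lookup(ssn='123-45-6789')"),
    ("final_output", "The patient's claim has been approved."),
]

findings = scan_events(events)
for channel, names in findings.items():
    print(channel, sorted(names))
```

In this invented run, a scanner limited to `final_output` reports nothing, while the same detectors applied to the internal channels flag three of them.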
Research Methodology and Scope
The study, authored by Faouzi El Yagoubi, Godwin Badu-Marfo, and Ranwa Al Mallah, was released on arXiv (paper ID 2602.11510). It defines AgentLeak as a benchmark that instruments seven privacy-relevant pathways, though the current evaluation focuses on final outputs (C1), inter-agent messages (C2), and shared memory (C5). The multi-agent experiments used a coordinator-worker architecture.
Implications for Supply Chain and Logistics
While the benchmark domains do not explicitly include supply chain or trade, the findings are likely to carry over, because the leakage stems from how agents coordinate rather than from domain-specific content. Multi-agent systems are increasingly used in logistics for tasks like real-time route optimization, customs document handling, and supplier coordination. In these contexts, inter-agent messages often contain proprietary pricing data, contract terms, customer identities, or trade secrets. If inter-agent channels in such systems leaked at a rate comparable to the 68.8% measured in the benchmark, this information could be exposed to unintended parties, including competitors or malicious actors.
Technology procurement leaders evaluating multi-agent AI platforms should demand from vendors:
- Audit trails of all inter-agent messages
- Redaction or encryption of internal communication channels
- Benchmarking against tools like AgentLeak before deployment
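The first two requirements can be illustrated with a small wrapper that redacts a message before delivery and records an audit entry for it. This is a sketch under stated assumptions: `AuditedChannel`, its `send` method, and the single SSN pattern are hypothetical and do not correspond to the API of any real orchestration framework.

```python
import hashlib
import re
import time

# Stand-in detector; production systems would use broader PII detection.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class AuditedChannel:
    """Hypothetical inter-agent channel that redacts messages and logs each send."""

    def __init__(self):
        self.audit_log = []

    def send(self, sender, recipient, message):
        # Redact before the message leaves the sender's boundary.
        redacted, count = SSN.subn("[REDACTED]", message)
        # Log metadata and a digest of the redacted text, never the raw message.
        self.audit_log.append({
            "ts": time.time(),
            "sender": sender,
            "recipient": recipient,
            "redactions": count,
            "sha256_redacted": hashlib.sha256(redacted.encode("utf-8")).hexdigest(),
        })
        return redacted


channel = AuditedChannel()
delivered = channel.send("coordinator", "worker", "SSN 123-45-6789 for claim 42")
print(delivered)
print(channel.audit_log[0]["redactions"])
```

Logging a digest rather than the message body keeps the audit trail itself from becoming another copy of the sensitive data; whether that trade-off suits a given deployment depends on its forensic requirements.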
Competitive and Industry Context
The five LLMs evaluated come from four major providers: OpenAI, Anthropic, Mistral AI, and Meta. No single model performed uniformly better on internal-channel leakage, indicating that the architectural design of multi-agent systems, not just the model choice, drives privacy outcomes.
The AgentLeak benchmark itself is, as of publication, a research artifact rather than a commercial product. However, it provides a methodology that could be adopted or adapted by enterprise security teams or third-party auditors. Vendors and open-source projects building multi-agent orchestration frameworks (e.g., CrewAI, AutoGen) may face increased scrutiny over internal data handling.
The Bottom Line
For any organization deploying multi-agent LLMs in production, the AgentLeak findings underscore a critical blind spot: standard output-level defenses cannot see the most significant leak path. Enterprises should immediately begin auditing their multi-agent architectures for internal-channel leakage, using benchmarks like AgentLeak as a reference. The technology stack — whether built on top of GPT-4o, Claude, or Llama — must incorporate privacy controls at the agent communication layer.