Transmitting the output of large language models (LLMs) is bandwidth-intensive, especially for applications with limited connectivity or high data costs. A new arXiv paper by Roy Rinberg, Annabelle Michael Carrell, Simon Henniger, Nicholas Carlini, and Keri Warr introduces several compression techniques that dramatically reduce the size of LLM-generated responses, in some cases by more than 100x compared to earlier methods.
The study, titled "Haiku to Opus in Just 10 bits: LLMs Unlock Large Compression Gains," explores both lossless and lossy compression of LLM-generated text. The authors characterize a compression-compute frontier where greater compression is achievable at the cost of more computation.
Lossless Compression with LoRA Adapters
For lossless compression, the paper demonstrates that domain-adapted LoRA adapters can improve the compression achieved by LLM-based arithmetic coding by 2x over the base LLM alone. In other words, the same output can be transmitted in about half the bits without losing any information, enabling faster and cheaper data transfer for applications that require exact reproduction of model responses.
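The link between model quality and compressed size can be sketched in a few lines. An arithmetic coder driven by a language model spends close to -log2(p) bits on each token, where p is the probability the model assigned to the token that actually occurred, so a model that predicts the text better produces a shorter code. The Python sketch below computes that ideal code length. The function name and the probability lists are illustrative assumptions, not values or code from the paper.

```python
import math


def ideal_code_length_bits(token_probs):
    """Ideal arithmetic-coding cost: sum of -log2(p) over the
    probabilities the model gave to the tokens that actually occurred."""
    return sum(-math.log2(p) for p in token_probs)


# Made-up per-token probabilities for the same 4-token response.
base_probs = [0.25, 0.5, 0.125, 0.25]   # hypothetical base model
adapted_probs = [0.5, 0.5, 0.5, 0.5]    # hypothetical domain-adapted model

base_bits = ideal_code_length_bits(base_probs)        # 2 + 1 + 3 + 2 bits
adapted_bits = ideal_code_length_bits(adapted_probs)  # 1 + 1 + 1 + 1 bits

print(base_bits, adapted_bits)
```

In this toy case the adapted model needs half as many bits as the base model, which is the kind of gain the 2x figure describes. A real coder adds a small constant overhead on top of the ideal length.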
Lossy Compression via Succinct Rewrite
In the lossy regime, the researchers prompt the model to produce a succinct rewrite of the original response, then apply arithmetic coding to the rewrite. This method achieves compression ratios of approximately 0.03 (compressed size relative to original size, so lower is better), a 2x improvement over compressing the original response directly. Such ratios suit scenarios where a slight reduction in fidelity is acceptable in exchange for significant bandwidth savings.
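To make the ratio arithmetic concrete, here is a minimal Python sketch. It assumes the ratio is compressed bits divided by the size of the original response at 8 bits per byte, and that the rewrite is still measured against the original response; the paper's exact baseline may differ. The response size and bit counts are hypothetical values chosen to reproduce the reported figures.

```python
def compression_ratio(compressed_bits: int, original_bytes: int) -> float:
    """Compressed size over original size, assuming 8 bits per original byte."""
    return compressed_bits / (8 * original_bytes)


# Hypothetical 2,000-byte response (illustrative, not from the paper).
original_bytes = 2000
direct = compression_ratio(960, original_bytes)     # coding the original response
rewritten = compression_ratio(480, original_bytes)  # coding a succinct rewrite

print(direct, rewritten, direct / rewritten)
```

Under these assumptions, a ratio of 0.03 means a 2,000-byte response travels as 480 bits, or 60 bytes.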
Question-Asking Compression (QA)
The most striking result comes from a novel interactive protocol called Question-Asking compression (QA). Inspired by the game "Twenty Questions," a small model iteratively refines its response by asking a stronger model yes/no questions, so each answer transfers exactly one bit. After just 10 binary questions, the small model recovers between 23% and 72% of the capability gap between the two models on standard benchmarks, and between 7% and 38% on harder benchmarks. The corresponding compression ratios range from 0.0006 to 0.004, over 100x smaller than those of prior LLM-based compression methods (Delétang et al., 2024).
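The shape of such a protocol can be sketched as a loop in which the only data crossing the channel is one bit per round. The Python example below is a toy, not the paper's implementation: the "small model" is a binary search over integers and the "strong model" simply knows a secret number. The names `qa_protocol`, `ask`, `answer`, and `update` are hypothetical, and in a real system they would wrap calls to the two models.

```python
def qa_protocol(ask, answer, update, state, num_questions=10):
    """Run a fixed number of yes/no rounds; return the final state and the bits sent."""
    bits = []
    for _ in range(num_questions):
        question = ask(state)          # small side picks a question
        bit = bool(answer(question))   # strong side replies with a single bit
        bits.append(bit)
        state = update(state, question, bit)  # small side refines its state
    return state, bits


# Toy stand-ins: the strong side knows SECRET, the small side narrows [lo, hi).
SECRET = 617


def ask(state):
    lo, hi = state
    return (lo + hi) // 2  # question: "is the secret >= this threshold?"


def answer(threshold):
    return SECRET >= threshold


def update(state, threshold, bit):
    lo, hi = state
    return (threshold, hi) if bit else (lo, threshold)


final_state, bits = qa_protocol(ask, answer, update, (0, 1024))
print(final_state, len(bits))
```

Ten answers are enough to single out one of 1,024 candidates here because each question halves the remaining range. How much a real model gains from each bit depends on how well it chooses its questions.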
The table below summarizes the reported ranges after 10 questions across eight benchmarks spanning math, science, and code:
| Benchmark set | Capability gap recovered (standard benchmarks) | Capability gap recovered (harder benchmarks) | Compression ratio |
|---|---|---|---|
| 8 benchmarks (math, science, code) | 23% – 72% | 7% – 38% | 0.0006 – 0.004 |
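One plausible reading of "capability gap recovered" is the share of the accuracy difference between the two models that the small model closes once it has received the answers. The Python sketch below encodes that reading. The formula is an assumption rather than a definition quoted from the paper, and the function name and accuracy values are hypothetical.

```python
def gap_recovered(small_acc, large_acc, assisted_acc):
    """Fraction of the small-to-large accuracy gap closed by the assisted small model."""
    gap = large_acc - small_acc
    if gap <= 0:
        raise ValueError("large model must outperform the small model")
    return (assisted_acc - small_acc) / gap


# Hypothetical accuracies in percentage points (not from the paper).
print(gap_recovered(small_acc=40, large_acc=90, assisted_acc=65))  # 25 / 50 = 0.5
```

Under this reading, a recovery of 50% means the assisted small model lands halfway between its own unaided accuracy and the stronger model's accuracy.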
The authors suggest that interactive protocols like QA can transfer knowledge far more efficiently than simply transmitting full responses. This has implications for real-time AI applications, mobile deployments, and distributed model serving where bandwidth is a bottleneck.
The paper is available on arXiv and includes code and data for reproducibility. The research underscores the potential of combining small and large models through intelligent communication to achieve high performance with minimal data transfer.