Haiku to Opus in Just 10 bits: LLMs Unlock Large Compression Gains

A new arXiv paper presents methods for compressing LLM-generated text, in some cases reaching compression ratios more than 100x smaller than prior LLM-based techniques. Lossless compression with domain-adapted LoRA adapters is 2x more efficient than using the base model alone, while an interactive Question-Asking protocol recovers up to 72% of the capability gap between small and large models using only 10 binary questions.

iGEN Editorial

June 16, 2026

Transmitting the output of large language models (LLMs) is bandwidth-intensive, especially for applications with limited connectivity or high data costs. A new research paper on arXiv by Roy Rinberg, Annabelle Michael Carrell, Simon Henniger, Nicholas Carlini, and Keri Warr introduces several compression techniques that dramatically reduce the size of LLM-generated responses, in some cases by more than 100x compared to earlier methods.

The study, titled "Haiku to Opus in Just 10 bits: LLMs Unlock Large Compression Gains," explores both lossless and lossy compression of LLM-generated text. The authors characterize a compression-compute frontier where greater compression is achievable at the cost of more computation.

Lossless Compression with LoRA Adapters

For lossless compression, the paper demonstrates that domain-adapted LoRA adapters can improve LLM-based arithmetic coding by 2x over compression using the base LLM alone. This means that without losing any information, the same output can be transmitted in half the bits, enabling faster and cheaper data transfer for applications that require exact reproduction of model responses.
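To see why a better-fitting model shortens the encoding, recall that an arithmetic coder spends roughly -log2(p) bits on a token to which the model assigned probability p. The sketch below is an illustration only, not the paper's code: the probability lists are made-up numbers standing in for a base model and a domain-adapted one, and `ideal_code_length_bits` is a hypothetical helper.

```python
import math

def ideal_code_length_bits(token_probs):
    """Approximate bits an arithmetic coder needs, given the probability
    the model assigned to each token that actually occurred."""
    return sum(-math.log2(p) for p in token_probs)

# Made-up per-token probabilities for the same 4-token message.
base_probs = [0.25, 0.25, 0.25, 0.25]   # hypothetical base model
adapted_probs = [0.5, 0.5, 0.5, 0.5]    # hypothetical domain-adapted model

base_bits = ideal_code_length_bits(base_probs)
adapted_bits = ideal_code_length_bits(adapted_probs)
print(base_bits, adapted_bits)  # 8.0 4.0
```

In this toy case the adapted model is twice as confident about each token, so the message costs half as many bits, which mirrors the kind of 2x gain the article describes.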

Lossy Compression via Succinct Rewrite

In the lossy regime, the researchers prompt the model to produce a succinct rewrite of the original response, then apply arithmetic coding. This method achieves compression ratios (compressed size relative to the original) of approximately 0.03, representing a 2x improvement over compressing the original response directly. Such ratios are suitable for scenarios where a slight reduction in fidelity is acceptable in exchange for significant bandwidth savings.
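As a rough illustration of what a ratio near 0.03 means, the sketch below computes compressed size divided by raw size. It assumes the raw size is counted at 8 bits per UTF-8 byte, which may differ from the paper's baseline; `compression_ratio`, the 1,000-byte stand-in response, and the 240-bit figure are all hypothetical.

```python
def compression_ratio(compressed_bits, original_text):
    """Compressed size divided by raw size, with the raw size taken as
    8 bits per UTF-8 byte (an assumption made for this example)."""
    original_bits = 8 * len(original_text.encode("utf-8"))
    return compressed_bits / original_bits

response = "x" * 1000                      # stand-in for a 1,000-byte response
ratio = compression_ratio(240, response)   # 240 bits is a made-up figure
print(ratio)  # 0.03
```

Under these assumptions, a 1,000-byte response would travel as about 30 bytes.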

Question-Asking Compression (QA)

The most striking result comes from a novel interactive protocol called Question-Asking compression (QA). Inspired by the game "Twenty Questions," a small model iteratively refines its response by asking a stronger model yes/no questions, so that each answer transfers exactly one bit. After just 10 binary questions, the small model recovers between 23% and 72% of the capability gap between the two models on standard benchmarks, and between 7% and 38% on harder benchmarks. The corresponding compression ratios range from 0.0006 to 0.004, over 100x smaller than those of prior LLM-based compression methods (Delétang et al., 2024).
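The bit accounting behind such a protocol can be sketched with a toy loop. This is a deliberate simplification and not the paper's method: in the real protocol the small model writes natural-language questions, whereas here it merely halves a fixed list of candidate drafts. The `qa_compress` function, the `answer_yes_no` callable playing the stronger model, and the draft names are hypothetical.

```python
def qa_compress(candidates, answer_yes_no, max_questions=10):
    """Toy Question-Asking loop: narrow a candidate list using yes/no answers.

    candidates: the small model's candidate responses (hypothetical stand-in).
    answer_yes_no: callable playing the stronger model; given a list of
        candidates, it returns True if its preferred response is among them.
    Returns the surviving candidates and the list of transmitted bits.
    """
    bits = []
    pool = list(candidates)
    while len(pool) > 1 and len(bits) < max_questions:
        half = pool[: len(pool) // 2]
        yes = answer_yes_no(half)          # one question, one bit back
        bits.append(1 if yes else 0)
        pool = half if yes else pool[len(half):]
    return pool, bits

candidates = [f"draft-{i}" for i in range(1024)]
preferred = "draft-700"                    # what the stronger model would pick
pool, bits = qa_compress(candidates, lambda subset: preferred in subset)
print(pool, len(bits))  # ['draft-700'] 10
```

Ten one-bit answers are enough to single out one of 1,024 options, which gives a feel for how much a stronger model can steer a weaker one with very little data on the wire.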

The table below summarizes the ranges reported after 10 questions across eight benchmarks spanning math, science, and code:

Benchmark Set	Capability Recovery (Standard)	Capability Recovery (Hard)	Compression Ratio
8 benchmarks (math, science, code)	23% – 72%	7% – 38%	0.0006 – 0.004
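One natural reading of "capability gap recovered" is the share of the score difference between the small and large model that the assisted small model closes. The article does not spell out the formula, so the definition below is an assumption, and the scores are made-up numbers chosen to keep the arithmetic simple.

```python
def gap_recovered(small_score, large_score, assisted_score):
    """Fraction of the small-to-large score gap closed by the assisted
    small model. Assumed definition; the paper may compute it differently."""
    gap = large_score - small_score
    if gap <= 0:
        raise ValueError("large_score must exceed small_score")
    return (assisted_score - small_score) / gap

# Made-up accuracies in percent: small model 40, large model 90,
# small model after asking its questions 76.
print(gap_recovered(small_score=40, large_score=90, assisted_score=76))  # 0.72
```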

The authors suggest that interactive protocols like QA can transfer knowledge far more efficiently than simply transmitting full responses. This has implications for real-time AI applications, mobile deployments, and distributed model serving where bandwidth is a bottleneck.

The paper is available on arXiv and includes code and data for reproducibility. The research underscores the potential of combining small and large models through intelligent communication to achieve high performance with minimal data transfer.
