
Haiku to Opus in Just 10 bits: LLMs Unlock Large Compression Gains

A new arXiv paper presents methods for compressing LLM-generated text, achieving over 100x reduction in data transfer compared to prior techniques. Lossless compression via domain-adapted LoRA adapters doubles efficiency, while an interactive Question-Asking protocol recovers up to 72% of the capability gap between small and large models using only 10 binary questions.

iGEN Editorial
June 16, 2026

Transmitting the output of large language models (LLMs) is bandwidth-intensive, especially for applications with limited connectivity or high data costs. A new arXiv paper by Roy Rinberg, Annabelle Michael Carrell, Simon Henniger, Nicholas Carlini, and Keri Warr introduces several compression techniques that dramatically reduce the size of LLM-generated responses, in some cases by more than 100x compared to earlier methods.

The study, titled "Haiku to Opus in Just 10 bits: LLMs Unlock Large Compression Gains," explores both lossless and lossy compression of LLM-generated text. The authors characterize a compression-compute frontier where greater compression is achievable at the cost of more computation.

Lossless Compression with LoRA Adapters

For lossless compression, the paper demonstrates that domain-adapted LoRA adapters can improve LLM-based arithmetic coding by 2x over compression using the base LLM alone. This means that without losing any information, the same output can be transmitted in half the bits, enabling faster and cheaper data transfer for applications that require exact reproduction of model responses.
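To make the "half the bits" claim concrete, here is a minimal sketch of the quantity that LLM-based arithmetic coding approaches: the sum of -log2(p) over the probabilities the model assigned to each token that actually occurred. The per-token probabilities below are hypothetical illustrations, not figures from the paper, and the helper names are assumptions introduced for this example.

```python
import math

def ideal_code_length_bits(token_probs):
    """Ideal coded size: sum of -log2(p) over the probability the model
    gave to each token that actually appeared in the text."""
    return sum(-math.log2(p) for p in token_probs)

def compression_ratio(compressed_bits, text):
    """Compressed size divided by raw size, assuming 8 bits per UTF-8 byte."""
    raw_bits = 8 * len(text.encode("utf-8"))
    return compressed_bits / raw_bits

# Hypothetical probabilities for the same 4-token response.
base_probs = [0.25, 0.25, 0.25, 0.25]   # base model: 2 bits per token
adapted_probs = [0.5, 0.5, 0.5, 0.5]    # domain-adapted model: 1 bit per token

base_bits = ideal_code_length_bits(base_probs)
adapted_bits = ideal_code_length_bits(adapted_probs)
print(base_bits, adapted_bits)  # 8.0 4.0
```

In this toy case the adapted model assigns each token twice the probability, which removes one bit per token and halves the total. A real arithmetic coder adds a small constant overhead on top of this ideal length.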

Lossy Compression via Succinct Rewrite

In the lossy regime, the researchers prompt the model to produce a succinct rewrite of the original response, then apply arithmetic coding. This method achieves compression ratios of approximately 0.03, representing a 2x improvement over compressing the original response directly. Such ratios are suitable for scenarios where a slight reduction in fidelity is acceptable in exchange for significant bandwidth savings.

Question-Asking Compression (QA)

The most striking result comes from a novel interactive protocol called Question-Asking compression (QA). Inspired by the game "Twenty Questions," a small model iteratively refines its response by asking a stronger model yes/no questions, with each answer transferring exactly one bit. After just 10 binary questions, the small model recovers between 23% and 72% of the capability gap between the two models on standard benchmarks, and between 7% and 38% on harder benchmarks. The corresponding compression ratios range from 0.0006 to 0.004, which are over 100x smaller than those of prior LLM-based compression methods (Delétang et al., 2024).
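The one-bit-per-answer accounting can be illustrated with a toy version of "Twenty Questions." This is not the paper's implementation: in the paper the small model poses questions about its own response, whereas this sketch substitutes a simple halving strategy over a hypothetical candidate list, with the strong model reduced to an oracle that knows the right answer. The function and variable names are assumptions made for illustration.

```python
def qa_compress(candidates, oracle_answer, max_questions=10):
    """Narrow a candidate list with yes/no questions.

    Returns the surviving candidates and the list of one-bit answers
    that the strong model (here, a simple oracle) sent back.
    """
    remaining = list(candidates)
    bits = []
    while len(remaining) > 1 and len(bits) < max_questions:
        half = remaining[: len(remaining) // 2]
        # Small model asks: "Is the correct answer in this half?"
        answer = oracle_answer in half
        bits.append(1 if answer else 0)  # exactly one bit per reply
        remaining = half if answer else remaining[len(half):]
    return remaining, bits

# 1024 hypothetical candidates can be resolved with 10 yes/no answers.
candidates = list(range(1024))
remaining, bits = qa_compress(candidates, oracle_answer=700)
print(remaining, len(bits))  # [700] 10
```

The point of the sketch is the budget: ten answers carry ten bits, which is enough to single out one option among 2^10 = 1024. How much of that budget turns into recovered capability depends on how informative the small model's questions are.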

The table below summarizes the performance across eight benchmarks spanning math, science, and code:

| Benchmark set | Capability recovery (standard) | Capability recovery (hard) | Compression ratio |
|---|---|---|---|
| 8 benchmarks (math, science, code) | 23% – 72% | 7% – 38% | 0.0006 – 0.004 |

The authors suggest that interactive protocols like QA can transfer knowledge far more efficiently than simply transmitting full responses. This has implications for real-time AI applications, mobile deployments, and distributed model serving where bandwidth is a bottleneck.

The paper is available on arXiv and includes code and data for reproducibility. The research underscores the potential of combining small and large models through intelligent communication to achieve high performance with minimal data transfer.

