FP8 Debunks FP64 as HPC Holy Grail in New Paper from Satoshi Matsuoka

A new arXiv preprint by Satoshi Matsuoka challenges the long-held belief that native FP64 hardware is essential for high-performance scientific computing. The paper proposes that FP8 tensor-core operations, combined with the Ozaki Scheme II, can deliver equivalent double-precision accuracy, reducing FP64 from a hardware requirement to a derived guarantee. The analytical framework is tested across a five-layer hierarchy, projecting improved performance on upcoming NVIDIA GPUs.

iGEN Editorial

June 16, 2026

The assumption that native hardware FP64 is the irreducible foundation of scientific computing is being directly challenged in a new paper on arXiv. Authored by Satoshi Matsuoka, the preprint titled "FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail (June 13th version)" argues that on AI-optimized GPUs of the NVIDIA B300 generation and beyond, FP8 tensor throughput has grown to multiple PFLOPS while native FP64 throughput has collapsed to approximately 1.3 TFLOPS. The paper claims this shift is not just survivable but preferable: the FP8 tensor-core matrix-multiply can serve as the sole computational primitive for double-precision scientific computing.

The Ozaki Scheme II and FP8 Composition

According to the paper, every canonical kernel in scientific computing (dense and sparse linear algebra, spectral transforms, stencils), along with every application composing them, can be reduced via the Ozaki Scheme II to sequences of FP8 matrix operations. The Ozaki Scheme II relies on the Chinese Remainder Theorem to maintain accuracy: the product is computed modulo several coprime moduli, and the full-precision result is reconstructed from those residues. The only non-FP8 arithmetic involved is a bounded, fixed-width integer accumulation at reconstruction. This approach demotes native FP64 from a hardware requirement to a derived accuracy guarantee obtained by composition over the FP8 primitive.
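The Chinese Remainder Theorem step can be illustrated with a small sketch. This is not the paper's implementation. It is a minimal pure-Python illustration of the general idea: multiply integer matrices modulo several coprime moduli, then reconstruct the exact product from the residues. The moduli, the function names (`matmul_mod`, `crt_reconstruct`, `crt_matmul`), and the use of Python integers in place of FP8 tensor-core operations are all assumptions made for illustration. A real FP8 pipeline would also need moduli small enough for residues to be exactly representable, which this sketch does not model.

```python
from math import prod

# Hypothetical pairwise-coprime moduli, chosen only for this illustration.
MODULI = [251, 241, 239, 233]


def matmul_mod(a, b, m):
    """Multiply two integer matrices with all arithmetic reduced mod m.

    Stands in for one low-precision GEMM on residue matrices.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum((a[i][t] % m) * (b[t][j] % m) for t in range(inner)) % m
         for j in range(cols)]
        for i in range(rows)
    ]


def crt_reconstruct(residues, moduli):
    """Recover the unique signed integer in (-M/2, M/2] with the given residues."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    x %= big_m
    # Map to the symmetric range so negative entries are recovered.
    return x - big_m if x > big_m // 2 else x


def crt_matmul(a, b, moduli=MODULI):
    """Exact integer matrix product assembled from per-modulus products."""
    per_modulus = [matmul_mod(a, b, m) for m in moduli]
    rows, cols = len(a), len(b[0])
    return [
        [crt_reconstruct([pm[i][j] for pm in per_modulus], moduli)
         for j in range(cols)]
        for i in range(rows)
    ]
```

The reconstruction is exact only while every entry of the true product is smaller in magnitude than half the product of the moduli. That constraint is why the number of moduli has to grow with the target precision.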

Five-Layer Hierarchy and the TME Model

Matsuoka organizes the claim as a five-layer hierarchy: the FP8 op, Ozaki II, the basic kernels or Berkeley "dwarfs", composite solvers, and full applications. Because the dwarf taxonomy already spans scientific computing, the paper establishes the claim by exhibiting the reduction for every dwarf rather than a sample. The claim is falsifiable, and the paper builds an instrument to test it: a Tensor-Memory Equilibrium (TME) model that extends the Roofline model with emulation parameters designated as alpha, beta, and gamma.

The TME model identifies register-level fusion as the mechanism that keeps emulation memory-bound. It projects recovered FP64 performance across NVIDIA's B300 and Rubin architectures against an H100 baseline. The model could have returned a negative verdict, but according to the paper, it passes across the dwarfs and their compositions. This is the analytical half of a two-part program, with a follow-on implementation to validate the thesis on real silicon.
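For readers unfamiliar with the Roofline model that TME extends, the following sketch shows the classic bound and one simple way an emulation cost could enter it. This article does not define the paper's alpha, beta, and gamma parameters, so they are not modeled here. The single overhead factor `fp8_ops_per_fp64_op` and all function names are hypothetical stand-ins, not the TME formulation.

```python
def roofline(peak_flops, mem_bandwidth, intensity):
    """Classic Roofline bound: attainable FLOP/s for a kernel.

    peak_flops     -- peak compute throughput (FLOP/s)
    mem_bandwidth  -- memory bandwidth (bytes/s)
    intensity      -- arithmetic intensity of the kernel (FLOP/byte)
    """
    return min(peak_flops, mem_bandwidth * intensity)


def emulated_fp64_roofline(fp8_peak_flops, mem_bandwidth, intensity,
                           fp8_ops_per_fp64_op):
    """Roofline bound for FP64 emulated on FP8 hardware.

    Assumption for illustration: each emulated FP64 operation costs a
    fixed number of FP8 operations, which lowers the compute roof but
    leaves the memory roof unchanged.
    """
    compute_roof = fp8_peak_flops / fp8_ops_per_fp64_op
    return min(compute_roof, mem_bandwidth * intensity)


def break_even_overhead(fp8_peak_flops, native_fp64_peak_flops):
    """Largest FP8-ops-per-FP64-op cost at which emulation still matches
    native FP64 in the compute-bound regime."""
    return fp8_peak_flops / native_fp64_peak_flops
```

Under this simplified view, a memory-bound kernel sees the same attainable rate whether or not the compute roof is lowered, as long as the lowered roof stays above `mem_bandwidth * intensity`. This matches the article's point that keeping emulation memory-bound is what makes it viable.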

Implications for Enterprise HPC

For enterprise technology decision-makers evaluating high-performance computing investments, the paper suggests that hardware roadmaps favoring FP8 over FP64 need not compromise scientific accuracy. With native FP64 throughput on AI-optimized GPUs as low as ~1.3 TFLOPS on B300, against multiple PFLOPS of FP8 throughput, workloads that rely on native FP64 risk being bottlenecked by it. According to the paper, the Ozaki Scheme II supplies the accuracy guarantee: FP8-based computation can match double-precision results. Whether it does so at competitive speed depends on how efficiently the proposed composition is implemented.

Metric	NVIDIA B300 FP64	NVIDIA B300 FP8
Throughput	~1.3 TFLOPS	Multiple PFLOPS
Role in HPC	Traditional requirement	Proposed primitive via Ozaki II

While the paper is analytical and awaits hardware validation, it provides a framework for CTOs to reassess the necessity of FP64-capable hardware in their HPC clusters. The TME model's ability to project performance across architectures (B300, Rubin, H100 baseline) offers a tool for procurement planning.

Matsuoka's work is part of a broader trend in which AI-driven hardware optimizations are reshaping scientific computing. The paper's five-layer hierarchy and the TME model are designed to be extensible, and the author invites the community to test the claims once the follow-on implementation is released. For now, the preprint serves as a provocation: native FP64 may no longer be the holy grail of HPC, and FP8, with the right algorithmic scaffolding, could be all you need.
