A new arXiv preprint directly challenges the assumption that native hardware FP64 is the irreducible foundation of scientific computing. Authored by Satoshi Matsuoka and titled "FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail (June 13th version)", it argues that on AI-optimized GPUs of the NVIDIA B300 generation and beyond, FP8 tensor throughput has grown to multiple PFLOPS while native FP64 throughput has collapsed to approximately 1.3 TFLOPS. The paper claims this shift is not just survivable but preferable: the FP8 tensor-core matrix multiply can serve as the sole computational primitive for double-precision scientific computing.
The Ozaki Scheme II and FP8 Composition
According to the paper, every canonical kernel in scientific computing (dense and sparse linear algebra, spectral transforms, stencils), along with every application composing them, can be reduced via the Ozaki Scheme II to sequences of FP8 matrix operations. The Ozaki Scheme II relies on the Chinese Remainder Theorem to maintain accuracy. The only non-FP8 arithmetic involved is a bounded, fixed-width integer accumulation at reconstruction. This approach demotes native FP64 from a hardware requirement to a derived accuracy guarantee obtained by composition over the FP8 primitive.
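The Chinese Remainder Theorem idea behind this reduction can be illustrated with a small Python sketch. It is not the paper's algorithm: it computes an exact integer matrix product from several small-modulus products and a final integer reconstruction, using NumPy integers in place of FP8 tensor-core operations. The moduli, the function name `crt_matmul`, and the input ranges are assumptions chosen for illustration. How double-precision inputs are scaled to integers and how residues are mapped onto FP8 operands is not described in this summary and is left out.

```python
from math import prod

import numpy as np

# Pairwise-coprime moduli, chosen only for illustration (an assumption,
# not taken from the paper). Each residue fits in 8 bits.
MODULI = (251, 253, 255, 256)


def crt_matmul(a, b, moduli=MODULI):
    """Exact integer matrix product rebuilt from small-modulus products.

    Valid only while every entry of the true product lies within
    +/- prod(moduli) // 2; otherwise the reconstruction wraps around.
    """
    big_m = prod(moduli)
    acc = np.zeros((a.shape[0], b.shape[1]), dtype=np.int64)
    for m in moduli:
        # Low-precision step: operands reduced to [0, m), product taken mod m.
        residue = (np.mod(a, m) @ np.mod(b, m)) % m
        # CRT weight: congruent to 1 mod m and to 0 mod every other modulus.
        m_i = big_m // m
        weight = (m_i * pow(m_i, -1, m)) % big_m
        # Bounded integer accumulation at reconstruction.
        acc = (acc + residue * weight) % big_m
    # Map from [0, big_m) back to the signed range.
    return np.where(acc > big_m // 2, acc - big_m, acc)


rng = np.random.default_rng(0)
a = rng.integers(-100, 101, size=(8, 8))
b = rng.integers(-100, 101, size=(8, 8))
assert np.array_equal(crt_matmul(a, b), a @ b)
```

In this toy version, the matrix products inside the loop only ever see operands below 256, and the single wide-integer step is the weighted accumulation, which loosely mirrors the split the paper describes between the FP8 primitive and the integer reconstruction.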
Five-Layer Hierarchy and the TME Model
Matsuoka organizes the claim as a five-layer hierarchy: the FP8 op, Ozaki II, the basic kernels (the Berkeley "dwarfs"), composite solvers, and full applications. Because the dwarf taxonomy already spans scientific computing, the paper argues that exhibiting the reduction for every dwarf, rather than for a sample, establishes the claim. The claim is falsifiable, and the paper builds an instrument to test it: a Tensor-Memory Equilibrium (TME) model that extends the Roofline model with emulation parameters designated as alpha, beta, and gamma.
The TME model identifies register-level fusion as the mechanism that keeps emulation memory-bound. It projects recovered FP64 performance across NVIDIA's B300 and Rubin architectures against an H100 baseline. The model could have returned a negative verdict, but according to the paper, the claim passes across the dwarfs and their compositions. This is the analytical half of a two-part program, with a follow-on implementation to validate the thesis on real silicon.
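To make the Roofline reasoning concrete, here is a minimal Python sketch of the classic Roofline bound plus a simple emulated-FP64 variant. It is not the TME model: the meanings of alpha, beta, and gamma are not given in this summary, so they are not modeled. The single overhead parameter `fp8_ops_per_fp64_op`, the assumption that fused emulation adds no memory traffic, and the FP8 peak, bandwidth, and overhead values are all hypothetical placeholders. Only the roughly 1.3 TFLOPS native FP64 figure comes from the article.

```python
def roofline(peak_flops, mem_bandwidth, intensity):
    """Classic Roofline: attainable FLOP/s at a given arithmetic
    intensity (FLOP per byte moved)."""
    return min(peak_flops, mem_bandwidth * intensity)


def emulated_fp64(fp8_peak, mem_bandwidth, intensity, fp8_ops_per_fp64_op):
    """Hypothetical emulated-FP64 bound.

    Assumes emulation divides the FP8 compute roof by a fixed overhead
    factor and, because intermediates stay in registers, leaves memory
    traffic unchanged. Both are simplifying assumptions.
    """
    compute_roof = fp8_peak / fp8_ops_per_fp64_op
    return min(compute_roof, mem_bandwidth * intensity)


def break_even_overhead(fp8_peak, native_fp64_peak):
    """Overhead factor below which the emulated compute roof exceeds
    the native FP64 peak."""
    return fp8_peak / native_fp64_peak


NATIVE_FP64 = 1.3e12   # FLOP/s, the B300 figure quoted in the article
FP8_PEAK = 4e15        # FLOP/s, placeholder for "multiple PFLOPS"
BANDWIDTH = 8e12       # bytes/s, placeholder
OVERHEAD = 100         # FP8 ops per emulated FP64 op, placeholder

for intensity in (0.1, 1.0, 10.0):
    native = roofline(NATIVE_FP64, BANDWIDTH, intensity)
    emulated = emulated_fp64(FP8_PEAK, BANDWIDTH, intensity, OVERHEAD)
    print(f"intensity={intensity}: native={native:.3g}, emulated={emulated:.3g}")

print(f"break-even overhead: {break_even_overhead(FP8_PEAK, NATIVE_FP64):.0f}")
```

Under these placeholder numbers, the emulated bound is set by memory bandwidth whenever bandwidth times intensity stays below the reduced compute roof, which is the regime the paper's "memory-bound" argument refers to. Real projections would need the paper's actual parameters.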
Implications for Enterprise HPC
For enterprise technology decision-makers evaluating high-performance computing investments, the paper suggests that hardware roadmaps favoring FP8 over FP64 may not compromise scientific accuracy. The collapse of native FP64 throughput on AI-optimized GPUs (approximately 1.3 TFLOPS on B300) relative to FP8 throughput (multiple PFLOPS) means that relying on native FP64 could become a bottleneck. According to the paper, the Ozaki Scheme II provides the accuracy guarantee: FP8-based computation can match double-precision results. Whether it also delivers competitive performance depends on the proposed composition being implemented efficiently.
| Metric | NVIDIA B300 FP64 | NVIDIA B300 FP8 |
|---|---|---|
| Throughput | ~1.3 TFLOPS | Multiple PFLOPS |
| Role in HPC | Traditional requirement | Proposed primitive via Ozaki II |
While the paper is analytical and awaits hardware validation, it provides a framework for CTOs to reassess the necessity of FP64-capable hardware in their HPC clusters. The TME model's performance projections for B300 and Rubin, measured against an H100 baseline, could inform procurement planning.
Matsuoka's work is part of a broader trend in which AI-driven hardware optimizations are reshaping scientific computing. The paper's five-layer hierarchy and the TME model are designed to be extensible, and the author invites the community to test the claims once the follow-on implementation is released. For now, the preprint serves as a provocation: native FP64 may no longer be the holy grail of HPC, and FP8, with the right algorithmic scaffolding, could be all you need.