Training large language models (LLMs) on accelerators typically relies on external memory and host orchestration to handle optimizer-state updates, which increases latency and power consumption. According to a paper published on arXiv by Evgeny Ukladchikov, a new software reference architecture called NeuronFabric is designed to run transformer training with local Adam updates entirely on-chip, without external machine-learning frameworks. The architecture targets future FPGA and ASIC implementations and is validated through a complete C# prototype that implements the forward pass, backpropagation, and Adam optimization.
The prototype trains a 334K-parameter autoregressive transformer (dimensions: d=88, H=4, f=264, L=4, vocab=256) on the Shakespeare corpus. The key innovation is BF16W, a mixed-precision format that stores weights in BF16 while retaining Adam optimizer moments in FP32. This reduces the memory footprint for on-chip training at the cost of a small increase in evaluation loss relative to an FP32 reference.
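To make the BF16W idea concrete, the following is a minimal Python sketch of a single-parameter Adam step in which the weight is stored as a 16-bit BF16 pattern and the two moments are kept at FP32 precision. It is an illustration under stated assumptions, not the paper's C# implementation: the function names, the round-to-nearest-even rounding mode, and the hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`) are all hypothetical choices that the article does not specify.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x, using round-to-nearest-even.

    Assumption: the rounding mode is illustrative. NaN inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias so that ties go to the even upper half.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_f32(h: int) -> float:
    """Expand a 16-bit BF16 pattern to a float by zero-filling the low bits."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def adam_step_bf16w(w_bits, m, v, grad, t,
                    lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: weight held in BF16, moments m and v held in FP32.

    Hypothetical helper; hyperparameter defaults are the common Adam values,
    not values taken from the paper.
    """
    w = bf16_bits_to_f32(w_bits)
    # Moment updates are rounded to FP32, never to BF16.
    m = to_f32(beta1 * m + (1.0 - beta1) * grad)
    v = to_f32(beta2 * v + (1.0 - beta2) * grad * grad)
    # Standard bias correction for step t (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    # Only the weight is rounded back down to 16 bits for storage.
    return f32_to_bf16_bits(w), m, v


# Usage: one step from w = 0 with a gradient of 1.0.
w_bits, m, v = adam_step_bf16w(f32_to_bf16_bits(0.0), 0.0, 0.0, grad=1.0, t=1)
new_w = bf16_bits_to_f32(w_bits)
```

In this sketch the per-parameter state is 2 bytes for the weight plus 8 bytes for the two moments, which is where the memory saving relative to an all-FP32 layout comes from.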
Memory Savings with BF16W
A standard FP32 implementation of the 334K-parameter model with Adam moments requires approximately 4.0 MB (12 bytes per parameter: 4 for the weight and 8 for the two moments), which matches the BRAM capacity of a Xilinx ZCU102 device. The BF16W variant stores each weight in 2 bytes instead of 4 and requires only 3.34 MB, freeing memory for activation storage. The table below summarises the memory comparison:
| Configuration | Memory Required | Format |
|---|---|---|
| FP32 model + Adam moments | 4.0 MB | Full FP32 |
| BF16W (weights BF16, moments FP32) | 3.34 MB | Mixed-precision |
According to the paper, the BF16W approach reduces memory requirements by approximately 16.5% (0.66 MB of the 4.0 MB baseline) while keeping the optimizer moments in full FP32 precision. The aim is to make full training feasible on a single FPGA device without external DRAM.
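The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes that the stored state is one weight plus two Adam moments per parameter, that the parameter count is exactly 334,000, and that 1 MB means 10^6 bytes; the function name is hypothetical. Gradients and activations are deliberately left out, since the article does not say how they are budgeted.

```python
BYTES_FP32 = 4
BYTES_BF16 = 2


def training_state_bytes(n_params: int, weight_bytes: int,
                         moment_bytes: int = BYTES_FP32) -> int:
    """Bytes for weights plus the two Adam moments (m and v) per parameter."""
    return n_params * (weight_bytes + 2 * moment_bytes)


n_params = 334_000  # rounded parameter count reported for the prototype

fp32_bytes = training_state_bytes(n_params, BYTES_FP32)   # 12 bytes per parameter
bf16w_bytes = training_state_bytes(n_params, BYTES_BF16)  # 10 bytes per parameter
saving = 1.0 - bf16w_bytes / fp32_bytes

print(f"FP32 : {fp32_bytes / 1e6:.2f} MB")
print(f"BF16W: {bf16w_bytes / 1e6:.2f} MB")
print(f"Saving: {saving:.1%}")
```

Under these assumptions the unrounded saving is 2/12, or about 16.7%. The 16.5% figure follows when the baseline is first rounded to 4.0 MB.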
Validation Results
The BF16W configuration achieves an evaluation loss of 1.5426 after 80K training samples, compared to 1.5224 for an FP32 GPU reference, a gap of roughly 0.02 (about 1.3%). The paper reports that the BF16W model produces coherent character-level text, indicating that the reduced weight precision does not significantly degrade output quality. The author notes that earlier experiments revealed a vocabulary-budget constraint, which was addressed in the current design.
The prototype runs entirely in C# without external machine-learning frameworks, validating numerical correctness and memory requirements before hardware implementation. The author states that this publication serves as a public architectural disclosure and software reference implementation for future hardware exploration.
Next Steps: FPGA Training
The paper outlines FPGA training as the next stage of development. It includes no FPGA measurements, but the architecture is intended to guide future FPGA and ASIC designs. The memory savings demonstrated by BF16W are critical for enabling on-chip training on devices with limited BRAM, such as the Xilinx ZCU102.
For enterprise technology leaders evaluating AI hardware, NeuronFabric offers a potential pathway to deploy on-chip LLM training without dependence on cloud GPU clusters. While still at the reference-architecture stage, the approach could reduce energy consumption and latency for edge-based AI applications where real-time model adaptation is required. The open disclosure of the architecture allows hardware vendors and system integrators to begin exploring custom accelerator designs.