Enterprises deploying large-scale AI and high-performance computing (HPC) workloads face a critical bottleneck: optimizing GPU kernels for peak performance. Manual kernel tuning is labor-intensive and requires deep expertise. A new reinforcement learning framework, daVinci-kernel, aims to automate this process by dynamically building and exploiting a skill library. According to a preprint on arXiv, daVinci-kernel jointly trains three agents in a co-evolution loop: a Skill Selection Agent, a Policy Agent, and a Skill Summary Agent.
How daVinci-kernel Works
daVinci-kernel operates through a cooperative multi-agent system in which all three agents share a single LLM backbone. The Skill Selection Agent retrieves relevant techniques from the skill library using BM25 retrieval followed by LLM reranking. The Policy Agent generates CUDA or Triton kernels over multiple turns, conditioned on the selected skills. The Skill Summary Agent distills successful rollouts into reusable skills. Crucially, candidate skills are added to the library only after execution-based verification confirms reproducible speedups. The framework is initialized via a structured supervised fine-tuning (SFT) cold start on diversity-filtered data, then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation.
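To make the retrieval step concrete, here is a minimal sketch of BM25 scoring over a small skill library. It is not the preprint's implementation: the skill names, descriptions, whitespace tokenizer, and the `k1`/`b` defaults are all illustrative assumptions, and the LLM reranking stage that the article says follows BM25 is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def select_skills(task, skills, top_k=2):
    """Rank skills by BM25 score against the task text; drop zero-score skills."""
    scores = bm25_scores(task, [s["description"] for s in skills])
    ranked = sorted(zip(scores, skills), key=lambda pair: pair[0], reverse=True)
    return [skill["name"] for score, skill in ranked[:top_k] if score > 0]


# Hypothetical skill library entries, invented for illustration only.
SKILLS = [
    {"name": "shared_memory_tiling",
     "description": "tile matmul operands into shared memory to reduce global memory traffic"},
    {"name": "warp_shuffle_reduction",
     "description": "use warp shuffle instructions for a fast sum reduction"},
    {"name": "kernel_fusion",
     "description": "fuse elementwise activation into the preceding matmul kernel"},
]

selected = select_skills("optimize a row-wise sum reduction kernel", SKILLS, top_k=1)
print(selected)
```

In a fuller pipeline, the BM25 shortlist would be passed to the LLM for reranking before the Policy Agent sees it; lexical retrieval keeps that reranking step cheap by limiting how many candidates the model has to read.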
Performance on KernelBench
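The researchers evaluated daVinci-kernel-14B on the KernelBench benchmark. Under the Fast$_1$ metric, which in KernelBench counts a task as solved only when the generated kernel is both correct and faster than the PyTorch baseline, the model achieved:
| Benchmark Level | Fast$_1$ Rate |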
|---|---|
| Level 1 | 37.2% |
| Level 2 | 70.6% |
| Level 3 | 32.2% |
According to the preprint, daVinci-kernel-14B outperformed the strongest prior RL-trained 14B model across all three levels.
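For readers unfamiliar with the metric, the sketch below shows how a Fast$_p$ score can be computed, assuming it follows KernelBench's definition: the fraction of tasks whose generated kernel is correct and achieves a speedup greater than $p$ over the baseline. The field names and timings are made up for illustration and are not results from the preprint.

```python
def fast_p(results, p=1.0):
    """Fraction of tasks whose kernel is correct and more than p times faster than baseline."""
    if not results:
        return 0.0
    hits = sum(
        1
        for r in results
        if r["correct"] and r["baseline_ms"] / r["kernel_ms"] > p
    )
    return hits / len(results)


# Hypothetical per-task measurements (illustrative values only).
results = [
    {"correct": True, "baseline_ms": 2.0, "kernel_ms": 1.0},   # 2.0x speedup
    {"correct": True, "baseline_ms": 1.0, "kernel_ms": 1.25},  # slower than baseline
    {"correct": False, "baseline_ms": 3.0, "kernel_ms": 1.0},  # fast but incorrect
    {"correct": True, "baseline_ms": 1.5, "kernel_ms": 1.0},   # 1.5x speedup
]

print(fast_p(results, p=1.0))   # two of four tasks qualify
print(fast_p(results, p=1.75))  # only the 2.0x task qualifies
```

Note that an incorrect kernel never counts, however fast it runs, which is why the metric is stricter than a plain speedup average.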
Implications for Enterprise AI Infrastructure
While currently focused on GPU kernel optimization, the daVinci-kernel framework suggests an approach to automated code optimization that may generalize. For enterprise technology teams, this could reduce the time and expertise required to optimize CUDA and Triton kernels for specific hardware configurations. Because candidate skills enter the dynamically evolving library only after execution-based verification, the library is designed to retain only optimizations with demonstrated speedups. The approach could potentially be extended to other domains where functional correctness can be checked and execution efficiency is the objective, such as database query optimization or network packet processing.
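The idea of an execution-verified skill library can be sketched as a simple admission gate. This is an illustrative assumption rather than the preprint's procedure: the `SkillLibrary` class, the 1.05x speedup threshold, the 80% pass rate, and the timing data are all hypothetical.

```python
from statistics import median


def is_reproducible_speedup(baseline_ms, candidate_ms,
                            min_speedup=1.05, min_pass_rate=0.8):
    """Accept only if the speedup holds across repeated paired timings."""
    if not baseline_ms or len(baseline_ms) != len(candidate_ms):
        return False
    speedups = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    pass_rate = sum(s >= min_speedup for s in speedups) / len(speedups)
    return median(speedups) >= min_speedup and pass_rate >= min_pass_rate


class SkillLibrary:
    """Hypothetical library that admits a skill only after verification."""

    def __init__(self):
        self.skills = {}

    def propose(self, name, description, baseline_ms, candidate_ms, outputs_match):
        # Correctness comes first: a fast but wrong kernel is never admitted.
        if not outputs_match:
            return False
        if not is_reproducible_speedup(baseline_ms, candidate_ms):
            return False
        self.skills[name] = description
        return True


library = SkillLibrary()

# Consistent speedup across five repeated runs: admitted.
stable = library.propose(
    "vectorized_loads", "use wider vectorized global loads",
    baseline_ms=[10.0] * 5, candidate_ms=[8.0, 8.1, 7.9, 8.2, 8.0],
    outputs_match=True,
)

# Speedup appears in only two of five runs: rejected as not reproducible.
noisy = library.propose(
    "aggressive_unroll", "unroll the inner loop by a large factor",
    baseline_ms=[10.0] * 5, candidate_ms=[8.0, 11.0, 10.5, 7.5, 10.2],
    outputs_match=True,
)

print(stable, noisy, sorted(library.skills))
```

Requiring both a median threshold and a minimum pass rate is one way to guard against a single lucky timing run promoting a skill that does not actually help.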
The preprint, authored by Dayuan Fu, Mohan Jiang, Tongyu Wang, Dian Yang, Jiarui Hu, Liming Liu, Jinlong Hou, and Pengfei, is available on arXiv under the title "daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization."