According to a new research paper published on arXiv by Shray Mathur, J. Anibal Boscoboinik, Esther H. R. Tsai, and Kevin G. Yager, the capabilities of large language models (LLMs) are not improving uniformly. Instead, progress is "jagged," with uneven performance across tasks, domains, and model scales. The authors argue that this jaggedness can be a resource rather than a limitation, especially for scientific creativity. The paper introduces SciAidanBench, a benchmark designed to measure the scientific idea generation potential of LLMs.
The SciAidanBench Benchmark
SciAidanBench presents LLMs with open-ended scientific questions and tasks them with generating as many unique and coherent ideas as possible. The total number of valid responses serves as a proxy for creative potential. The researchers evaluated 19 base models across 8 providers, totaling 30 variants including reasoning-specific versions. The evaluation covered multiple scientific subfields, providing a broad test of creative capability.
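The counting idea behind the score can be illustrated with a minimal sketch. This is not the benchmark's implementation: the summary here does not say how coherence or uniqueness is judged, so the `is_coherent` callback, the Jaccard word-overlap check, the `max_similarity` threshold, and the sample ideas are all assumptions made for illustration.

```python
import re

def tokens(text):
    """Lowercased word set used for a crude similarity check (an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def count_valid_ideas(ideas, is_coherent, max_similarity=0.5):
    """Count ideas that are coherent and not too similar to any accepted idea."""
    accepted = []
    for idea in ideas:
        if not is_coherent(idea):
            continue
        toks = tokens(idea)
        # Accept only if the idea differs enough from every earlier accepted one.
        if all(jaccard(toks, prev) <= max_similarity for prev in accepted):
            accepted.append(toks)
    return len(accepted)

# Hypothetical responses to one open-ended question.
ideas = [
    "Use machine learning to predict catalyst activity",
    "Use machine learning to predict catalyst activity levels",  # near-duplicate
    "Grow thin films under pulsed laser illumination",
    "asdf",  # fails the toy coherence check
]
score = count_valid_ideas(ideas, is_coherent=lambda s: len(s.split()) >= 3)
print(score)
```

In practice both checks would likely need a stronger judge than word overlap, for example an embedding model or an LLM grader; the sketch only shows how a count of valid responses turns into a single score.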
Three Dimensions of Jaggedness
The paper identifies jaggedness at three distinct levels:
- Cross-task jaggedness: Improvements in general creativity do not translate uniformly to scientific creativity. Models that excel at general creative tasks may underperform on scientific ones, revealing divergent capability profiles.
- Prompt-level jaggedness: Even stronger models do not improve uniformly across prompts. They exhibit high variability, with bursts of creativity on some scientific questions and limited performance on others.
- Domain-level jaggedness: Individual models display uneven strengths across scientific subfields, reflecting fragmented internal capability profiles.
| Type of Jaggedness | Description |
|---|---|
| Cross-task | General vs. scientific creativity improvements diverge |
| Prompt-level | High variability across different scientific questions |
| Domain-level | Uneven strengths across scientific subfields |
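
One way to put a number on domain-level jaggedness is the spread of a model's scores across subfields. The sketch below uses the coefficient of variation as that spread. The metric choice, the model names, and the scores are assumptions for illustration, not the paper's definition or its results.

```python
from statistics import mean, pstdev

def jaggedness(scores):
    """Coefficient of variation: population std dev divided by the mean."""
    values = list(scores.values())
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0

# Hypothetical per-subfield idea counts, not results from the paper.
domain_scores = {
    "model_a": {"chemistry": 20, "physics": 20, "biology": 20},
    "model_b": {"chemistry": 35, "physics": 5, "biology": 20},
}
profile = {name: jaggedness(s) for name, s in domain_scores.items()}
for name, value in profile.items():
    print(f"{name}: {value:.3f}")
```

Both hypothetical models have the same average score, yet only one has a flat profile, which is why an average alone can hide the uneven strengths described above.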
Harnessing Jaggedness for Better Innovation
Rather than seeing jaggedness as a flaw, the researchers show it can be harnessed. They explore three mechanisms: inference-time compute, knowledge pooling, and brainstorming. By combining models into meta-model ensembles, they demonstrate that an ensemble can outperform any single model. This approach positions jaggedness not as a limitation, but as a structural feature of AI progress that, when understood and leveraged, can amplify LLM-driven scientific creativity.
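The intuition behind pooling can be sketched in a few lines: if models are strong on different questions, the union of their distinct ideas is at least as large as the best single model's set. The code below is an illustration under stated assumptions, not the paper's procedure. The `pool_ideas` helper, the exact-match deduplication after normalization, and the sample ideas are all hypothetical.

```python
def normalize(idea):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(idea.lower().split())

def pool_ideas(ideas_by_model):
    """Merge ideas from several models, keeping the first copy of each."""
    seen = set()
    pooled = []
    for ideas in ideas_by_model.values():
        for idea in ideas:
            key = normalize(idea)
            if key not in seen:
                seen.add(key)
                pooled.append(idea)
    return pooled

# Hypothetical outputs from two models for the same question.
ideas_by_model = {
    "model_a": ["Dope the catalyst with nickel", "Screen solvents with robotics"],
    "model_b": [
        "dope the catalyst with  nickel",  # duplicate of a model_a idea
        "Anneal films under strain",
        "Use in situ X-ray scattering",
    ],
}
pooled = pool_ideas(ideas_by_model)
best_single = max(
    len({normalize(i) for i in ideas}) for ideas in ideas_by_model.values()
)
print(len(pooled), best_single)
```

Under this stand-in the pooled count can never fall below the best single model, and the size of the gain depends on how little the models' ideas overlap, which is where uneven capability profiles help.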
Implications for Enterprise AI Strategy
For enterprise technology leaders, these findings suggest that no single LLM may be optimal for all creative tasks. The jaggedness concept implies that organizations should evaluate models on the specific tasks they intend to use them for, and consider ensemble strategies to maximize creative output. The three mechanisms the paper explores (inference-time compute, knowledge pooling, and brainstorming) offer practical pathways to build more robust AI systems for innovation. As LLMs become more prevalent in research and development, understanding and exploiting jaggedness could give enterprises a competitive edge in scientific and technical innovation.