Generating mathematical equations from natural language scientific descriptions is a critical capability for AI systems that could automate tasks in research, engineering, and complex supply chain modelling. However, according to the paper "SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity" on arXiv (June 2026), current large language models (LLMs) perform only moderately on lexical and syntactic similarity and struggle with semantic accuracy when producing equations from scientific text.
The research team—Mo, Yifan; Fu, Xiao; Su, Yue; Meng, Qingyu; Hindriks, Koen; Liu, Qingzhi; and Pei, Jiahuan—identified three key challenges in prior work: unstructured grounding (linking equation elements to raw text), multi-equation dependency (handling equations that reference each other), and human-aligned evaluation (ensuring automated scoring matches expert judgment). To address these, they constructed a dataset of AI research papers, pairing contextual passages with ground-truth equations and variable descriptions.
The SciText2Eq Dataset and Workflow
Each entry in the SciText2Eq dataset pairs a contextual passage from an AI research paper with its ground-truth equations and descriptions of the variables involved. On top of this data, the team developed an explainable equation generation workflow and evaluated it across diverse open- and closed-source LLM backbones. The workflow aims to produce not only the equation but also a step-by-step explanation, so that enterprise users who need to verify model outputs can trace how each equation was derived.
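To make the pairing concrete, here is a minimal sketch of what one such entry could look like in Python. The class name `EquationRecord`, its field names, and the sample passage are illustrative assumptions, not the schema the authors actually released.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EquationRecord:
    """Hypothetical dataset entry: a passage plus its target equations."""

    context: str            # passage describing the equation in words
    equations: list[str]    # ground-truth equations as LaTeX strings
    variables: dict[str, str] = field(default_factory=dict)  # symbol -> description

    def undefined_symbols(self) -> list[str]:
        """Return described symbols that appear in none of the equations.

        Uses a plain substring check, which is crude but enough for a sanity pass.
        """
        joined = " ".join(self.equations)
        return [sym for sym in self.variables if sym not in joined]


# Invented example entry, not taken from the SciText2Eq data.
record = EquationRecord(
    context="The loss is the mean squared error between predictions and targets.",
    equations=[r"L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2"],
    variables={
        "L": "loss",
        "N": "number of samples",
        "y_i": "target value",
        r"\hat{y}_i": "predicted value",
    },
)

print(record.undefined_symbols())
```

Keeping variable descriptions next to the equations is what lets a later step check whether every described symbol is actually grounded in an equation.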
Evaluation Protocol: Accuracy, Explainability, and Alignment
The study introduced a three-part evaluation protocol:
- Automatic metrics: Standard lexical and syntactic similarity measures (e.g., BLEU, ROUGE).
- LLM-based rubrics: A separate LLM scores the generated equations against a quality rubric.
- Human judgments: Expert annotators evaluated the equations for correctness and explainability.
This combination allowed the researchers to assess accuracy, explainability, and the alignment between human and LLM scoring.
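One common way to quantify alignment between two sets of scores is a rank correlation. The sketch below computes Spearman's rank correlation in plain Python. It is an assumption that a statistic of this kind fits the study, since the summary above does not say which agreement measure the authors used, and the scores shown are invented for illustration.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented 1-5 ratings for five generated equations.
human_scores = [5, 4, 2, 1, 3]
llm_scores = [4, 5, 4, 3, 4]
print(round(spearman(human_scores, llm_scores), 3))
```

A value near 1 would mean the LLM judge orders the equations much as the experts do. Values closer to 0 mean the two rankings have little in common.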
Key Findings
| Evaluation Dimension | LLM Performance |
|---|---|
| Lexical & syntactic similarity | Moderate |
| Semantic accuracy | Poor |
| Alignment between LLM-based and human evaluations | Limited |
The results indicate that while LLMs can capture surface-level patterns, they often fail to produce equations that are semantically correct. Furthermore, the limited alignment between LLM-based evaluations and human judgments suggests that LLMs are not yet reliable as automatic evaluators of equation quality. The paper notes that these findings "highlight challenges in using LLMs to assess equation quality" and offer insights for improving equation generation models and developing more reliable evaluation methods.
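The gap between surface similarity and semantic correctness can be seen with a toy example. The sketch below scores two candidate expressions against a reference using a token-overlap F1 (a simplified stand-in for lexical metrics, not the paper's actual metric) and a numerical equivalence check at random points. The expressions, function names, and tolerance are all illustrative assumptions.

```python
import random
import re
from collections import Counter

# Identifiers, numbers, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+\.?\d*|\S")


def tokens(expr):
    return TOKEN.findall(expr)


def token_f1(candidate, reference):
    """Unigram overlap F1 between two expressions (order is ignored)."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def numerically_equal(candidate, reference, names, trials=50, tol=1e-9, seed=0):
    """Compare two expressions by evaluating both at random variable values."""
    rng = random.Random(seed)
    for _ in range(trials):
        env = {name: rng.uniform(-10, 10) for name in names}
        # eval is used only because these are trusted toy expressions.
        got = eval(candidate, {"__builtins__": {}}, env)
        want = eval(reference, {"__builtins__": {}}, env)
        if abs(got - want) > tol * max(1.0, abs(want)):
            return False
    return True


reference = "a * (b + c)"
wrong = "a * b + c"      # drops the parentheses, so the meaning changes
right = "a * b + a * c"  # expanded form with the same meaning

for name, expr in [("wrong", wrong), ("right", right)]:
    f1 = token_f1(expr, reference)
    same = numerically_equal(expr, reference, ["a", "b", "c"])
    print(name, round(f1, 3), same)
```

Here the incorrect candidate shares more tokens with the reference than the correct one does, so a purely lexical score would rank them the wrong way round. Random-point testing is itself only a heuristic: it can show that two expressions differ, but agreement at sampled points is not a proof of equivalence.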
Implications for Enterprise AI
For enterprise technology leaders evaluating LLMs for technical automation—such as converting supply-chain planning rules or engineering formulas into executable models—the SciText2Eq findings underscore the need for rigorous, human-in-the-loop validation. The limited semantic accuracy means that off-the-shelf LLMs may introduce costly errors in equation-driven processes. Researchers have provided their code and data on arXiv for reproducibility (licensed under CC BY-NC-ND 4.0), enabling organisations to test and benchmark their own models against this specialised task.
As the field progresses, combining structured grounding, multi-equation handling, and human-aligned evaluation will be essential to deploying LLMs in scientific and industrial applications where precision is non-negotiable.