SciText2Eq Study: LLMs Show Limited Accuracy in Generating Equations from Scientific Text for Enterprise AI

A new paper, SciText2Eq, evaluates large language models (LLMs) on generating mathematical equations from scientific texts. The study constructed a dataset from AI research papers and introduced a multi-faceted evaluation protocol. Results show that LLMs achieve only moderate lexical similarity and suffer from poor semantic accuracy, and that LLM-based evaluations correlate poorly with human judgments, highlighting challenges for reliable AI in technical domains.

iGEN Editorial

June 16, 2026

Generating mathematical equations from natural language scientific descriptions is a critical capability for AI systems that could automate tasks in research, engineering, and complex supply chain modelling. However, according to the paper "SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity" on arXiv (June 2026), current large language models (LLMs) perform only moderately on lexical and syntactic similarity and struggle with semantic accuracy when producing equations from scientific text.

The research team (Yifan Mo, Xiao Fu, Yue Su, Qingyu Meng, Koen Hindriks, Qingzhi Liu, and Jiahuan Pei) identified three key challenges in prior work: unstructured grounding (linking equation elements to raw text), multi-equation dependency (handling equations that reference each other), and human-aligned evaluation (ensuring automated scoring matches expert judgment). To address these, they constructed a dataset of AI research papers, pairing contextual passages with ground-truth equations and variable descriptions.

The SciText2Eq Dataset and Workflow

The dataset underpinning SciText2Eq consists of passages from AI research papers, each paired with the ground-truth equations and descriptions of variables. The team then developed an explainable equation generation workflow and evaluated it across diverse open- and closed-source LLM backbones. The workflow aims to produce not only the equation but also step-by-step explanations, increasing transparency for enterprise users who need to verify model outputs.
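To make the pairing concrete, here is a minimal sketch of what one record might look like. The class name, field names, helper function, and sample passage are all illustrative assumptions; the article does not describe the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class EquationExample:
    """One hypothetical record; field names are assumptions, not the paper's schema."""
    context: str                 # passage describing the equation(s) in prose
    equations: list[str]         # ground-truth equations as LaTeX strings
    variables: dict[str, str] = field(default_factory=dict)  # symbol -> description


def undescribed_in_equations(example: EquationExample) -> list[str]:
    """Return described symbols that never appear in any ground-truth equation."""
    joined = " ".join(example.equations)
    return [symbol for symbol in example.variables if symbol not in joined]


# Made-up sample record, for illustration only.
example = EquationExample(
    context="The loss is the mean squared error between predictions and targets.",
    equations=[r"L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2"],
    variables={
        "N": "number of samples",
        "y_i": "target for sample i",
        r"\hat{y}_i": "prediction for sample i",
    },
)
print(undescribed_in_equations(example))
```

A simple consistency check like this only confirms that each described symbol occurs somewhere in the equations; it says nothing about whether an equation is correct.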

Evaluation Protocol: Accuracy, Explainability, and Alignment

The study introduced a three-part evaluation protocol:

Automatic metrics: Standard lexical and syntactic similarity measures (e.g., BLEU, ROUGE).
LLM-based rubrics: Using another LLM to score the generated equations on quality.
Human judgments: Expert annotators evaluated the equations for correctness and explainability.

This combination allowed the researchers to assess accuracy, explainability, and the alignment between human and LLM scoring.
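The gap between surface similarity and correctness can be illustrated with a small sketch. The token-level F1 below is a simplified stand-in for metrics such as BLEU or ROUGE, not the paper's implementation, and the function names and example equations are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(equation: str) -> list[str]:
    """Split an equation into LaTeX commands, identifiers, numbers, and single symbols."""
    return re.findall(r"\\[A-Za-z]+|[A-Za-z]+|\d+|\S", equation)


def token_f1(reference: str, candidate: str) -> float:
    """Unigram overlap F1 between two equations (a rough lexical similarity score)."""
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical pair: the candidate differs from the reference by a single sign.
reference = r"y = w x + b"
candidate = r"y = w x - b"
score = token_f1(reference, candidate)
print(f"token F1: {score:.3f}")
```

Five of the six tokens match, so the lexical score is high even though the two equations describe different relationships. This is the kind of case where an automatic metric alone can overstate quality.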

Key Findings

Evaluation Dimension	LLM Performance
Lexical & syntactic similarity	Moderate
Semantic accuracy	Poor
Alignment between LLM-based and human evaluations	Limited

The results indicate that while LLMs can capture surface-level patterns, they often fail to produce equations that are semantically correct. Furthermore, the limited alignment between LLM-based evaluations and human judgments suggests that LLMs cannot yet be relied on as automatic evaluators of equation quality. The paper notes that these findings "highlight challenges in using LLMs to assess equation quality" and offer insights for improving equation generation models and developing more reliable evaluation methods.
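One common way to quantify agreement between two sets of scores is a rank correlation. The sketch below computes Spearman's coefficient in plain Python; the article does not state which statistic the authors used, and the scores shown are made-up placeholders, not results from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Assumes both inputs have the same length and are not constant
    (a constant input would make the denominator zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder rubric scores on a 1-5 scale, invented for illustration only.
llm_scores = [5, 4, 5, 3, 4]
human_scores = [2, 4, 3, 3, 5]
print(f"Spearman correlation: {spearman(llm_scores, human_scores):.2f}")
```

A coefficient near 1 would mean the LLM judge ranks equations much as human experts do; values near 0 would indicate the kind of limited alignment the study reports.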

Implications for Enterprise AI

For enterprise technology leaders evaluating LLMs for technical automation, such as converting supply-chain planning rules or engineering formulas into executable models, the SciText2Eq findings underscore the need for rigorous, human-in-the-loop validation. The limited semantic accuracy means that off-the-shelf LLMs may introduce costly errors in equation-driven processes. Researchers have provided their code and data on arXiv for reproducibility (licensed under CC BY-NC-ND 4.0), enabling organisations to test and benchmark their own models against this specialised task.

As the field progresses, combining structured grounding, multi-equation handling, and human-aligned evaluation will be essential to deploying LLMs in scientific and industrial applications where precision is non-negotiable.
