
SciText2Eq Study: LLMs Show Limited Accuracy in Generating Equations from Scientific Text for Enterprise AI

A new paper, SciText2Eq, evaluates large language models (LLMs) on generating mathematical equations from scientific texts. The study constructed a dataset from AI research papers and introduced a multi-faceted evaluation protocol. Results show that LLMs achieve only moderate lexical similarity and suffer from poor semantic accuracy, and that LLM-based evaluations correlate poorly with human judgments, highlighting challenges for reliable AI in technical domains.

iGEN Editorial
June 16, 2026

Generating mathematical equations from natural language scientific descriptions is a critical capability for AI systems that could automate tasks in research, engineering, and complex supply chain modelling. However, according to the paper "SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity" on arXiv (June 2026), current large language models (LLMs) perform only moderately on lexical and syntactic similarity and struggle with semantic accuracy when producing equations from scientific text.

The research team (Yifan Mo, Xiao Fu, Yue Su, Qingyu Meng, Koen Hindriks, Qingzhi Liu, and Jiahuan Pei) identified three key challenges in prior work: unstructured grounding (linking equation elements to raw text), multi-equation dependency (handling equations that reference each other), and human-aligned evaluation (ensuring automated scoring matches expert judgment). To address these, they constructed a dataset of AI research papers, pairing contextual passages with ground-truth equations and variable descriptions.

The SciText2Eq Dataset and Workflow

The dataset underpinning SciText2Eq consists of passages from AI research papers, each paired with the ground-truth equations and descriptions of variables. The team then developed an explainable equation generation workflow and evaluated it across diverse open- and closed-source LLM backbones. The workflow aims to produce not only the equation but also step-by-step explanations, increasing transparency for enterprise users who need to verify model outputs.
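As a rough illustration of the pairing described above, one record could be modelled as follows. This is a minimal sketch only: the class name `EquationExample`, its field names, the helper `undescribed_symbols`, and the sample passage are all hypothetical and are not taken from the paper's released data format.

```python
from dataclasses import dataclass, field


@dataclass
class EquationExample:
    """One hypothetical record: a passage plus its ground-truth equations."""
    context: str                      # passage taken from a paper
    equations: list[str]              # ground-truth equations as LaTeX strings
    variables: dict[str, str] = field(default_factory=dict)  # symbol -> description

    def undescribed_symbols(self, symbols):
        """Return the given symbols that have no variable description."""
        return [s for s in symbols if s not in self.variables]


# Invented sample record for illustration (not from the SciText2Eq dataset).
example = EquationExample(
    context="The loss is the mean squared error between predictions and targets.",
    equations=[r"L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2"],
    variables={"L": "loss", "N": "number of samples", "y_i": "target for sample i"},
)
missing = example.undescribed_symbols(["L", "N", "y_i", r"\hat{y}_i"])
```

A check like `undescribed_symbols` is one simple way a reviewer might spot equation symbols that the accompanying text never grounds.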

Evaluation Protocol: Accuracy, Explainability, and Alignment

The study introduced a three-part evaluation protocol:

  • Automatic metrics: Standard lexical and syntactic similarity measures (e.g., BLEU, ROUGE).
  • LLM-based rubrics: Using another LLM to score the generated equations on quality.
  • Human judgments: Expert annotators evaluated the equations for correctness and explainability.

This combination allowed the researchers to assess accuracy, explainability, and the alignment between human and LLM scoring.
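One common way to quantify agreement between two sets of scores is a rank correlation. The sketch below computes Spearman's rho in plain Python. It assumes each generated equation receives one human score and one LLM-rubric score; the article does not say which agreement statistic the paper uses, and the scores shown are made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up scores on a 1-5 scale for six generated equations (not from the paper).
human_scores = [5, 2, 4, 1, 3, 2]
llm_scores = [4, 4, 5, 3, 4, 5]
rho = spearman(human_scores, llm_scores)
```

A value of rho near 1 would mean the LLM judge ranks equations much as the experts do; values closer to 0 indicate the kind of weak alignment the study reports.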

Key Findings

Evaluation Dimension                                | LLM Performance
Lexical & syntactic similarity                      | Moderate
Semantic accuracy                                   | Poor
Alignment between LLM-based and human evaluations   | Limited

The results indicate that while LLMs can capture surface-level patterns, they often fail to produce equations that are semantically correct. Furthermore, the limited alignment between LLM-based evaluations and human judgments suggests that LLMs are not yet reliable as automatic evaluators of equation quality. The paper notes that these findings "highlight challenges in using LLMs to assess equation quality" and offer insights for improving equation generation models and developing more reliable evaluation methods.
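The gap between surface similarity and semantic correctness can be shown with a toy case. The sketch below is not the paper's protocol: `token_f1` is a simple bag-of-tokens overlap standing in for lexical metrics, `numerically_equal` is a hypothetical spot check on random inputs, and the two equations are invented for illustration.

```python
import random
from collections import Counter


def token_f1(reference, candidate):
    """Bag-of-tokens F1 between two whitespace-tokenised equation strings."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def numerically_equal(f, g, arity, trials=200, tol=1e-9, seed=0):
    """Spot-check that f and g agree on random inputs (evidence, not a proof)."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = [rng.uniform(-5.0, 5.0) for _ in range(arity)]
        expected = f(*args)
        if abs(expected - g(*args)) > tol * max(1.0, abs(expected)):
            return False
    return True


# Invented pair: the candidate differs from the reference by a single sign.
reference_text = "y = w * x + b"
candidate_text = "y = w * x - b"
reference_fn = lambda w, x, b: w * x + b
candidate_fn = lambda w, x, b: w * x - b

lexical = token_f1(reference_text, candidate_text)  # 6 of 7 tokens match
semantic_match = numerically_equal(reference_fn, candidate_fn, arity=3)
```

Here the candidate shares six of seven tokens with the reference, so the lexical score is high, yet the numeric check rejects it. Random sampling can only find disagreements; passing it does not prove two equations are equivalent.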

Implications for Enterprise AI

For enterprise technology leaders evaluating LLMs for technical automation, such as converting supply-chain planning rules or engineering formulas into executable models, the SciText2Eq findings underscore the need for rigorous, human-in-the-loop validation. Poor semantic accuracy means that off-the-shelf LLMs may introduce costly errors in equation-driven processes. The researchers have provided their code and data on arXiv for reproducibility (licensed under CC BY-NC-ND 4.0), enabling organisations to test and benchmark their own models against this specialised task.

As the field progresses, combining structured grounding, multi-equation handling, and human-aligned evaluation will be essential to deploying LLMs in scientific and industrial applications where precision is non-negotiable.

