
Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation

Researchers have introduced TEND, the first execution-verified benchmark for Text-to-NoSQL translation, comprising 1,210 MongoDB-native tasks. They also propose SAG, a Schema-as-Data Grounding solver, to improve query generation for schema-less document stores. Experiments show that LLMs strong at NL2SQL struggle on TEND, validating Text-to-NoSQL as a distinct problem.

iGEN Editorial
June 16, 2026

Many enterprises rely on NoSQL databases as core data infrastructure, yet natural-language querying of those stores remains underdeveloped. A new research paper posted on arXiv presents TEND (Text-to-NoSQL Dataset) and SAG (Schema-as-Data Grounding), which aim to bridge this gap for MongoDB aggregation pipelines over schema-less document stores.

According to the paper, correct query generation must recover how a non-relational data model represents entities, nested paths, arrays, missing fields, and dynamic keys. This challenge is more complex than traditional SQL querying because NoSQL databases like MongoDB store data without a fixed schema.

The Challenge of Schema-less Document Stores

NoSQL databases are widely used for their flexibility, but that flexibility is what makes natural-language access hard. The authors (Lu, Jinwei; Jiawei, Zhang; Chen, Qin; Zhiqian, Haodi; Song, Yuanfeng; and Wong, Raymond Chi-Wing) note that translating natural language requests into executable NoSQL queries requires understanding how entities and relationships are encoded in non-relational models. For example, a query must handle nested arrays, optional and sparse paths, and polymorphic shapes, features with no direct counterpart in a fixed relational schema.
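To make the nested-array and missing-field cases concrete, here is a minimal sketch, not taken from the paper. The collection shape, the field names (`customer`, `items`, `qty`) and the question are all hypothetical. The pipeline is ordinary MongoDB aggregation syntax written as Python data, and the helper is a pure-Python mirror of it for this sample data only, so the example runs without a database.

```python
from collections import defaultdict

# Hypothetical question: "How many units did each customer order?"
# Assumed convention: a line item without "qty" counts as one unit.
PIPELINE = [
    # One output document per array element; documents whose "items"
    # field is missing or empty are dropped.
    {"$unwind": {"path": "$items", "preserveNullAndEmptyArrays": False}},
    # Sum quantities per customer, substituting 1 when "qty" is absent.
    {"$group": {
        "_id": "$customer",
        "total_qty": {"$sum": {"$ifNull": ["$items.qty", 1]}},
    }},
    {"$sort": {"_id": 1}},
]

# Hypothetical documents showing a sparse field and an optional array.
DOCS = [
    {"customer": "ana", "items": [{"sku": "A", "qty": 2}, {"sku": "B"}]},
    {"customer": "ben"},                      # no "items" field at all
    {"customer": "ana", "items": []},         # empty array
    {"customer": "cy", "items": [{"sku": "C", "qty": 5}]},
]


def total_qty_by_customer(docs):
    """Pure-Python mirror of PIPELINE, valid for the sample data above."""
    totals = defaultdict(int)
    for doc in docs:
        for item in doc.get("items") or []:   # mirrors $unwind
            qty = item.get("qty")
            totals[doc["customer"]] += 1 if qty is None else qty  # mirrors $ifNull
    return [{"_id": c, "total_qty": n} for c, n in sorted(totals.items())]


if __name__ == "__main__":
    print(total_qty_by_customer(DOCS))
```

A translator that assumed every order has a non-empty `items` array with a `qty` on each element would produce a simpler pipeline that silently miscounts on data like this.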

TEND: An Execution-Verified Benchmark

The paper presents TEND, an execution-verified benchmark with 1,210 MongoDB-native tasks across 11 databases. To the authors' knowledge, TEND is the first Text-to-NoSQL benchmark whose database worlds are MongoDB-native by design. Experts manually defined collection boundaries, nested arrays, optional and sparse paths, polymorphic shapes, and dynamic-key conventions. The worlds are populated with real data and verified through frozen MongoDB execution. This ensures that TEND evaluates schema-less document reasoning rather than SQL-to-MQL transfer.
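The document features listed above can be surfaced by profiling stored documents. The sketch below is an illustration under assumptions, not TEND's construction procedure or SAG's grounding step. The function names, the `[]` path suffix for array elements and the sample documents are all hypothetical. It walks each document and records, for every dotted path, which value types occur and what share of documents contain it.

```python
from collections import defaultdict


def _walk(value, prefix, types, seen):
    """Record every path below `value`, using '[]' for array elements."""
    if isinstance(value, dict):
        children = [(f"{prefix}.{k}" if prefix else k, v) for k, v in value.items()]
    elif isinstance(value, list):
        children = [(prefix + "[]", element) for element in value]
    else:
        return
    for path, child in children:
        types[path].add(type(child).__name__)
        seen.add(path)
        _walk(child, path, types, seen)


def profile_paths(docs):
    """Map each path to the value types seen and its document coverage."""
    if not docs:
        return {}
    types = defaultdict(set)
    present = defaultdict(int)
    for doc in docs:
        seen = set()
        _walk(doc, "", types, seen)
        for path in seen:
            present[path] += 1   # count each path once per document
    return {
        path: {"types": sorted(types[path]), "coverage": present[path] / len(docs)}
        for path in types
    }


# Hypothetical documents: "contact" is polymorphic, "tags" is optional.
DOCS = [
    {"name": "Ana", "contact": {"email": "a@x.org"}, "tags": ["vip"]},
    {"name": "Ben", "contact": "ben@x.org"},
    {"name": "Cy", "tags": []},
]

if __name__ == "__main__":
    for path, info in sorted(profile_paths(DOCS).items()):
        print(path, info)
```

In this profile, a path with coverage below 1.0 is optional or sparse, and a path with more than one type is polymorphic. A fixed relational schema would state both facts up front, whereas here they have to be read off the data.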

SAG: Schema-as-Data Grounding Solver

The authors further introduce SAG, a Schema-as-Data Grounding solver. SAG induces path and value grounding from stored-document evidence before bounded MQL generation, followed by execution-grounded repair and result-consistency selection. Evaluation uses bounded column-tolerant execution accuracy (EXC) as the headline metric, complemented by a graded result-set F1 and a mutually exclusive execution-outcome decomposition.
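The article does not give the paper's exact definitions of EXC or the graded result-set F1, so the following is only one plausible reading, stated as an assumption. It treats predicted and expected results as multisets of rows and computes F1 over their overlap. The function names are hypothetical, and the column tolerance mentioned above is not modeled.

```python
import json
from collections import Counter


def _row_key(row):
    # Canonical JSON makes dict rows hashable and independent of key order.
    return json.dumps(row, sort_keys=True, default=str)


def result_set_f1(predicted, expected):
    """Row-level F1 between two result sets, compared as multisets."""
    pred = Counter(_row_key(r) for r in predicted)
    gold = Counter(_row_key(r) for r in expected)
    if not pred and not gold:
        return 1.0                      # both empty: treated as a perfect match
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [{"_id": "ana", "n": 3}, {"_id": "cy", "n": 5}]
    pred = [{"n": 3, "_id": "ana"}, {"_id": "dee", "n": 1}]
    print(result_set_f1(pred, gold))
```

A graded score of this kind gives partial credit to a query that returns some of the right rows, which a strict pass or fail execution check cannot do.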

Implications for AI-Powered Data Access

Experiments demonstrate that large language models (LLMs) with strong NL2SQL performance degrade substantially on TEND, validating Text-to-NoSQL as a distinct schema-less document reasoning problem. This finding highlights the need for specialized approaches when applying natural language interfaces to NoSQL databases. For enterprises, this research points to a future where complex querying of diverse data stores becomes more accessible, but significant work remains to match the maturity of SQL-based solutions.

Aspect          | NL2SQL (Relational) | Text-to-NoSQL (Document)
Schema          | Fixed, known schema | Schema-less, dynamic keys
Data model      | Tables and joins    | Nested arrays, optional paths
Benchmarks      | Mature benchmarks   | First execution-verified benchmark (TEND)
LLM performance | Strong              | Degrades substantially

The paper is available on arXiv and represents a foundational step for enabling natural language querying of NoSQL systems, a critical capability for data-driven enterprises managing diverse and flexible data architectures.

