
Self-Consistency Reranking Boosts Accuracy in Narrative Question Answering for Enterprise AI

Researchers propose a self-consistency-based reranking framework for narrative question answering that generates multiple candidates and selects the final answer by semantic agreement. On the NarrativeQA dataset, FLAN-T5-Base improved from 82.32% to 86.66%, and Pegasus-Large jumped from 72.50% to 87.07%. The method requires no architectural changes, making it a drop-in enhancement for enterprise language models.

iGEN Editorial
June 16, 2026

Enterprise software buyers deploying large language models for document analysis often face inconsistency in generated answers, especially when dealing with long narrative texts. A new research paper proposes a lightweight approach to address this variability without modifying the underlying model architecture.

The Challenge of Narrative Question Answering

Narrative question answering (NQA) requires models to understand long textual contexts, capture relationships across events, and generate coherent responses, according to the arXiv paper by Mohamed, Molham, Hamdi, and Ali. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output at inference time, which makes them sensitive to generation variability and often yields incomplete or inconsistent answers.

How Self-Consistency Reranking Works

The proposed self-consistency-based reranking framework is a self-ensemble method that generates multiple candidate answers for each story-question pair. Instead of selecting a single output, the system then chooses the final answer based on semantic agreement among the generated responses. This consensus-based selection allows the model to explore diverse answer formulations while improving robustness, all without modifications to the underlying architecture. The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking.
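The selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the candidate answers are assumed to have already been sampled from a model, and a simple bag-of-words cosine stands in for whatever semantic similarity measure the paper uses. The candidate that agrees most, on average, with the other candidates is returned.

```python
from collections import Counter
from math import sqrt
from typing import Callable, Sequence


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts.

    Assumption: a crude stand-in for a real semantic similarity model.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def rerank_by_agreement(
    candidates: Sequence[str],
    similarity: Callable[[str, str], float] = bow_cosine,
) -> str:
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("at least one candidate answer is required")
    if len(candidates) == 1:
        return candidates[0]

    def agreement(i: int) -> float:
        # Average similarity between candidate i and every other candidate.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(similarity(candidates[i], c) for c in others) / len(others)

    # Ties resolve to the earliest candidate, since max() keeps the first maximum.
    best = max(range(len(candidates)), key=agreement)
    return candidates[best]


if __name__ == "__main__":
    # Hypothetical sampled answers for one story-question pair.
    sampled = ["the old sailor", "an old sailor", "the ship captain", "the old sailor"]
    print(rerank_by_agreement(sampled))
```

In a real deployment the sampled answers would come from multiple decoding passes of the same model, and the similarity function would typically be swapped for an embedding-based measure. The selection logic itself stays the same, which is why the approach needs no change to the model.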

Experimental Results

The research team evaluated their approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings. The results demonstrate consistent improvements across all models.

Model            Baseline accuracy    With self-consistency reranking      Improvement
FLAN-T5-Base     82.32%               86.66%                               +4.34 percentage points
FLAN-T5-Small    not specified        improved (exact figure not given)    -
Pegasus-Large    72.50%               87.07%                               +14.57 percentage points

According to the paper, FLAN-T5-Base achieved the best overall performance, improving from 82.32% to 86.66% (+4.34 percentage points) when combined with self-ensemble inference. The largest improvement, however, was observed with Pegasus-Large, which rose from 72.50% to 87.07% (+14.57 percentage points), highlighting the effectiveness of the proposed strategy for weaker baseline models.

Implications for Enterprise AI

For CTOs and technology procurement leaders evaluating LLMs for tasks like contract analysis, compliance review, or document summarization, this research offers a practical method to boost accuracy without replacing existing models or changing their architecture. Because the technique is architecture-agnostic, it can be layered on top of current deployments, potentially reducing error rates in high-stakes narrative understanding. The experiments cover only the NarrativeQA dataset, but the principles of multi-answer generation and semantic agreement may carry over to other tasks where answer consistency matters.

