Self-Consistency Reranking Boosts Accuracy in Narrative Question Answering for Enterprise AI

Researchers propose a self-consistency-based reranking framework for narrative question answering that generates multiple candidates and selects the final answer by semantic agreement. On the NarrativeQA dataset, FLAN-T5-Base improved from 82.32% to 86.66%, and Pegasus-Large jumped from 72.50% to 87.07%. The method requires no architectural changes, making it a drop-in enhancement for enterprise language models.

iGEN Editorial

June 16, 2026

Enterprise software buyers deploying large language models for document analysis often face inconsistent generated answers, especially when dealing with long narrative texts. A new research paper proposes a lightweight approach to address this variability without modifying the underlying model architecture.

The Challenge of Narrative Question Answering

Narrative question answering (NQA) requires models to understand long textual contexts, capture relationships across events, and generate coherent responses, according to the arXiv paper by Mohamed, Molham, Hamdi, and Ali. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output during inference, which makes them sensitive to generation variability and often results in incomplete or inconsistent answers.

How Self-Consistency Reranking Works

The proposed self-consistency-based reranking framework is a self-ensemble method that generates multiple candidate answers for each story-question pair. Rather than relying on a single decoded output, the system chooses the final answer based on semantic agreement among the generated responses. This consensus-based selection allows the model to explore diverse answer formulations while improving robustness, all without modifications to the underlying architecture. The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking.
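The selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the article does not say which similarity measure the paper uses, so the token-overlap F1 below is an assumed stand-in for a semantic similarity score, and the function names `token_f1` and `rerank_by_agreement` are hypothetical. The candidate answers are hard-coded here; in a real deployment they would come from sampling the language model several times for the same story-question pair.

```python
from collections import Counter


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two answers (assumed stand-in for semantic similarity)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ta)
    recall = overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def rerank_by_agreement(candidates, similarity=token_f1):
    """Return the candidate with the highest mean similarity to the other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = None, -1.0
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(similarity(cand, o) for o in others) / len(others)
        # Strict comparison keeps the earliest candidate when scores tie.
        if score > best_score:
            best, best_score = cand, score
    return best


# Hypothetical candidates for one story-question pair.
candidates = [
    "she hid the letter in the attic",
    "the letter was hidden in the attic",
    "she hid the letter in the attic",
    "he burned the letter",
]
selected = rerank_by_agreement(candidates)
print(selected)
```

Because `similarity` is a parameter, an embedding-based score could be swapped in without changing the selection logic. The outlier answer about burning the letter agrees least with the rest and is never selected.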

Experimental Results

The research team evaluated their approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings. The results demonstrate consistent improvements across all models.

Model	Baseline Accuracy	With Self-Consistency Reranking	Improvement (percentage points)
FLAN-T5-Base	82.32%	86.66%	+4.34
FLAN-T5-Small	not specified	improved (exact figure not given)	not specified
Pegasus-Large	72.50%	87.07%	+14.57

FLAN-T5-Base is reported as the best overall performer, improving from 82.32% to 86.66% (+4.34 percentage points) when combined with self-ensemble inference. The largest improvement was observed with Pegasus-Large, which rose from 72.50% to 87.07% (+14.57 percentage points), suggesting the strategy is especially effective for weaker baseline models.
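The quoted gains are differences between two accuracy percentages, so they are percentage points rather than relative increases. The short Python check below reproduces those differences from the reported scores and also derives the relative gain over each baseline. The relative figures are computed here for illustration and are not reported in the source; the helper name `gains` is hypothetical.

```python
# Reported accuracies (baseline, with reranking), taken from the figures above.
results = {
    "FLAN-T5-Base": (82.32, 86.66),
    "Pegasus-Large": (72.50, 87.07),
}


def gains(baseline: float, reranked: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = reranked - baseline
    relative = 100 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


for model, (baseline, reranked) in results.items():
    absolute, relative = gains(baseline, reranked)
    print(f"{model}: +{absolute} points, +{relative}% relative to baseline")
```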

Implications for Enterprise AI

For CTOs and technology procurement leaders evaluating LLMs for tasks like contract analysis, compliance review, or document summarization, this research offers a practical method to boost accuracy without retraining or replacing existing models. The technique's architecture-agnostic nature means it can be layered on top of current deployments, potentially reducing error rates in high-stakes narrative understanding. The experiments focus on the NarrativeQA dataset, but the principles of multi-answer generation and semantic agreement may carry over to other tasks where consistency matters.
