Enterprise software buyers deploying large language models for document analysis often get inconsistent answers, especially when the source material is a long narrative text. A new research paper proposes a lightweight approach to reducing this variability without modifying the underlying model architecture.
The Challenge of Narrative Question Answering
Narrative question answering (NQA) requires models to understand long textual contexts, capture relationships across events, and generate coherent responses, according to the paper by Mohamed, Molham, Hamdi, and Ali, posted on arXiv. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output during inference, which makes them sensitive to generation variability and often results in incomplete or inconsistent answers.
How Self-Consistency Reranking Works
The proposed self-consistency-based reranking framework is a self-ensemble method that generates multiple candidate answers for each story-question pair. Instead of selecting a single output, the system then chooses the final answer based on semantic agreement among the generated responses. This consensus-based selection allows the model to explore diverse answer formulations while improving robustness, all without modifications to the underlying architecture. The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking.
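The selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `bow_cosine` and `rerank_by_agreement`, the bag-of-words cosine similarity, and the candidate answers are all assumptions made for illustration. A real system would sample the candidates from the language model and would likely score agreement with a sentence-embedding model rather than word counts.

```python
from collections import Counter
from math import sqrt
from typing import Callable, Sequence


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts.

    Assumption: a crude stand-in for a semantic similarity model.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def rerank_by_agreement(
    candidates: Sequence[str],
    similarity: Callable[[str, str], float] = bow_cosine,
) -> str:
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i: int) -> float:
        # Average similarity between candidate i and every other candidate.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(similarity(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]


# Hypothetical candidates, as if sampled for one story-question pair.
candidates = [
    "the king's brother",
    "his brother the duke",
    "the king's brother",
    "a travelling merchant",
]
answer = rerank_by_agreement(candidates)
print(answer)
```

In this toy run the outlier answer shares no words with the rest and scores zero agreement, while the answer that the other candidates most resemble is selected. Swapping in a different `similarity` function changes how agreement is measured without touching the selection logic.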
Experimental Results
The research team evaluated their approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings. The results demonstrate consistent improvements across all models.
| Model | Baseline Accuracy | With Self-Consistency Reranking | Improvement (percentage points) |
|---|---|---|---|
| FLAN-T5-Base | 82.32% | 86.66% | +4.34 |
| FLAN-T5-Small | not reported | improved, exact figure not given | not reported |
| Pegasus-Large | 72.50% | 87.07% | +14.57 |
The paper reports FLAN-T5-Base as the best overall performer, improving from 82.32% to 86.66% (a gain of 4.34 percentage points) when combined with self-ensemble inference. The largest improvement, however, was observed with Pegasus-Large, which rose from 72.50% to 87.07% (a gain of 14.57 percentage points), suggesting the strategy is particularly effective for models with weaker baselines.
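The reported gains are differences between two accuracy scores, so they are absolute gains in percentage points rather than relative increases. The short sketch below makes the distinction explicit using only the figures quoted above; the helper name `gains` is an assumption introduced for illustration.

```python
from typing import Tuple


def gains(baseline: float, reranked: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = reranked - baseline
    relative = absolute / baseline * 100
    return round(absolute, 2), round(relative, 2)


# Accuracy figures as quoted in the article: (baseline, with reranking).
reported = {
    "FLAN-T5-Base": (82.32, 86.66),
    "Pegasus-Large": (72.50, 87.07),
}

for model, (base, reranked) in reported.items():
    abs_pp, rel_pct = gains(base, reranked)
    print(f"{model}: +{abs_pp:.2f} points absolute, +{rel_pct:.2f}% relative")
```

Measured relative to its own baseline, the FLAN-T5-Base gain works out to roughly 5.27% and the Pegasus-Large gain to roughly 20.10%.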
Implications for Enterprise AI
For CTOs and technology procurement leaders evaluating LLMs for tasks like contract analysis, compliance review, or document summarization, this research offers a practical method to boost accuracy without retraining or replacing existing models. Because the technique is architecture-agnostic, it can be layered on top of current deployments, potentially reducing error rates in high-stakes narrative understanding. The experiments are limited to the NarrativeQA dataset, so gains on other tasks are not guaranteed, but the underlying ideas of multi-answer generation and semantic agreement are not specific to that dataset.