PVminerLLM2 Uses Preference Optimization to Improve Structured Patient Voice Extraction

Researchers introduce PVminerLLM2, an improved set of LLMs for structured extraction of patient voice from unstructured text. The model uses preference optimization with token-level gated stabilization and confusion-aware pair construction to outperform supervised fine-tuning baselines. The code and trained models are publicly available.

iGEN Editorial
June 16, 2026
Unstructured patient-generated text captures critical information about lived experiences, social context, and care engagement, but its clinical value remains locked without structured extraction. A new family of language models, PVminerLLM2, aims to unlock that data by applying preference optimization — a technique that refines model outputs beyond what supervised fine-tuning (SFT) can achieve.

Limitations of Supervised Fine-Tuning

Prior work established the PV-Miner benchmark and the PVminerLLM models for structured extraction of patient voice. However, according to the researchers, supervised fine-tuning alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. These errors, such as mislabeling a single token in a clinical code, can render an extraction useless.

Key Technical Innovations

The team behind PVminerLLM2 introduces four main innovations to overcome SFT limitations:

  • Token-level gated stabilization term: prevents degradation of absolute token likelihood under preference optimization, so the model does not forget high-confidence tokens while learning from preferences.
  • Confusion-aware preference pair construction: better captures low-separation distinctions by deliberately building training pairs from the tokens the model finds hardest to distinguish.
  • Token-importance weighting: assigns higher weight to tokens that are critical for correct extraction.
  • Inverse-frequency reweighting: addresses token imbalance and class skew, which are common in medical text where certain codes appear far more often than others.
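
The article does not give the loss formula, so the following is only a minimal Python sketch of how the four ideas above might fit together, assuming a DPO-style objective over per-token log-probabilities. Every function name, the label names, the confusion counts, and the parameters beta and lam are hypothetical illustrations and are not taken from the PVminerLLM2 code.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each label by 1/count, scaled so the average weight per example is 1."""
    counts = Counter(labels)
    raw = {label: 1.0 / n for label, n in counts.items()}
    mean = sum(raw[label] for label in labels) / len(labels)
    return {label: w / mean for label, w in raw.items()}


def hardest_negative(gold, confusion):
    """Return the wrong label most often predicted in place of `gold`, or None."""
    rivals = {pred: n for (g, pred), n in confusion.items()
              if g == gold and pred != gold}
    return max(rivals, key=rivals.get) if rivals else None


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gated_preference_loss(pol_c, ref_c, pol_r, ref_r, w_c, w_r, beta=0.1, lam=1.0):
    """Assumed form: weighted DPO-style term plus a gated per-token penalty.

    pol_* / ref_* are per-token log-probs under the policy and a frozen
    reference model for the chosen (c) and rejected (r) outputs; w_* are
    per-token weights (importance and/or inverse frequency).
    """
    margin = beta * (
        sum(w * (p - r) for w, p, r in zip(w_c, pol_c, ref_c))
        - sum(w * (p - r) for w, p, r in zip(w_r, pol_r, ref_r))
    )
    preference = softplus(-margin)  # equals -log(sigmoid(margin))
    # Gate: a chosen token is penalized only when its policy log-prob
    # has dropped below the reference log-prob.
    stabilization = sum(w * max(r - p, 0.0) for w, p, r in zip(w_c, pol_c, ref_c))
    return preference + lam * stabilization


# Hypothetical usage with made-up labels and log-probs.
weights = inverse_frequency_weights(["stress", "stress", "stress", "anxiety"])
confusion = {("anxiety", "stress"): 9, ("anxiety", "depression"): 5}
rejected = hardest_negative("anxiety", confusion)
loss = gated_preference_loss(
    pol_c=[-0.9, -0.2], ref_c=[-0.5, -0.3],
    pol_r=[-1.5, -0.4], ref_r=[-1.0, -0.4],
    w_c=[weights["anxiety"], 1.0], w_r=[weights["stress"], 1.0],
)
print(rejected, round(loss, 4))
```

In this sketch the rare label receives the larger weight, the rejected output is built from the label most often confused with the gold one, and the penalty is active only for chosen tokens whose likelihood has fallen relative to the reference. The published method may define each of these pieces differently.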

Performance Gains Across Metrics

Evaluated across multiple model sizes, PVminerLLM2 consistently outperformed strong baselines, including baseline LLMs trained with existing preference optimization methods. The improvements are summarized below:

Metric      Gain over baseline
Code        4.43%
Sub-code    3.50%
Span        1.55%

These gains, while modest in absolute terms, represent substantial reductions in token-critical errors for structured extraction tasks.

Availability and Implementation

The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available through the project URL given with the arXiv paper. This open release allows other researchers and enterprises to apply the same preference optimization techniques to their own structured extraction problems, whether in healthcare documentation, clinical trial data mining, or other domains where token-level accuracy is paramount.

The paper, "PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization", was written by Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, and Aimee Roundtree, and appeared on arXiv as a Computer Science paper.

