Topic
synthetic data
SpecAlign Framework Uses Synthetic Data to Align Large Language Models with Specific Policies
A research paper introduces SpecAlign, a framework that generates synthetic training data from provider-authored model specifications to align large language models with specific policies. The method combines structured rule annotation, controllable instantiation, and multi-agent adversarial data synthesis to create preference pairs for fine-tuning. Experiments show improved rule compliance without sacrificing general capabilities.
New Auditing Framework Detects Synthetic Data Privacy Leaks Without Model Access
A new causal framework for auditing synthetic data detects privacy leaks by distinguishing true disclosures from phantom ones. It uses statistical hypothesis testing with holdout sets, requires no model access or canary insertion, and is orders of magnitude more efficient than shadow-model approaches.
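To make the holdout idea concrete, here is a minimal sketch of one generic holdout-based test. It is not the paper's causal framework, and every name in it (`holdout_leak_test`, `nearest_distance`) is hypothetical. It assumes numeric records, and a training set and holdout set of equal size drawn from the same distribution. Under those assumptions, if the generator leaked nothing specific to its training records, a synthetic record is equally likely to fall closer to a training record as to a holdout record, so a one-sided sign test applies.

```python
import math
import random


def nearest_distance(record, dataset):
    """Euclidean distance from `record` to its closest neighbour in `dataset`."""
    return min(math.dist(record, other) for other in dataset)


def holdout_leak_test(train, holdout, synthetic):
    """One-sided sign test (illustrative assumption, not the paper's method).

    Null hypothesis: each synthetic record is as likely to be closer to the
    holdout set as to the training set. Returns (closer, informative, p_value).
    """
    closer = 0       # synthetic records strictly closer to train than to holdout
    informative = 0  # synthetic records without a distance tie
    for s in synthetic:
        d_train = nearest_distance(s, train)
        d_hold = nearest_distance(s, holdout)
        if d_train == d_hold:
            continue  # ties carry no information for a sign test
        informative += 1
        if d_train < d_hold:
            closer += 1
    # Binomial(informative, 0.5) upper tail: P(X >= closer)
    tail = sum(math.comb(informative, k) for k in range(closer, informative + 1))
    return closer, informative, tail / 2 ** informative


if __name__ == "__main__":
    rng = random.Random(0)

    def draw(n):
        return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

    train, holdout = draw(200), draw(200)
    # A deliberately leaky "generator": near-copies of training records.
    leaky = [(x + rng.gauss(0, 1e-3), y + rng.gauss(0, 1e-3)) for x, y in train[:50]]
    print(holdout_leak_test(train, holdout, leaky))
    print(holdout_leak_test(train, holdout, draw(50)))
```

A test of this kind needs only the three datasets, which is consistent with the summary's point that no model access or canary insertion is required. A small p-value indicates that the synthetic records sit unusually close to the training data.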
StateGen Platform Generates Synthetic Training Data for Tool-Augmented LLMs with 9.66/10 Hallucination Score
Researchers introduce StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations for tool-augmented LLMs. The platform uses a four-role LLM loop and an authoritative state manager to eliminate tool-call hallucinations, achieving a hallucination score of 9.66/10 across 64,698 evaluated conversations.
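The summary does not say how the state manager works, so the following is only a sketch of the general idea: tool results come from a single store of ground-truth state and are never written by a language model. The class, the tool names (`create_order`, `cancel_order`), and the result format are all hypothetical and are not taken from StateGen.

```python
from dataclasses import dataclass, field


@dataclass
class StateManager:
    """Hypothetical single source of truth for tool calls and their results."""

    orders: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def call(self, tool, **args):
        """Execute a tool call against real state and record the outcome."""
        handler = getattr(self, f"_tool_{tool}", None)
        if handler is None:
            result = {"ok": False, "error": f"unknown tool: {tool}"}
        else:
            try:
                result = handler(**args)
            except TypeError as exc:  # wrong or missing arguments
                result = {"ok": False, "error": f"bad arguments: {exc}"}
        self.log.append((tool, args, result))
        return result

    def _tool_create_order(self, order_id, item):
        if order_id in self.orders:
            return {"ok": False, "error": "order already exists"}
        self.orders[order_id] = {"item": item, "status": "open"}
        return {"ok": True, "order": dict(self.orders[order_id])}

    def _tool_cancel_order(self, order_id):
        order = self.orders.get(order_id)
        if order is None:
            # A model that invents an order ID gets an error, not a fake success.
            return {"ok": False, "error": "no such order"}
        order["status"] = "cancelled"
        return {"ok": True, "order": dict(order)}


if __name__ == "__main__":
    state = StateManager()
    print(state.call("cancel_order", order_id="A1"))   # rejected: never created
    print(state.call("create_order", order_id="A1", item="lamp"))
    print(state.call("cancel_order", order_id="A1"))
```

In a design like this, the conversation-writing models only propose calls. Whatever appears in the training conversation as a tool result is copied from `call`, so a result for a nonexistent tool or entity cannot be invented.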