As AI-powered communication tools become more prevalent, detecting manipulation and persuasion in digital interactions, especially those involving teenagers, has become a critical concern. The newly released IMPACTeen dataset, described in a paper on arXiv, provides a structured resource for training and evaluating language models on these subtle yet consequential social dynamics.
The dataset contains 1,021 texts covering social influence scenarios across interpersonal, media-based, and digital settings in an adolescent context. It includes 5,100 individual annotation records with gold labels for social influence techniques.
Multi-Perspective Annotation
A key feature of IMPACTeen is its five-perspective annotation approach. Each text was independently annotated by representatives from five distinct groups:
| Perspective | Role |
|---|---|
| Teenagers | Provide youth-centric interpretation |
| Parents | Offer familial context |
| Psychologists | Assess psychological impact |
| Communication Experts | Analyze rhetorical strategies |
| Teachers | Evaluate educational implications |
This multi-dimensional annotation covers influence presence, techniques, intentions, consequences, resistance, reactions, and annotation confidence. The diversity of perspectives allows researchers to study annotator disagreement and its implications for model training.
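As a rough illustration of how annotator disagreement across the five perspectives might be measured, the sketch below computes a majority label and an agreement share per text. The record layout, the field names (`text_id`, `perspective`, `influence_present`), and the sample values are assumptions made for this example. They are not the published IMPACTeen schema or data.

```python
from collections import Counter

# Hypothetical annotation records: field names and values are illustrative
# assumptions, not the actual IMPACTeen schema or data.
records = [
    {"text_id": 1, "perspective": "teenager", "influence_present": True},
    {"text_id": 1, "perspective": "parent", "influence_present": True},
    {"text_id": 1, "perspective": "psychologist", "influence_present": True},
    {"text_id": 1, "perspective": "communication_expert", "influence_present": False},
    {"text_id": 1, "perspective": "teacher", "influence_present": True},
    {"text_id": 2, "perspective": "teenager", "influence_present": False},
    {"text_id": 2, "perspective": "parent", "influence_present": False},
    {"text_id": 2, "perspective": "psychologist", "influence_present": False},
    {"text_id": 2, "perspective": "communication_expert", "influence_present": False},
    {"text_id": 2, "perspective": "teacher", "influence_present": False},
]


def agreement_by_text(records, field="influence_present"):
    """Return {text_id: (majority_label, share_of_annotators_agreeing)}."""
    labels_by_text = {}
    for record in records:
        labels_by_text.setdefault(record["text_id"], []).append(record[field])

    summary = {}
    for text_id, labels in labels_by_text.items():
        # most_common(1) yields the most frequent label and its count.
        label, count = Counter(labels).most_common(1)[0]
        summary[text_id] = (label, count / len(labels))
    return summary


summary = agreement_by_text(records)
for text_id, (label, share) in summary.items():
    print(f"text {text_id}: majority={label}, agreement={share:.2f}")
```

A low agreement share flags texts where the perspectives diverge. Such texts could be examined separately or modeled with per-perspective labels rather than a single gold label.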
Construction and Validation
The dataset was built through constrained LLM generation, followed by a two-step human editing and validation phase aimed at ensuring youth-context realism. According to the paper, this process was designed to produce texts that reflect how adolescents actually communicate.
The resource was created in Polish and is accompanied by a corresponding English version, supporting cross-lingual modeling research.
Potential Applications
IMPACTeen supports research in several areas relevant to enterprise AI systems:
- Social influence detection: Training models to identify when a message is attempting to persuade or manipulate.
- Language model safety: Evaluating whether LLMs generate or amplify manipulative language.
- Annotator disagreement analysis: Understanding how different stakeholders perceive the same communication.
- Cross-lingual modeling: Adapting detection systems across languages.
For enterprise technology decision-makers, the dataset offers a benchmark for building safer conversational AI, particularly in applications involving minors or sensitive communication channels. By grounding model behavior in validated human judgments across multiple expert and non-expert perspectives, IMPACTeen helps bridge the gap between technical performance and real-world ethical considerations.
The authors (Aleksander Szczęsny, Wiktoria Mieleszczenko-Kowszewicz, Maciej Markiewicz, Beata Bajcar, Tomasz Adamczyk, Jolanta Babiak, Grzegorz Chodak, and Przemysław Kazienko) have released the dataset under a Creative Commons Zero license, enabling broad reuse for academic and commercial research.