As video content continues to expand across educational platforms, recorded lectures, and live-streamed entertainment, the need for efficient and structured analysis of long-form footage has increased, according to a new arXiv preprint. However, many existing AI tools summarize video only at a high level, working from AI-generated transcripts. Their output is often a coarse overview that lacks detailed analysis of a video's structure, thematic progression, and semantic relationships.
Scribby: A Multi-Level LLM Framework aims to address this gap by proposing an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis. The framework, detailed in the paper by Abelarde, Julian, Belinchon, and Hugo Garrido-Lestache, establishes a foundation for video analysis tools that visualize semantic chunking and semantic matching through relevance-based heatmaps.
Technical Approach: Micro-Level Indexing with LLM as Judge
The first stage of the Scribby process indexes the video at a micro level through three steps:
- Analyzing the full transcript at a global level
- Analyzing individual transcript sentences
- Grouping these sentences by semantic similarity using an LLM as a judge
Contextual continuity is retained during sentence-level processing by incorporating both the global transcript analysis and adjacent sentence information into each evaluation prompt. This approach ensures that micro-level understanding is grounded in the broader narrative of the video.
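To make the idea of context-carrying prompts concrete, here is a minimal sketch of how one sentence-level prompt could be assembled. It is not taken from the paper: the function name `build_sentence_prompt`, the `window` parameter, and the prompt wording are all hypothetical, and the preprint's actual prompts may differ.

```python
from typing import List


def build_sentence_prompt(
    global_summary: str,
    sentences: List[str],
    index: int,
    window: int = 1,  # assumed: how many neighbours to include on each side
) -> str:
    """Assemble a prompt for one sentence with global and adjacent context.

    Illustrative only: the layout and wording are assumptions, not the
    paper's prompt.
    """
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    before = sentences[start:index]
    after = sentences[index + 1:end]

    lines = [
        "Global transcript analysis:",
        global_summary,
        "",
        "Preceding sentences:",
        *(before or ["(none)"]),
        "",
        "Target sentence:",
        sentences[index],
        "",
        "Following sentences:",
        *(after or ["(none)"]),
        "",
        "Describe the target sentence's topic and its role in the video.",
    ]
    return "\n".join(lines)


# Example: the first sentence has no predecessor, so only the next one is attached.
prompt = build_sentence_prompt(
    "A lecture introducing gradient descent.",
    ["alpha sentence", "beta sentence", "gamma sentence"],
    index=0,
)
```

Passing the global analysis into every call means each sentence is evaluated against the same macro-level reading, while the window supplies local continuity.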
| Step | Description |
|---|---|
| 1 | Full transcript analysis (macro-level context) |
| 2 | Individual sentence analysis |
| 3 | Sentence grouping by semantic similarity via LLM-as-judge |
The use of an LLM as a judge, in which the language model itself evaluates semantic similarity, is a key innovation: it allows the framework to capture nuanced relationships between segments without requiring pre-defined categories.
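The grouping step can be sketched as a loop that asks a judge whether each sentence belongs with the one before it. This is a simplified illustration under stated assumptions, not the paper's algorithm: it assumes groups are formed from consecutive sentences, the names `group_sentences` and `overlap_judge` are hypothetical, and the judge here is a word-overlap stand-in so the example runs without a model. In practice the judge would be a call to an LLM.

```python
from typing import Callable, List, Set

# A judge takes two sentences and says whether they belong together.
Judge = Callable[[str, str], bool]


def group_sentences(sentences: List[str], same_topic: Judge) -> List[List[str]]:
    """Greedily merge consecutive sentences that the judge considers related."""
    groups: List[List[str]] = []
    for sentence in sentences:
        # Compare against the last sentence of the current group.
        if groups and same_topic(groups[-1][-1], sentence):
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
    return groups


def _content_words(text: str) -> Set[str]:
    """Lowercased words longer than three letters, with trailing punctuation removed."""
    return {w.strip(".,").lower() for w in text.split() if len(w.strip(".,")) > 3}


def overlap_judge(a: str, b: str) -> bool:
    """Stand-in for an LLM judge: True when the sentences share a content word."""
    return bool(_content_words(a) & _content_words(b))


transcript = [
    "Gradient descent updates the weights.",
    "The weights move against the gradient.",
    "Next, we turn to data loading.",
]
groups = group_sentences(transcript, overlap_judge)
print(groups)  # the first two sentences form one group, the third its own
```

Because the judge is passed in as a plain callable, the word-overlap stand-in can be swapped for a model-backed function without changing the grouping loop, and no category list has to be defined in advance.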
Potential Applications and Limitations
The framework is designed for comprehensive video analysis, particularly for content where structural and thematic details matter, such as educational lectures or recorded presentations. The paper also discusses the framework's limitations and possible future extensions, though the source does not detail specific application domains beyond general semantic analysis.
As the authors note, the work establishes a foundation for tools that can visualize semantic chunking and relevance-based heatmaps, pointing toward future interactive analytical interfaces. The paper is available on arXiv under a Creative Commons Zero license.