Scribby Multi-Level LLM Framework Promises Fine-Grained Semantic Analysis of Long-Form Video

Researchers propose Scribby, an LLM-based framework for semantic video analysis that balances macro-level comprehension with micro-level semantic indexing. The approach analyzes the full transcript, analyzes individual sentences, and then groups those sentences by semantic similarity using an LLM as a judge, with the aim of giving a more detailed picture of a video's structure and thematic progression.

iGEN Editorial

June 16, 2026


As video content continues to expand across educational platforms, recorded lectures, and live-streamed entertainment, the need for efficient, structured analysis of long-form footage has grown, according to a new arXiv preprint. Many existing AI tools, however, offer only high-level summaries built from AI-generated transcripts. These summaries tend to be coarse overviews that say little about a video's structure, thematic progression, or semantic relationships.

Scribby aims to address this gap with an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis. The framework, detailed in the paper by Abelarde, Julian, Belinchon, and Hugo Garrido-Lestache, establishes a foundation for video analysis tools that visualize semantic chunking and semantic matching through relevance-based heatmaps.

Technical Approach: Micro-Level Indexing with LLM as Judge

The first stage of the Scribby process indexes the video at a micro level through three steps:

Analyzing the full transcript at a global level
Analyzing individual transcript sentences
Grouping these sentences by semantic similarity using an LLM as a judge

Contextual continuity is retained during sentence-level processing by including both the global transcript analysis and the adjacent sentences in each evaluation prompt. This keeps the micro-level reading of each sentence grounded in the broader narrative of the video.
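To make the idea concrete, here is a minimal Python sketch of how such a prompt might be assembled. It is an illustration only: the function name `build_sentence_prompt`, the `window` parameter, and the prompt layout are assumptions, not details taken from the paper.

```python
from typing import List


def build_sentence_prompt(
    global_analysis: str,
    sentences: List[str],
    index: int,
    window: int = 1,  # hypothetical: how many neighbors to include on each side
) -> str:
    """Assemble an evaluation prompt for one transcript sentence.

    The prompt combines the global transcript analysis with the sentences
    immediately before and after the target sentence.
    """
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    before = sentences[start:index]
    after = sentences[index + 1:end]

    parts = [
        "Global transcript analysis:",
        global_analysis,
        "Preceding sentences:",
        " ".join(before) if before else "(none)",
        "Target sentence:",
        sentences[index],
        "Following sentences:",
        " ".join(after) if after else "(none)",
    ]
    return "\n".join(parts)
```

The resulting string would then be sent to whichever LLM is used; the model call itself is left out because the source does not specify it.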

Step	Description
1	Full transcript analysis (macro-level context)
2	Individual sentence analysis
3	Sentence grouping by semantic similarity via LLM-as-judge

The use of an LLM as a judge, in which the language model itself evaluates semantic similarity between sentences, is a key innovation. It allows the framework to capture nuanced relationships between segments without requiring pre-defined categories.
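A rough Python sketch of judge-based grouping follows. The greedy strategy, the `group_sentences` name, and the `toy_judge` stand-in are all assumptions made for illustration; the paper's actual grouping procedure and prompts are not described in this article. In practice the judge would wrap an LLM call that answers whether two sentences belong together.

```python
from typing import Callable, List

# A judge takes two sentences and says whether they belong in the same group.
# In a real system this would wrap an LLM call; here it is just a callable.
Judge = Callable[[str, str], bool]


def group_sentences(sentences: List[str], judge: Judge) -> List[List[str]]:
    """Greedily assign each sentence to the first group the judge accepts.

    Each group is represented by its first sentence. No categories are
    defined up front; groups emerge from the judge's pairwise decisions.
    """
    groups: List[List[str]] = []
    for sentence in sentences:
        for group in groups:
            if judge(group[0], sentence):
                group.append(sentence)
                break
        else:
            # No existing group was accepted, so start a new one.
            groups.append([sentence])
    return groups


def toy_judge(a: str, b: str) -> bool:
    """Placeholder for an LLM judge: true if two or more words are shared."""
    shared = set(a.lower().split()) & set(b.lower().split())
    return len(shared) >= 2


transcript = [
    "gradient descent updates weights",
    "the learning rate scales gradient descent",
    "lunch is at noon",
]
groups = group_sentences(transcript, toy_judge)
```

With the toy judge, the first two sentences end up in one group and the third forms its own. Swapping in an LLM-backed judge changes only the callable, not the grouping loop.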

Potential Applications and Limitations

The framework is designed for comprehensive video analysis, particularly for content where structural and thematic details matter, such as educational lectures or recorded presentations. The paper discusses limitations and future expansions of the framework, though specific application domains beyond general semantic analysis are not detailed in the source.

As the authors note, the work establishes a foundation for tools that can visualize semantic chunking and relevance-based heatmaps, pointing toward future interactive analytical interfaces. The paper is available on arXiv under a Creative Commons Zero license.

