New AI Framework Co-Scraper Achieves 94.78% F1 Score for Web Data Extraction with Reusable Scrapers

Researchers introduced Co-Scraper, a two-stage framework for automated web data extraction that integrates query-aware DOM pruning with a fine-tuned Qwen3-8B model. On the SWDE test set, it achieved an F1 score of 94.78% and a reuse success rate of 90.39%, enabling lightweight, reusable scrapers for heterogeneous web content.

iGEN Editorial

June 16, 2026


Automated extraction of data from web pages remains a critical yet resource-intensive task for enterprises that rely on information from multiple online sources. Manual scraper development often fails to scale across the vast and varied structures of modern HTML documents. A new framework called Co-Scraper, detailed in a paper published on arXiv, offers a solution by combining query-aware DOM pruning with stable extraction strategy induction to generate reusable programmatic wrappers.

How Co-Scraper Works

According to the paper, Co-Scraper is a two-stage framework designed to handle the hierarchical complexity of long HTML documents. In the first stage, it performs query-aware DOM pruning: it removes the parts of the document object model that are irrelevant to the user's extraction query, which reduces noise and computational overhead. The second stage induces a stable extraction strategy, which is then synthesized into an executable scraper that can be reused across similar web pages. The entire process is powered by Qwen3-8B, a large language model that the authors fine-tuned for this extraction task.
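To make the first stage more concrete, here is a minimal sketch of query-aware pruning. It is not the paper's algorithm: the keyword-overlap test, the `tokens` and `prune` helpers, and the sample page are all assumptions made for illustration, and the page is written as well-formed XML so that Python's standard `xml.etree.ElementTree` parser can read it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page (well-formed so the stdlib XML parser accepts it).
PAGE = """<html><body>
<nav><a href="/home">Home</a><a href="/deals">Deals</a></nav>
<div class="product"><h1 class="title">Blue Widget</h1><span class="price">$19.99</span><p class="blurb">Ships in two days.</p></div>
<footer>Copyright Example Shop</footer>
</body></html>"""


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def prune(node, query_terms):
    """Remove child subtrees that share no token with the query.

    Assumed heuristic: a node is relevant if its own text or attribute
    values overlap the query terms, or if any descendant is relevant.
    Returns True when `node` should be kept.
    """
    own_text = " ".join(t for t in (node.text, *node.attrib.values()) if t)
    relevant = bool(tokens(own_text) & query_terms)
    for child in list(node):  # copy the child list because we mutate it
        if prune(child, query_terms):
            relevant = True
        else:
            node.remove(child)
    return relevant


root = ET.fromstring(PAGE)
prune(root, tokens("title price"))
print(ET.tostring(root, encoding="unicode"))
```

In this toy case the navigation bar, the blurb, and the footer are dropped, leaving only the title and price elements for the model to reason over. A production system would need a real HTML parser and a stronger relevance signal than keyword overlap.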

Performance and Validation

On the test set of the SWDE (Structured Web Data Extraction) benchmark, Co-Scraper achieved state-of-the-art performance with an F1 score of 94.78% and a reuse success rate of 90.39%. The researchers reported that these results represent a significant improvement in extraction accuracy and resilience over prior methods. The reuse success rate indicates how often a generated scraper can be applied to other structurally similar pages without modification, a key metric for reducing manual maintenance.
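For readers unfamiliar with the two metrics, the sketch below shows how they could be computed. The F1 function uses the standard definition (harmonic mean of precision and recall) over sets of extracted values; the reuse function reflects one plausible reading of the description above. The paper's exact evaluation protocol may differ, and the function names and sample values are hypothetical.

```python
def f1_score(predicted, gold):
    """Standard set-based F1: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def reuse_success_rate(outcomes):
    """Fraction of unseen pages where an unmodified scraper succeeded.

    `outcomes` holds one boolean per page (assumed interpretation).
    """
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical extraction result: two of three predicted values are correct.
example_f1 = f1_score(
    ["Blue Widget", "$19.99", "In stock"],
    ["Blue Widget", "$19.99", "SKU-1"],
)
# Hypothetical reuse outcomes on four structurally similar pages.
example_reuse = reuse_success_rate([True, True, False, True])
print(round(example_f1, 4), example_reuse)
```

Here precision and recall are both 2/3, so F1 is also 2/3, and the scraper is reusable on three of four pages.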

Implications for Scalable Data Extraction

The framework addresses a key bottleneck in enterprise data acquisition: the need to create a separate scraper for each website while keeping it robust to minor layout changes. By generating reusable wrappers from a single example, Co-Scraper reduces the time and cost associated with manual scraper development. Because the generated scrapers are lightweight programs, they also require few computational resources at run time, making them suitable for high-volume, low-latency extraction pipelines. While the paper does not name specific enterprise users, the approach appears applicable to domains such as e-commerce price monitoring, news aggregation, and supply chain data collection from supplier portals.
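The idea of a reusable wrapper can be illustrated with a small sketch. The article does not describe the format of the scrapers Co-Scraper emits, so the `WRAPPER` mapping, the `run_wrapper` helper, and both sample pages below are hypothetical; the example only shows how one set of selectors, once induced, can run unchanged on a second page with a slightly different layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: field name -> ElementTree path expression.
WRAPPER = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}

# Two hypothetical pages from the same template; PAGE_B nests the
# product inside an extra <section>, a minor layout change.
PAGE_A = """<html><body><div class="product">
<h1 class="title">Blue Widget</h1><span class="price">$19.99</span>
</div></body></html>"""

PAGE_B = """<html><body><section><div class="product">
<span class="price">$5.00</span><h1 class="title">Red Gadget</h1>
</div></section></body></html>"""


def run_wrapper(wrapper, page):
    """Apply each selector to the page; no model call is needed."""
    root = ET.fromstring(page)
    record = {}
    for field, path in wrapper.items():
        node = root.find(path)
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


record_a = run_wrapper(WRAPPER, PAGE_A)
record_b = run_wrapper(WRAPPER, PAGE_B)
print(record_a)
print(record_b)
```

Because the selectors key on attributes rather than on absolute positions, the same wrapper handles both pages. Running it involves only parsing and lookup, which is the kind of low per-page cost the lightweight-scraper claim refers to.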

