Automated extraction of data from web pages remains a critical yet resource-intensive task for enterprises that rely on information from multiple online sources. Hand-written scrapers often fail to scale across the vast and varied structures of modern HTML documents. A new framework called Co-Scraper, detailed in a paper published on arXiv, addresses this problem by combining query-aware DOM pruning with stable extraction strategy induction to generate reusable programmatic wrappers.
How Co-Scraper Works
According to the paper, Co-Scraper is a two-stage framework designed to handle the hierarchical complexity of long HTML documents. In the first stage, it performs query-aware DOM pruning: it removes the parts of the document object model that are irrelevant to the user's extraction query, which reduces both noise and computational overhead. The second stage induces a stable extraction strategy, which is then synthesized into an executable scraper that can be reused across similar web pages. The entire process is powered by Qwen3-8B, an open-weight 8-billion-parameter large language model, fine-tuned for this task.
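The two stages can be illustrated with a minimal sketch. It is not the paper's method: Co-Scraper uses a fine-tuned language model for both stages, whereas this example substitutes plain keyword matching for the pruning step and a hand-written class lookup for the induced wrapper. All names here (Node, TreeBuilder, prune, make_scraper, PAGE) are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """A minimal DOM node; children are Node objects or text strings."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []

    def text(self):
        parts = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join(parts).strip()

    def signature(self):
        # Attribute values and text of this node and all of its descendants.
        parts = [str(v) for v in self.attrs.values() if v]
        parts += [c if isinstance(c, str) else c.signature() for c in self.children]
        return " ".join(parts).lower()


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def prune(node, keywords):
    """Stage 1 stand-in: drop element subtrees that mention no query keyword."""
    keywords = [k.lower() for k in keywords]
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
        elif any(k in child.signature() for k in keywords):
            kept.append(prune(child, keywords))
    node.children = kept
    return node


def find_by_class(node, cls):
    """Depth-first search for the first element whose class equals cls."""
    for child in node.children:
        if isinstance(child, str):
            continue
        if child.attrs.get("class") == cls:
            return child
        hit = find_by_class(child, cls)
        if hit is not None:
            return hit
    return None


def make_scraper(cls):
    """Stage 2 stand-in: return a reusable function for structurally similar pages."""
    def scrape(html):
        node = find_by_class(parse(html), cls)
        return node.text() if node else None
    return scrape


PAGE = """
<html><body>
<nav class="menu"><a href="/home">Home</a><a href="/deals">Deals</a></nav>
<div class="product"><h1 class="title">Widget</h1><span class="price">$19.99</span></div>
<footer>Copyright 2024</footer>
</body></html>
"""

pruned = prune(parse(PAGE), ["price"])
print(pruned.text())
```

With the query keyword "price", the navigation bar, title, and footer are pruned and only the price subtree survives, so the script prints $19.99. Calling make_scraper("price") returns a function that can be applied to other pages that mark up prices the same way, which is the sense in which a wrapper is reusable.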
Performance and Validation
On the test set of the SWDE (Structured Web Data Extraction) benchmark, Co-Scraper achieved state-of-the-art performance, with an F1 score of 94.78% and a reuse success rate of 90.39%. The researchers report that these results represent a significant improvement in extraction accuracy and resilience over prior methods. The reuse success rate indicates how often a generated scraper can be applied to other structurally similar pages without modification, a key metric for reducing manual maintenance.
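For readers unfamiliar with the two metrics, the sketch below shows how they are conventionally computed. F1 is the standard harmonic mean of precision and recall. The reuse success rate is modeled here as the share of pages on which a reused scraper succeeds, which is an assumption based on the description above; the paper's exact definition may differ. The counts are invented for illustration and are not taken from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def reuse_success_rate(outcomes):
    """Fraction of pages where a reused scraper worked (assumed definition)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Illustrative counts only: 90 correct extractions, 10 spurious, 10 missed.
print(f"F1: {f1_score(90, 10, 10):.2%}")

# Illustrative outcomes: the scraper worked unmodified on 3 of 4 sibling pages.
print(f"Reuse success rate: {reuse_success_rate([True, True, False, True]):.2%}")
```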
Implications for Scalable Data Extraction
The framework addresses a key bottleneck in enterprise data acquisition: the need to create a separate scraper for each website while keeping each one robust to minor layout changes. By generating reusable wrappers from a single example, Co-Scraper reduces the time and cost associated with manual scraper development. The generated scrapers are also lightweight and require few computational resources to run, which makes them suitable for high-volume, low-latency extraction pipelines. While the paper does not name specific enterprise users, the approach is directly applicable to domains such as e-commerce price monitoring, news aggregation, and supply chain data collection from supplier portals.