
New AI Framework Co-Scraper Achieves 94.78% F1 Score for Web Data Extraction with Reusable Scrapers

Researchers introduced Co-Scraper, a two-stage framework for automated web data extraction that integrates query-aware DOM pruning with a fine-tuned Qwen3-8B model. On the SWDE test set, it achieved an F1 score of 94.78% and a reuse success rate of 90.39%, enabling lightweight, reusable scrapers for heterogeneous web content.

iGEN Editorial
June 16, 2026

Automated extraction of data from web pages remains a critical yet resource-intensive task for enterprises that rely on information from multiple online sources. Manual scraper development often fails to scale across the vast and varied structures of modern HTML documents. A new framework called Co-Scraper, detailed in a paper published on arXiv, offers a solution by combining query-aware DOM pruning with stable extraction strategy induction to generate reusable programmatic wrappers.

How Co-Scraper Works

According to the paper, Co-Scraper is a two-stage framework designed to handle the hierarchical complexity of long HTML documents. In the first stage, it performs query-aware DOM pruning, removing parts of the document object model that are irrelevant to the user's extraction query. This reduces noise and computational overhead. The second stage induces a stable extraction strategy, which is then synthesized into an executable scraper that can be reused across similar web pages. The entire process is powered by a fine-tuned Qwen3-8B model, a large language model that the researchers adapted for this extraction task.
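To make the first stage concrete, here is a minimal Python sketch of query-aware pruning. It is an illustration only: the article does not describe how Co-Scraper scores DOM nodes, so the keyword-overlap rule, the function names `subtree_text` and `prune`, and the sample page are all assumptions. The sketch also uses the standard library's XML parser, which only accepts well-formed markup.

```python
import xml.etree.ElementTree as ET

def subtree_text(node):
    """Return all text inside a node, joined and lower-cased."""
    return " ".join(t.strip() for t in node.itertext() if t.strip()).lower()

def prune(node, query_terms):
    """Drop child subtrees that mention none of the query terms (in place).

    Assumption: plain keyword overlap stands in for whatever relevance
    scoring the real framework uses.
    """
    for child in list(node):
        if any(term in subtree_text(child) for term in query_terms):
            prune(child, query_terms)  # keep this branch, prune deeper
        else:
            node.remove(child)
    return node

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<nav><a>Home</a><a>Deals</a></nav>
<div class="product"><h1>Acme Kettle</h1><span class="price">Price: $24.99</span></div>
<footer>Copyright Example Shop</footer>
</body></html>"""

root = ET.fromstring(PAGE)
pruned_html = ET.tostring(prune(root, ["price"]), encoding="unicode")
print(pruned_html)
```

With the query term "price", the navigation bar, heading, and footer are dropped and only the branch leading to the price element remains, which is the kind of shortened input a language model could then process more cheaply.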

Performance and Validation

On the test set of the SWDE (Structured Web Data Extraction) benchmark, Co-Scraper achieved state-of-the-art performance with an F1 score of 94.78% and a reuse success rate of 90.39%. The researchers reported that this significantly enhances the accuracy and resilience of data extraction compared to prior methods. The reuse success rate indicates how often the generated scraper can be applied to other structurally similar pages without modification, a key metric for reducing manual maintenance.
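The two metrics can be illustrated with a short Python sketch. The F1 formula is the standard harmonic mean of precision and recall. The reuse success rate below is an assumed reading of the article's description (the share of similar pages on which an unmodified scraper returns the expected value); the paper's exact definition may differ. The names `f1_score`, `reuse_success_rate`, and `toy_scraper`, and all sample data, are hypothetical.

```python
import re

def f1_score(predicted, gold):
    """Standard F1: harmonic mean of precision and recall over value sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def reuse_success_rate(scraper, pages, expected):
    """Assumed definition: fraction of pages where the unchanged scraper
    returns exactly the expected value."""
    hits = sum(1 for page, want in zip(pages, expected) if scraper(page) == want)
    return hits / len(pages)

def toy_scraper(page):
    """Hypothetical scraper: grab the text of any element with class 'price'."""
    match = re.search(r'class="price">([^<]+)<', page)
    return match.group(1) if match else None

f1 = f1_score({"$24.99", "$9.50", "$3.00"}, {"$24.99", "$9.50", "$7.25", "$1.00"})

pages = [
    '<span class="price">$1</span>',
    '<span class="price">$2</span>',
    '<b class="price">$3</b>',
    '<span class="cost">$4</span>',  # layout differs, so the scraper misses
]
rate = reuse_success_rate(toy_scraper, pages, ["$1", "$2", "$3", "$4"])
print(round(f1, 4), rate)  # prints 0.5714 0.75
```

In this toy data, two of three predictions are correct and two of four gold values are found, giving precision 2/3, recall 1/2, and F1 of 4/7. The scraper succeeds on three of four pages, so the reuse rate is 0.75.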

Implications for Scalable Data Extraction

The framework addresses a key bottleneck in enterprise data acquisition: the need to create separate scrapers for each website while keeping them robust to minor layout changes. By generating reusable wrappers from a single example, Co-Scraper reduces the time and cost associated with manual scraper development. The lightweight nature of the extracted scrapers also minimizes computational resources, making them suitable for high-volume, low-latency extraction pipelines. While the paper does not name specific enterprise users, the approach is directly applicable to domains such as e-commerce price monitoring, news aggregation, and supply chain data collection from supplier portals.
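As a rough picture of what a reusable wrapper could look like, the Python sketch below stores one path expression per field and applies it to several pages that share a template, with no model call at extraction time. The wrapper format, the `apply_wrapper` helper, and the sample pages are assumptions made for illustration. The article does not specify the form of the scrapers Co-Scraper generates, and real-world HTML would need a tolerant HTML parser rather than the strict XML parser used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: field name -> path in ElementTree's XPath subset.
WRAPPER = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}

def apply_wrapper(wrapper, page_markup):
    """Extract one record from a well-formed page using fixed paths."""
    root = ET.fromstring(page_markup)
    record = {}
    for field, path in wrapper.items():
        node = root.find(path)
        # A missing node yields None so callers can detect layout drift.
        record[field] = "".join(node.itertext()).strip() if node is not None else None
    return record

# Two hypothetical pages built from the same template.
PAGES = [
    "<html><body><div><h1>Acme Kettle</h1><span class='price'>$24.99</span></div></body></html>",
    "<html><body><div><h1>Acme Toaster</h1><span class='price'>$39.00</span></div></body></html>",
]

records = [apply_wrapper(WRAPPER, page) for page in PAGES]
for record in records:
    print(record)
```

Because the wrapper is just data plus a small interpreter, it can run over many pages cheaply, and a `None` field is a simple signal that a page no longer matches the layout the wrapper was induced from.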

