New OSGuard Benchmark Evaluates Safety of Computer-Use Agents for Enterprise AI Deployment

Researchers introduce OSGuard, a benchmark suite for evaluating safety in computer-use agents. It includes action-level guardrail decisions and a risk-augmented execution suite to detect unsafe completions that satisfy nominal task objectives. Early tests show current multimodal guardrails perform well on isolated action judgments but reveal gaps in end-to-end safety.

iGEN Editorial

June 16, 2026

Enterprise technology leaders deploying AI agents for desktop and web tasks face a critical blind spot: an agent may complete a task successfully but through an unsafe shortcut that violates security or operational policies. A new research benchmark, OSGuard, directly addresses this gap by providing a structured way to test the safety of computer-use agents under benign, unchanged user instructions.

The benchmark is detailed in a paper by researchers Mina Mohammadmirzaei and Jeffrey Flanigan, published on arXiv. OSGuard stands for Operating System Guard, and it takes a dual-granularity approach to safety evaluation: an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end evaluation.

The action-level benchmark contains contextualized proposed actions labeled as allowed, unrelated, or unsafe. Each action is judged relative to the original instruction and current interface state, allowing precise testing of whether an agent can recognize prohibited steps. The execution suite, derived from OSWorld task variants, introduces latent hazards such as destructive overwrites while keeping the original task achievable. Each variant comes with augmented evaluators that retain the original task-success criterion but add explicit state-based safety invariants. This design lets evaluators distinguish safe completions from unsafe ones that merely satisfy the nominal task objective.
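To make the augmented-evaluator idea concrete, here is a minimal Python sketch, not the benchmark's actual code. It assumes the final machine state can be represented as a dictionary of file contents; the `AugmentedEvaluator` class, the outcome names, and the report/backup task are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation of the final machine state: path -> file contents.
State = Dict[str, str]


@dataclass
class AugmentedEvaluator:
    """Original task-success check plus explicit state-based safety invariants."""
    task_success: Callable[[State], bool]
    safety_invariants: List[Callable[[State], bool]]

    def evaluate(self, final_state: State) -> str:
        succeeded = self.task_success(final_state)
        safe = all(inv(final_state) for inv in self.safety_invariants)
        if succeeded and safe:
            return "safe_completion"
        if succeeded:
            # Nominal objective met, but at least one invariant was broken.
            return "unsafe_completion"
        return "safe_failure" if safe else "unsafe_failure"


# Illustrative task: write the new report without overwriting an existing backup.
ORIGINAL_BACKUP = "q1 figures"

evaluator = AugmentedEvaluator(
    task_success=lambda s: s.get("report.txt") == "q2 figures",
    safety_invariants=[lambda s: s.get("backup.txt") == ORIGINAL_BACKUP],
)

safe_run = {"report.txt": "q2 figures", "backup.txt": "q1 figures"}
unsafe_run = {"report.txt": "q2 figures", "backup.txt": "q2 figures"}

print(evaluator.evaluate(safe_run))    # safe_completion
print(evaluator.evaluate(unsafe_run))  # unsafe_completion
```

Both runs pass the original success check; only the invariant on the backup file separates them, which is the distinction the execution suite is designed to surface.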

Benchmark Component	Focus	Key Metrics
Action-level	Local guardrail decisions	Correct classification of allowed/unrelated/unsafe actions
Execution suite	End-to-end task safety	Task success rate + state-based safety invariant violations
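The metrics in the table could be computed roughly as follows. This is a sketch under stated assumptions: the article does not specify the paper's exact formulas, so the function names, the per-label recall, and the "safe success rate" are illustrative choices rather than the benchmark's official definitions.

```python
from collections import Counter
from typing import Dict, List, Tuple

LABELS = ("allowed", "unrelated", "unsafe")


def action_level_scores(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """pairs holds (gold_label, predicted_label); returns accuracy and per-label recall."""
    totals = Counter(gold for gold, _ in pairs)
    hits = Counter(gold for gold, pred in pairs if gold == pred)
    scores = {"accuracy": sum(hits.values()) / len(pairs)}
    for label in LABELS:
        if totals[label]:
            scores[f"recall_{label}"] = hits[label] / totals[label]
    return scores


def execution_scores(runs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """runs holds (task_succeeded, invariants_held) for each executed task variant."""
    n = len(runs)
    return {
        "task_success_rate": sum(ok for ok, _ in runs) / n,
        "violation_rate": sum(not safe for _, safe in runs) / n,
        "safe_success_rate": sum(ok and safe for ok, safe in runs) / n,
    }


# Made-up judgments and runs, for illustration only.
pairs = [
    ("allowed", "allowed"),
    ("unsafe", "unsafe"),
    ("unsafe", "allowed"),
    ("unrelated", "unrelated"),
]
runs = [(True, True), (True, False), (False, True), (True, True)]

print(action_level_scores(pairs))
print(execution_scores(runs))
```

Reporting a safe success rate alongside the raw task success rate keeps an unsafe completion from being counted as a win.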

Experimental results on OSGuard show that current multimodal guardrails can perform well on isolated action judgments, but the risk-augmented execution suite exposes remaining gaps between local oversight and reliable end-to-end safety. This suggests that even a guardrail that correctly flags unsafe actions in isolation may not stop the agent it oversees from taking unsafe shortcuts during full task execution.

The dual-granularity design enables more precise diagnosis of whether models can both recognize unsafe proposed actions and improve full-task safety when deployed as guardrails. For enterprise adopters, this distinction is critical because a guardrail that works in a test environment may not prevent costly or dangerous behavior in production workflows.
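As a rough illustration of what "deployed as a guardrail" can mean, the sketch below screens each proposed action before it is executed. Everything here is an assumption for illustration: `keyword_guardrail` is a toy stand-in for a learned multimodal model, the action strings are invented, and the interface state is held fixed for simplicity.

```python
from typing import Callable, Iterable, List

# (instruction, interface_state, proposed_action) -> "allowed" | "unrelated" | "unsafe"
Guardrail = Callable[[str, str, str], str]


def keyword_guardrail(instruction: str, ui_state: str, action: str) -> str:
    """Toy stand-in for a guardrail model; a real guardrail would be learned."""
    if "overwrite" in action or "delete" in action:
        return "unsafe"
    if "report" not in action:
        return "unrelated"
    return "allowed"


def run_with_guardrail(
    instruction: str,
    ui_state: str,
    proposed_actions: Iterable[str],
    guardrail: Guardrail,
) -> List[str]:
    """Execute only the actions the guardrail labels as allowed."""
    executed = []
    for action in proposed_actions:
        if guardrail(instruction, ui_state, action) == "allowed":
            executed.append(action)  # a real harness would act on the UI here
    return executed


proposed = [
    "open report.txt",
    "overwrite backup.txt",
    "open browser",
    "save report.txt",
]
executed = run_with_guardrail(
    "Update the quarterly report", "file manager open", proposed, keyword_guardrail
)
print(executed)  # ['open report.txt', 'save report.txt']
```

Whether the filtered trajectory still completes the task, and does so without violating a safety invariant, is a separate question that only an end-to-end evaluation can answer.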

As AI agents become more common in enterprise software, handling tasks like data entry, document processing, or system administration, the ability to formally benchmark safety is likely to become a procurement consideration. OSGuard provides a framework that technology buyers can reference when evaluating vendors' claims about agent safety. The benchmark's grounding in real desktop and web tasks makes it directly relevant for enterprise use cases.

Future work will need to expand the range of hazards and task complexity. For now, OSGuard serves as a foundational tool for distinguishing agents that merely complete tasks from those that complete them safely, a distinction that can separate an operational gain from a regulatory or security incident.
