GEEK HAUS
2026/06/01/anthropic-s-browser-agent-got-hijacked-31-5-of

Anthropic’s browser agent got hijacked 31.5% of the time before safeguards engaged

VentureBeat

EDITOR BRIEF

Anthropic disclosed that red-teamers hijacked its browser-based agent 31.5% of the time before safeguards intervened, while OpenAI, Google, and Meta offered less comparable disclosures. The article argues that Anthropic’s high number may be valuable because it is one of the few concrete benchmarks buyers have for prompt injection risk.

CONTEXT

The bigger issue is not one lab’s failure rate but the absence of shared testing standards for agent security. As AI systems gain access to browsers, documents, and enterprise tools, buyers will increasingly demand comparable metrics before trusting agents with sensitive workflows.

ARTICLE

Across the frontier labs, the highest prompt injection figures published this spring are Anthropic’s. Point a red-teamer at its newest model in a browser, and the attacker hijacked it 31.5% of the time before safeguards engaged. OpenAI, Google, and Meta never gave security leaders a comparable number to set beside it. That figure looks like a liability. In this comparison, it is the opposite. It's the one solid piece of ground.

Four frontier labs each shipped a prompt injection disclosure, and no two match. Anthropic put 244 pages and four agentic surfaces on the table on May 28. OpenAI reported one surface, connectors. Google moved the subject out of the model card and into a separate safety framework. Meta shipped no closed-model card at all. The Cross-Vendor Prompt Injection Disclosure Grid below maps what each lab tested, what each one measured, and the four places a side-by-side comparison falls apart.

A prompt injection hides a malicious instruction in something an agent reads, a web page, a document, or a tool result. One planted line can exfiltrate records or fire off actions nobody approved, and these cards are a buyer's only first-party evidence.

There is no industry standard for measuring any of this, and that is the root of the problem. Carter Rees, VP of AI at Reputation, told VentureBeat that prompt injection breaks the assumption that every legacy tool was built on. "A phrase as innocuous as, 'ignore previous instructions' can carry a payload as devastating as a buffer overflow, yet it shares no commonality with known malware signatures." With no shared signature to scan for, each lab built its own yardstick, and the results do not line up.

Adam Meyers, Senior Vice President of Counter Adversary Operations at CrowdStrike, said that the exposure is now the buyer's to manage. "As you implement AI, it increases your attack surface, so now you have to be able to protect those AI models against adversary misuse or data poisoning or prompt injection."
CrowdStrike's own frontline data shows the threat side is not standing still. In its 2026 Financial Services Threat Landscape Report, released in May, the company reported adversaries using AI to compress the time from initial access to impact faster than legacy defenses can respond.

Anthropic measured four surfaces. The numbers swing by an order of magnitude depending on which one you read.

The Opus 4.8 card does what others do not: It breaks prompt injection out by surface, and the spread is the story.

Put the model in a coding environment, and an adaptive attacker from Gray Swan's Shade tool got through on 7.03% of single attempts with thinking on. Safeguards pulled that to 2.09%.

Move the same class of attack into a browser, the surface behind Claude in Chrome and Claude Cowork, and the floor gives way. Anthropic put professional red-teamers on 129 web environments held out from training and printed every result in Table 5.2.2.4.A on page 81 of the system card. Per-attempt is the share of all injection attempts that got through across 129 environments at 10 tries each. Per-scenario is the harder cut, the share of environments where at least one try landed.

Read down the per-attempt column without safeguards, thinking on, and the raw rate drops with each generation, from Sonnet 4.6 at 50.7% to Opus 4.8 at 31.5%. The lowest in the table, 5.9%, belongs to Mythos Preview, which nobody can buy yet. Turn safeguards on, and Opus 4.8 drops to 0.5%. Turn thinking off and it drops to zero across all 129 environments.

OpenAI measured one surface, with attacks it already knew.

The GPT-5.5 card, published April 23 and updated April 24, handles prompt injection in one place, a single section on robustness to known attacks against connectors. OpenAI reports it as a robustness score where higher is better, the inverse of an attack success rate. GPT-5.5 came in at 0.963, down from 0.998 for GPT-5.4-thinking. That one figure is the whole disclosure.

Anthropic tested four surfaces against an adaptive attacker that rewrites its approach based on what the model does, then ran a one-week bug bounty where red-teamers tried to break the model live. When the coding results came back worse than Opus 4.7, the card said so.

Lay the 0.963 next to the 31.5%, and they look like they belong on a scoreboard. They do not. One is a robustness score against known attacks on one surface. The other is a per-attempt attack success rate across 129 browser environments against an attacker that adapted in real time.

Google and Meta never put the number in the card at all

Google's Gemini 3 files prompt injection under mitigations, and the launch materials describe stronger resistance with no number attached. The Frontier Safety Framework report does run red teaming, but across its capability domains, and prompt injection is not one of them. No model card, no framework page, no per-surface number a buyer can lift into a risk review.

Meta ships open weights with no
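
For readers who want to see why the per-attempt and per-scenario cuts in Anthropic's browser table can diverge, here is a minimal Python sketch. It assumes each environment is recorded as a list of pass/fail attempt outcomes. The function names and the toy data are hypothetical, invented for illustration, and are not taken from the system card.

```python
from typing import Sequence


def per_attempt_rate(outcomes: Sequence[Sequence[bool]]) -> float:
    """Share of all injection attempts that succeeded, pooled across environments."""
    total_attempts = sum(len(env) for env in outcomes)
    if total_attempts == 0:
        raise ValueError("no attempts recorded")
    successful = sum(sum(env) for env in outcomes)
    return successful / total_attempts


def per_scenario_rate(outcomes: Sequence[Sequence[bool]]) -> float:
    """Share of environments where at least one attempt succeeded."""
    if len(outcomes) == 0:
        raise ValueError("no environments recorded")
    return sum(any(env) for env in outcomes) / len(outcomes)


# Toy data (hypothetical): 4 environments, 5 attempts each.
# True means the injection got through on that attempt.
toy_outcomes = [
    [True, False, False, False, False],
    [False, False, False, False, False],
    [True, True, False, False, False],
    [False, False, False, False, False],
]

attempt_rate = per_attempt_rate(toy_outcomes)    # 3 successes out of 20 attempts
scenario_rate = per_scenario_rate(toy_outcomes)  # 2 of 4 environments breached

print(f"per-attempt:  {attempt_rate:.1%}")
print(f"per-scenario: {scenario_rate:.1%}")
```

When every environment gets the same number of tries, the per-scenario figure can never fall below the per-attempt figure, because a single landed try marks the whole environment as breached. That is why a buyer comparing two cards needs to know which cut a published number represents before setting it beside another.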
