GEEK HAUS
2026/06/06/when-claude-changed-everything-changed-managing

When Claude changed, everything changed: Managing AI blast radius in production

VentureBeat

EDITOR BRIEF

A company’s internal reporting tool used Claude to convert plain-English data requests into structured JSON API calls for analysts and business teams. After earlier model upgrades worked smoothly, a later Claude change altered behavior enough to affect production, showing how fragile LLM contracts can be when downstream systems depend on exact outputs.

CONTEXT

The incident highlights a growing operational challenge: AI models are not static dependencies, and even “better” model versions can break workflows that rely on consistent formatting or interpretation. Teams using LLMs in production need stronger blast-radius controls, including schema validation, regression tests, staged rollouts, fallbacks, and monitoring for behavioral drift.

ARTICLE

Our system did one thing, and it did it well: It turned natural-language questions into API calls.

The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like "Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city" was translated into an API call that the system could act on:

```json
{
  "description": "User requested sales volume for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "northeast"
  }
}
```

The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several homegrown services), applied a large language model (LLM)-generated JSON query to filter and shape the response, and delivered the result via email, as a Drive document, or rendered as a chart in the browser.

By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. The system had become the default way most teams pulled ad-hoc data.

The contract between the LLM and the rest of the system was the structured JSON object shown in the example above: a description, an api_call, and a post_body.

We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and to 4.0 without incident.
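To make the JSON contract concrete, here is a minimal Python sketch of checking model output at the boundary before anything is dispatched. It is an illustration only, not the code the system described here actually ran: the names ContractError, ApiRequest, and parse_model_output are hypothetical, and the checks cover only the three fields and types visible in the example object.

```python
import json
from dataclasses import dataclass


class ContractError(ValueError):
    """Raised when model output does not match the expected JSON contract."""


@dataclass
class ApiRequest:
    # Field names mirror the example object; the class itself is hypothetical.
    description: str
    api_call: str
    post_body: dict


# Expected Python type for each field of the contract.
EXPECTED_FIELDS = {"description": str, "api_call": str, "post_body": dict}


def parse_model_output(raw: str) -> ApiRequest:
    """Parse raw model text into an ApiRequest, or raise ContractError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"output is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ContractError("output must be a JSON object")

    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            raise ContractError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ContractError(
                f"field {name!r} must be of type {expected_type.__name__}"
            )

    return ApiRequest(data["description"], data["api_call"], data["post_body"])
```

A caller could catch ContractError and route the request to a fallback or a retry instead of dispatching a malformed call. A check like this only catches structural violations; it says nothing about whether the values inside post_body are the ones the user asked for.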
By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.

First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the specific API being called, the backend either returned unfiltered sales volume (for all time or all regions) or returned a 500 error.

Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to break in multiple ways.

We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.

Why traditional engineering discipline fails here

Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved.
You can leverage the following property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.

LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.

This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.

Anatomy of the failure

The post-mortem revealed that our prompt had always been under-specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.

Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being "helpful" in its formatting
