When Claude changed, everything changed: Managing AI blast radius in production
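Editor's summary
This article covers the operational risk faced by an internal system that converted natural-language requests into structured JSON API calls, generating several hundred reports a month, as it came to depend on Claude model changes. The upgrades from Claude Sonnet 3.5 to 3.7 and 4.0 went smoothly, but the change that followed showed that an LLM's output contract must be treated as a core dependency of a production system.
Context
This case shows that as LLM-based automation moves beyond a simple feature and becomes basic infrastructure for a workflow, designs that limit the blast radius of a model change become essential. Traditional engineering controls such as schema validation, regression testing, version pinning, staged rollout, and fallback paths are becoming standard practice for operating AI systems.
Article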
Our system did one thing, and it did it well: It turned natural-language questions into API calls.

The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like "Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city" was translated into an API call that the system could act on:

```json
{
  "description": "User requested sales volume for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "northeast"
  }
}
```

The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several homegrown services), applied a large language model (LLM)-generated JSON query to filter and shape the response, and delivered the result via email, as a Drive document, or rendered as a chart in the browser.

By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. The system had become the default way most teams pulled ad-hoc data.

The contract between the LLM and the rest of the system was the structured JSON object shown in the example above: a description, an api_call, and a post_body.

We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 and then to 4.0, both without incident.
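
A contract of this shape can be checked mechanically before anything is dispatched. The following is a minimal sketch, not the system's actual code: the names `parse_model_response` and `ContractError` are hypothetical, and the rule that an empty post_body is rejected is an assumption that every call needs at least one parameter.

```python
import json
from typing import Any

# The three fields of the contract and the JSON type each must have.
REQUIRED_FIELDS = {"description": str, "api_call": str, "post_body": dict}


class ContractError(ValueError):
    """Raised when a model response does not satisfy the expected contract."""


def parse_model_response(raw: str) -> dict[str, Any]:
    """Parse raw model output and check it against the three-field contract."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Covers any response that is free text rather than JSON.
        raise ContractError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(payload, dict):
        raise ContractError("response must be a JSON object")

    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            raise ContractError(f"missing field: {field}")
        if not isinstance(payload[field], expected_type):
            raise ContractError(f"{field} must be of type {expected_type.__name__}")

    # Assumption: a call with no parameters is never intended, so refuse it
    # instead of sending an unfiltered request to the backend.
    if not payload["post_body"]:
        raise ContractError("post_body is empty; refusing to dispatch")

    return payload
```

Rejecting a response at this boundary turns a silently wrong report into an explicit failure that the caller can retry, log, or surface to the user.
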
By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.

First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the specific API being called, the backend either returned unfiltered sales volume, across all dates or all regions, or returned a 500 error.

Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to break in multiple ways.

We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.

Why traditional engineering discipline fails here

Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved.
All of this relies on one property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.

LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.

This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.

Anatomy of the failure

The post-mortem revealed that our prompt had always been under-specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.

Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being "helpful" in its formatting