GEEK HAUS
2026/05/24/ai-agents-are-quietly-generating-chaos

AI agents are quietly generating chaos engineering failures enterprises don’t track yet

VentureBeat

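Editor's summary

This article points to a new risk: AI agents deployed in enterprise environments perform technically valid actions based on incomplete context and, in doing so, trigger cascading infrastructure failures. Many organizations currently cannot classify these events as either agent failures or infrastructure failures, which exposes the structural limitation of treating chaos engineering and autonomous agent operations as separate domains.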

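Context

As AI agents become the executing actors in operations automation, the causes of failure shift from code defects to problems of context, permissions, and decision boundaries. Enterprises should treat agent behavior not as a separate application risk but as part of operational resilience validation, and redesign toward integrating chaos engineering with agent governance.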

Body

There is a category of production incident that engineering teams are not tracking yet, because it doesn't fit any existing postmortem template. The agent initiated an action. The action was technically correct given the agent's context. The context was incomplete. The infrastructure cascaded. And, by the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure, because the frameworks for thinking about these two things have never been connected.

The scale of this exposure is no longer theoretical. Seventy-nine percent of organizations now have some form of AI agent in production, with 96% planning expansion. Gartner predicts 33% of enterprise software will include agentic AI by 2028, but separately warns that 40% of those projects will be canceled due to poor risk controls. What neither statistic captures is the failure mode happening between those two numbers: Agents that are running, that are not canceled, and that are quietly generating infrastructure events no one has categorized as risk.

I've spent six years building infrastructure automation systems at enterprise scale, first at Cisco (leading AI-driven lifecycle platforms deployed across 20-plus global enterprise customers), then at Splunk (designing AI-assisted root cause analysis and observability workflows across thousands of enterprise environments). During that time I also filed a patent on intent-based chaos engineering methodology. And across all of it, I kept watching organizations make the same structural mistake: Treating autonomous agents and chaos engineering as separate disciplines. They are not. They are the same discipline, and the gap between them is quietly generating the next wave of major production incidents.

The judgment call that agents skip

To understand why this matters, you need to understand what's actually broken in how enterprises govern chaos today, before you add agents to the picture.

Most mature engineering organizations have invested in chaos engineering programs. Game days, blast radius controls, SLO-gated experiments. When a human engineer initiates a chaos experiment, the sequence has a critical property: A human is making a judgment call about whether the system has capacity to absorb the perturbation right now. They check dashboards. They look at the error budget burn rate. They assess whether dependencies are stable. It's imperfect and often intuitive, but there is at least a person in the loop asking the right question before anything runs.

When you introduce an autonomous remediation agent, one that can restart services, reroute traffic, scale resources, or modify configurations in response to detected anomalies, that question disappears. The agent sees an anomaly. The agent takes an action. The action is a chaos event. No SLO burn rate check. No blast radius calculation. No human judgment about whether right now is the right moment to introduce additional stress into a system that may already be under pressure from three other directions.

Here is the specific failure mode I have watched play out. A remediation agent detects elevated latency on a microservice and responds by restarting the service cluster, a reasonable action given its training data and its narrow view of the incident. What the agent doesn't know: Three other services are in the middle of handling peak traffic. The shared connection pool is already at 87% utilization. A dependent database is running a background index rebuild. The restart triggers a thundering herd against the recovering service. What started as a latency spike the agent was designed to fix becomes a cascade the agent was never designed to model. The blast radius of that agent action was not the service restart. It was everything downstream of the restart, in a system state the agent had no complete picture of.

Nobody's chaos engineering program had tested for that specific combination. Nobody's blast radius calculation had included the agent as an actor. Because we don't think of agents as chaos injectors. We should.

According to the AI Incidents Database, reported AI-related incidents rose 21% from 2024 to 2025. That count almost certainly understates the actual exposure, because most organizations have no incident classification that captures an autonomous agent action as the initiating cause of a cascade. The incident gets logged as a service restart, a connection pool saturation, or a latency event. The agent is invisible in the postmortem.

Absorb capacity is a resource; most systems don't treat it that way

The underlying problem is that enterprise systems have no shared language for absorb capacity: the real-time estimate of how much additional stress a system can take before it breaches its SLO commitments. Chaos engineering programs manage it implicitly, through human judgment and static thresholds that fire after a limit has already been crossed. Agent
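
The check a human runs before a game day, and that a remediation agent skips, might be sketched roughly as the gate below. This is an illustration rather than the author's method: `SystemState`, `Action`, `absorb_check`, and every threshold are hypothetical names and values. Only the 87% pool utilization comes from the scenario in the text. The burn rate, the blast radius of 4, and the count of busy dependencies are assumed numbers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SystemState:
    """Hypothetical snapshot of the signals a human would check first."""
    error_budget_burn_rate: float  # 1.0 means the budget is consumed exactly on pace
    pool_utilization: float        # shared connection pool, 0.0 to 1.0
    dependencies_busy: int         # dependencies at peak load or in maintenance


@dataclass
class Action:
    """Hypothetical description of what the agent wants to do."""
    name: str
    blast_radius: int              # number of services the action can touch


# Illustrative thresholds only; real values would come from an SLO policy.
MAX_BURN_RATE = 1.0
MAX_POOL_UTILIZATION = 0.80
MAX_BLAST_RADIUS_UNDER_STRESS = 1


def absorb_check(state: SystemState, action: Action) -> Tuple[bool, List[str]]:
    """Return (allowed, reasons); an empty reasons list means the action may run."""
    reasons: List[str] = []
    if state.error_budget_burn_rate > MAX_BURN_RATE:
        reasons.append("error budget is burning faster than the SLO allows")
    if state.pool_utilization > MAX_POOL_UTILIZATION:
        reasons.append("shared connection pool is near saturation")
    if (state.dependencies_busy > 0
            and action.blast_radius > MAX_BLAST_RADIUS_UNDER_STRESS):
        reasons.append("blast radius too wide while dependencies are under load")
    return (not reasons, reasons)


# The restart scenario from the text, with assumed values where it gives none.
state = SystemState(error_budget_burn_rate=0.5,
                    pool_utilization=0.87,
                    dependencies_busy=4)
allowed, reasons = absorb_check(state, Action("restart-service-cluster", 4))
print(allowed, reasons)
```

Returning the reasons alongside the verdict is a deliberate choice in this sketch: a blocked action leaves a record that names the agent as the would-be initiator, which is the kind of information the text says is missing from postmortems today.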
