Why prompt debt, retrieval debt, and evaluation debt are quietly reshaping enterprise AI risk
편집자 요약
본 기사는 AI 시스템의 실패가 코드 결함을 넘어 prompt, model, data pipeline, 평가 체계 전반에 분산되며 새로운 형태의 AI debt를 만들고 있다고 설명합니다. MIT와 S&P Global Market Intelligence 조사에 따르면 2025년 다수의 AI 프로젝트가 production 전환이나 가치 창출에 실패했으며, 원인으로는 복잡한 의존성과 관측하기 어려운 장애 지점이 지목됩니다.
맥락
전통적 technical debt는 비교적 재현 가능한 코드 문제였지만, AI debt는 확률적 동작과 데이터·모델 의존성 때문에 배포 이후에도 지속적인 모니터링이 필요합니다. 특히 evaluation debt가 누적되면 성능 저하와 drift를 조기에 포착하기 어려워져, 기업 AI 거버넌스의 초점이 개발 속도에서 운영 신뢰성으로 이동할 가능성이 큽니다.
본문
Over the past two decades, technical debt meant outdated architecture, messy code, and poorly maintained documentation. That definition is no longer sufficient in the AI era, where failure modes are more subtle and often non-linear. AI systems are introducing new layers of technical debt that live across prompts, models, and data dependencies — making these layers less visible, harder to measure, and often more dangerous than traditional debt.A crisis hiding in plain sightThe complexities of AI systems and their associated failures have been well documented. A 2025 MIT study found that 95% of AI projects fail to reach production or deliver value. A similar study by S&P Global Market Intelligence found that 42% of businesses scrapped multiple AI initiatives in 2025 — a sharp increase from 17% the previous year. Various reasons are cited for these failures, but most of them point to poorly designed and implemented systems that are complex to manage and have multiple hard-to-monitor failure points, leading to a rapid accumulation of AI debt. Traditional technical debt was localized to the codebase, and bugs were usually easily reproducible. Consequently, bugs could be easily identified during tests and fixed through rearchitecting the codebase. However, AI debt is much more distributed, manifesting across prompts, models, data pipelines, and all associated infrastructure. It is also more intermittent: Due to the probabilistic nature of AI, systems do not always respond the same way, leading to intermittent failures. This makes it much more challenging to identify risks during testing, and also creates a need for more continuous monitoring even post-deployment to prevent gradual drift and worsening performance.The new forms of AI debtAI debt typically manifests across four new forms, each of which comes with its own set of risks.Prompt debt is the most visible of these. A modern version of ‘spaghetti code,' this can include undocumented prompt tweaks, accumulated ‘quick-fix’ prompts that lead to inconsistencies, neglected version control of prompts, and ‘prompt stuffing’ (the cramming of extraneous data or context directly into AI prompts). All these combine to make prompts a form of untyped, untested code without any version control, leading to increased brittleness and vulnerabilities.Model dependency debt is another increasingly common form of AI debt. Most enterprises now depend on a mixture of external models developed by leading foundation model providers; applications and agents are built on top of API calls to these models. Consequently, application logic now depends on models that are external to the core system, and that cannot be clearly controlled. As models update, performance varies and reproducibility is lost — prompts tuned for one model may fail or perform poorly when switched to another model, whether an update from the same provider or from another provider.Most enterprise AI deployments today use retrieval-augmented generation (RAG), which pulls in additional context from enterprise data repositories. Retrieval debt is a consequence of these repositories having messy data, duplicated documents, and outdated information. This causes AI to return technically correct answers that are outdated and no longer relevant, causing downstream failures. Unlike hallucinations, these are harder to detect because they were correct, perhaps even until recently, and hence look correct to any tester. Evaluation debt reflects the lack of standardization in testing and monitoring for AI models and applications. While AI benchmarks exist, they tend to focus on narrow tests and reflect point-in-time results. Most enterprises lack consistent testing standards, ground truth datasets, and real-time monitoring of deployments; there is no equivalent yet of continuous integration /continuous delivery (CI/CD) for prompts. As a consequence, CIOs and CTOs do not have clear visibility into model performance and cannot track improvements or worsening of models. All of these are in addition to traditional forms of technical debt, which still manifest across the tools and systems that AI applications and agents interact with, read from, or write to. A rapid increase in the adoption of AI-generated code (often deployed without inadequate testing) is further aggravating inconsistencies within, and poor maintainability of traditional codebases. The new forms of AI debt combine with these earlier forms of technical debt to compound rapidly and create large-scale risks that can cause catastrophic failure of entire enterprise deployments. Solving for these risks is made even more challenging by the distributed nature of AI ownership – most systems span engineering, product, data, and business teams, leading to unclear accountability when an error is identified. As a result, these risks manifest in the form of escalating compute costs, inaccuracies in AI outputs, and increasing exceptions that need to be handled by humans — lea


