What is AI drift in software projects?

Two drifts, one word

“AI drift” can mean several different things depending on where you’re standing and what you were reading last. The MLOps sense is one family: model drift, prompt drift, LLM drift, all of them about how a trained system or a running prompt shifts over time. That family is real, well-covered elsewhere, and treated at the bottom of this page for anyone who arrived by that route.

The two senses this page treats are different again. Zoom out to the codebase, and “AI drift” is the slower phenomenon: what AI-generated code makes visible about a team’s engineering discipline over weeks and months. Zoom in to a single working session, and it’s a faster one: the model’s working understanding of the task wandering away from the spec as a thread runs long. Both get called “AI drift.” Neither is the other.

This page covers codebase-scale drift first, because it’s the slower-moving and better-understood of the two, then session-scale drift, which is newer and less often named.

one word · two scales

Codebase-scale drift

What it is

Codebase-scale drift is what becomes visible when a team takes the speed AI gives it, and the engineering practice around the typing turns out to have been holding the discipline together. Code lands faster than anyone can reason about it. Schema decisions, edge-case handling, security posture, contract boundaries: all of these still get made, but the humans on the team stop making them. The agent makes a plausible choice in passing, the code compiles, the tests it wrote for itself go green, and the team ships. Six weeks later, nobody can answer simple questions about why the system behaves the way it does, because nobody decided. The agent did.

Worth being clear about cause and effect. The team was always going to drift if the discipline wasn’t in place; AI didn’t introduce that possibility. What AI did was strip away the natural friction that kept the drift slow enough to ignore. Pre-AI, the typing was the bottleneck, and the bottleneck forced a pace the team could keep up with. Post-AI, the typing is no longer the bottleneck, and any weakness in the practice now compounds at the speed of the agent. The drift was always latent. AI made it loud.

The distinction matters because it widens the addressable problem. Any team with weak engineering discipline is at risk, not just AI-heavy ones. The fix isn’t “use less AI.” The fix is the discipline that was always supposed to be there, finally being given room to do its job.

Why it surfaces now

Codebase-scale drift is what becomes visible when the gap between demo-ready and production-ready stops being hidden by the cost of typing. AI made the cheap parts of building software cheaper still, and left the hard parts exactly as hard as they were. Teams that didn’t notice kept pouring their effort into the typing, because the typing was what felt like work for their entire careers. The agent now does the typing. The team is still organising around it. The discipline that used to live alongside the typing (the requirement that got argued through, the acceptance criteria (AC) that got written before the code, the review that caught the bad assumption) doesn’t automatically follow the typing into the new arrangement. If that discipline held real weight in the old setup, it’s now missing. If it didn’t, AI just made the absence visible.

Either way, the same observable result: teams that skip the discipline ship twice the code with half the understanding, and the gap compounds with every feature added on top. The compounding isn’t the agent’s fault. The agent is doing what it was asked to do at the speed it was asked to do it. The team is the layer that’s supposed to decide what gets built and check that it’s right, and that layer is what AI exposes.

What it looks like in a codebase

Drift shows up in patterns. None of them is new. The novelty is the speed at which they accumulate now that the engineering friction has been stripped away.

Schema drift under autocomplete. The agent adds a new column, a new field, a new payload shape, because something downstream needed it. The change wasn’t designed. It was inferred from surrounding code and patched in. Three weeks later, four other services have started reading the new field with three slightly different interpretations of what it means. No PR review caught it because each change was small and locally plausible. The data model now contradicts itself in production, and reconstructing the original intent is archaeology.

Silently invented edge cases. The agent encountered an ambiguous input and made a choice. Empty string treated as null. Negative numbers clamped to zero. Unicode normalised one way for storage, another way for display. None of these are wrong in isolation. None of them was decided by a person. Six months later, when a customer reports a bug, the team discovers that their product has been making policy decisions for half a year, and nobody remembers what those policies are.
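One way to keep these choices from being made in passing is to have the code refuse any input the team has not ruled on. The following is a minimal sketch, not a prescribed implementation: `QuantityPolicy`, `UndecidedInput`, `parse_quantity`, and the `AC-12` identifier are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


class UndecidedInput(ValueError):
    """Raised when an input falls outside what the written policy allows."""


@dataclass(frozen=True)
class QuantityPolicy:
    # Hypothetical policy record: each field is a decision a person made.
    empty_means_none: bool
    allow_negative: bool
    decided_in: str  # where the decision is recorded; the id format is an assumption


def parse_quantity(raw: str, policy: QuantityPolicy) -> Optional[int]:
    """Parse a quantity, rejecting input instead of guessing at a policy."""
    text = raw.strip()
    if text == "":
        if policy.empty_means_none:
            return None
        raise UndecidedInput(f"empty quantity not allowed ({policy.decided_in})")
    try:
        value = int(text)
    except ValueError:
        raise UndecidedInput(f"non-integer quantity {raw!r}") from None
    if value < 0 and not policy.allow_negative:
        # Reject rather than clamp to zero: clamping would be an undecided policy.
        raise UndecidedInput(f"negative quantity {value} ({policy.decided_in})")
    return value
```

The point of the design is that every branch either follows a recorded decision or fails loudly, so an ambiguous input surfaces as a question for the team instead of a silent default.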

Tests that prove nothing. The agent wrote tests alongside the code. The tests are green. The tests assert what the code does, which is not the same thing as asserting what the code should do. An AI-written test against AI-written code is the agent marking its own homework. The CI suite hums; the product behaves badly. This is what acceptance criteria as the contract was always for, and what AI-assisted teams without that discipline keep missing.

Requirements that were never written down. A product owner described a feature in a Slack thread. The agent built it. The feature exists in the codebase, in the test suite, and in the heads of two engineers. It does not exist in any document that survives them. When the team changes, the feature’s reason for existing goes with the people who remember the Slack thread. Six months from now, somebody will argue the feature should be removed because it doesn’t look important, and nobody will be able to prove otherwise.

How to prevent it

You prevent codebase-scale drift by putting the discipline back in place that the typing used to keep visible. The agent writes the code. The team writes the requirements, the acceptance criteria, the contracts, and the chain that ties them together. The cycle is mechanical, not heroic.

Three pieces do most of the work. The first is traceability: every line of code traces to a test, every test to an acceptance criterion, every criterion to a story and a requirement. When the chain is in place, drift becomes visible. A piece of code with no AC behind it is a flag. An AC with no test is a flag. A test with no AC is decorative. The discipline catches drift early, when it’s still a small correction.

The second is the build cycle: five stages per spec, each one committing, none of them skippable. The cycle is what stops the agent shipping work that was never specified and never reviewed honestly. It also gives the team a regular cadence for noticing when the agent has wandered, because the Review stage is a separate stage, not a thing that happens in the same breath as the Build.

The third is acceptance criteria as the contract: the AC, written before the code, becomes the contract the test enforces and the agent works against. The agent can’t mark its own homework if the homework was set by the team and the marking is done against a test the team owns. The other two pieces depend on this one; without it, they have nothing to hold on to.
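The three flags named under traceability (code with no AC behind it, an AC with no test, a test with no AC) can be checked mechanically once the links are recorded somewhere. The sketch below assumes they are available as plain mappings. The function name, the identifiers, and the data shapes are illustrative assumptions, and it covers only the code-to-test-to-AC part of the chain, not stories or requirements.

```python
from typing import Dict, List, Optional, Set


def find_flags(
    criteria: Set[str],                   # known AC ids
    tests_to_ac: Dict[str, Optional[str]],  # test name -> AC id it enforces
    code_to_tests: Dict[str, List[str]],    # code unit -> tests that cover it
) -> List[str]:
    """Return one message per broken link in the code -> test -> AC chain."""
    flags: List[str] = []

    # An AC counts as tested only if some test points at it.
    tested = {ac for ac in tests_to_ac.values() if ac in criteria}
    for ac in sorted(criteria - tested):
        flags.append(f"AC with no test: {ac}")

    # A test pointing at nothing, or at an unknown AC, is decorative.
    for test in sorted(tests_to_ac):
        if tests_to_ac[test] not in criteria:
            flags.append(f"test with no AC: {test}")

    # Code is anchored only if at least one covering test traces to a real AC.
    for unit in sorted(code_to_tests):
        tests = code_to_tests[unit]
        if not any(tests_to_ac.get(t) in criteria for t in tests):
            flags.append(f"code with no AC behind it: {unit}")

    return flags


flags = find_flags(
    criteria={"AC-1", "AC-2"},
    tests_to_ac={"test_login": "AC-1", "test_misc": None},
    code_to_tests={
        "auth.login": ["test_login"],
        "auth.retry": ["test_misc"],
        "auth.debug": [],
    },
)
for message in flags:
    print(message)
```

In this example `auth.retry` is flagged even though a test covers it, because that test enforces no criterion.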

None of this is new advice. It’s what good engineering teams have always done, when they were allowed to. The new thing is that the activity around it has collapsed, and the discipline is now the whole job. Codebase-scale drift is what becomes visible when a team takes the speed and leaves its engineering practice unchanged. Methodology is what makes the discipline survive the speed.

Coding-session-scale drift

Session-scale drift is a different animal, and it moves much faster. Codebase-scale drift takes weeks to become visible; this one can happen inside a single afternoon. A long thread with an agent fills its context window as it runs, and the working understanding of the task wanders from wherever it started. References the agent held early in the thread lose weight as the window fills. The model starts inventing what “we agreed” earlier in the conversation, or drifts past acceptance criteria that were plainly in scope on turn one.

It shows up in a small set of recognisable patterns.

Drift from the spec. The eighth reply in the thread is solving a slightly different problem from the one posed in the first. No single turn was wrong; each one nudged the target a little.

Reference decay. The constraint you set at the top of the thread three hours ago isn’t in the model’s working set any more, and the code it just wrote doesn’t honour it.

Invented agreement. The model says “as we discussed” about something the two of you never discussed; the summary it’s carrying of its own thread has started filling gaps with plausible invention.

Silent scope creep. The fix now touches three modules the ticket never mentioned, and nobody flagged the moment it happened.

The fix here is session hygiene, not team discipline, and the two don’t transfer. Short, focused threads that end and get replaced, rather than one thread that runs all day. Hard edges pinned in the prompt: the requirement, the AC, the trace back to it, cited explicitly rather than paraphrased from memory. The contract the AC set at turn one is the contract the reply at turn eight has to satisfy; re-cite it every few turns, don’t trust the model to still be carrying it. Check the output against the AC before it ships, not against how confident the last reply sounded.
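The re-citing habit can be made mechanical. The following is a rough sketch under stated assumptions, not part of any tool: `build_prompt`, the every-third-turn cadence, the twelve-turn cap, and the `REQ-7` and `AC-1` identifiers are all hypothetical. It pins the requirement and the AC text verbatim at the top of the message on the first turn and at a fixed interval after that, and refuses to continue once the thread has run past the cap.

```python
from typing import List


def build_prompt(
    request: str,
    turn: int,
    requirement: str,
    criteria: List[str],
    recite_every: int = 3,   # assumed cadence; tune per team
    max_turns: int = 12,     # assumed cap on thread length
) -> str:
    """Prefix the request with the verbatim spec on turns 1, 1 + n, 1 + 2n, ..."""
    if turn < 1:
        raise ValueError("turns are numbered from 1")
    if turn > max_turns:
        # Short threads that end and get replaced, rather than one that runs all day.
        raise RuntimeError("thread has run long; start a new one with the spec pinned")
    if (turn - 1) % recite_every != 0:
        return request
    pinned = [f"Requirement: {requirement}", "Acceptance criteria (verbatim):"]
    pinned += [f"- {ac}" for ac in criteria]
    return "\n".join(pinned + ["", request])


criteria = ["AC-1: export includes archived rows"]
print(build_prompt("Add the CSV export.", 1, "REQ-7", criteria))
```

The criteria are passed in as stored text, so what gets re-cited is what was written down and not a paraphrase reconstructed from the thread.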

RCF’s chain is what supplies the anchor at this scale too. An acceptance criterion is a fixed point the agent’s output can be checked against regardless of how long the thread has run, and the further a requirement traces back, the less it depends on what any one session happened to retain. See acceptance criteria as the contract and traceability.

The mechanisms behind session-scale drift are covered in more detail in the page “Context engineering. The model isn’t wrong. It’s weighting the wrong thing.” That page names four ways a session’s context window goes bad: density, similarity, drift, and pollution.

AI drift versus model drift, prompt drift, and LLM drift

“Drift” is a loaded word in 2026, and most of the search traffic around it lands on a different problem again. Worth being explicit about which is which, even after the two scales above.

Model drift is the trained-model performance problem. A classifier trained on 2024 data starts misclassifying 2026 inputs because the world moved. The fix is retraining, monitoring, and the MLOps toolchain. Owners: ML engineers, data scientists.

Prompt drift and LLM drift are the agentic-system variants: the same prompt or the same model behaving differently across runs or across model versions, with downstream effects on agent reliability. The fix is evals, observability, and version-pinning. Owners: ML platform teams, agentic-systems engineers.

AI drift, in the two senses this page uses, is the team-and-codebase problem and the single-session problem. Discipline weakness made loud by speed, at one scale; a working thread coming loose from its spec, at the other. The fix in both cases is methodology rather than monitoring: the chain, the cycle, the contract, and, at the session scale, the hygiene that keeps a thread anchored. Owners: the engineering organisation, the tech leads, the heads of engineering, and the people running the session day to day. The tools are documents and reviews, not dashboards.

All of them describe real phenomena. They share a word because the underlying intuition (something that worked is no longer working, and the deviation accumulates) is the same. They share almost nothing else. If you arrived looking for the MLOps version, the canonical references live with the major MLOps vendors and the model providers’ own agent-engineering write-ups. If you arrived looking for either of the two senses this page treats, the rest of the RCF methodology is what this page leads to.