The pager goes off and the room changes. I spent ten years on that side of the alert as a Site Reliability Engineer, the person who gets called when a system breaks and money burns by the minute. At Google I handled escalations, the failures too big for the first team to contain. My job was to direct the response and write the updates leadership actually reads. Short. Honest. No jargon, while the system was still down.
You learn fast what matters in that room. Nobody asks how elegant the architecture was. They ask when the system comes back and how we keep this from happening again. Every minute has a cost, and everyone in the room can feel it. Pressure like that is a curriculum. It teaches you that systems never fail politely. They fail at the worst hour, in the strangest way, with everyone watching.
Today I ship AI systems for businesses as a Fractional Chief of AI in Miami. Different decade, same physics. The model does not decide whether your AI survives contact with real customers. The discipline around the model decides. Reliability is a product feature. You build it on purpose or you do not have it.
Every demo works
Here is something every vendor knows and few will say out loud. Every AI demo works. The vendor picks the question, the data is clean, and the room wants to believe. Five minutes later, everyone does.
If you own the business, this is the moment of maximum danger. You just watched it work, so you are ready to wire it into sales, support, and the phones. But the demo answered one question, one time, on a good day. Your business asks ten thousand questions a week, and some of them arrive angry.
Production is a different country. A customer types something nobody predicted. An API times out in the middle of an answer. The model invents a policy you never wrote and offers a refund you never approved. Nobody trained it to do that. Nobody trained it not to.
I have lived on both sides of this gap. At Google I watched planet-scale systems humble brilliant engineers. Building voice agents, I learned what 50,000 real calls teach you and no demo ever will. The distance between demo and production is not a step. It is the whole job.
Give your AI an error budget
SRE runs on a tool called the error budget. In plain words: you decide in advance how much failure you will accept, and you write it down. Google does not promise its systems will never fail. It promises a number, then manages to the number. That honesty is the foundation everything else stands on.
Your AI deserves the same deal. Before launch, answer three things. How often is this system allowed to be wrong? What counts as wrong, a clumsy sentence or a wrong price? And what happens the day it crosses the line?
Decide in advance how often the system is allowed to be wrong, and what happens when it is.
Pick a number you can defend. Out of every hundred conversations, how many misses can you absorb before the math turns against you? Write that number down where leadership can see it, because one written number does more for an AI program than any model upgrade.
Then plan the response. The agent can hand off to a human after two failed turns. A bad week can trigger a rollback, which means returning to the last version that worked. The feature can pause until you find the cause. Any of these work. The only wrong answer is deciding nothing.
Most teams launch on hope. Then the first ugly screenshot lands in a group chat, leadership panics, and the project dies in a day. Not because the system was bad. Because nobody agreed, in advance, on what bad meant. An error budget turns a crisis into a procedure.
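Here is a minimal sketch of that procedure in Python. It is an illustration under stated assumptions, not a production implementation: the class name ErrorBudget, the function next_action, the action labels, and the 3 percent figure are all hypothetical. The two-failed-turn handoff and the rollback come from the options described above.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Tracks misses against a miss rate agreed on before launch."""

    allowed_miss_rate: float  # hypothetical example: 0.03 = 3 misses per 100 conversations
    conversations: int = 0
    misses: int = 0

    def record(self, missed: bool) -> None:
        """Count one finished conversation and whether it was a miss."""
        self.conversations += 1
        if missed:
            self.misses += 1

    @property
    def miss_rate(self) -> float:
        """Observed share of conversations that were misses (0.0 before any traffic)."""
        return self.misses / self.conversations if self.conversations else 0.0

    def exhausted(self) -> bool:
        """True once the observed miss rate is above the agreed budget."""
        return self.miss_rate > self.allowed_miss_rate


def next_action(budget: ErrorBudget, failed_turns: int, max_failed_turns: int = 2) -> str:
    """Turn the written policy into one decision for the current conversation."""
    if budget.exhausted():
        return "rollback"  # budget spent: return to the last version that worked
    if failed_turns >= max_failed_turns:
        return "handoff_to_human"  # this conversation is going badly: escalate it
    return "continue"


budget = ErrorBudget(allowed_miss_rate=0.03)
for i in range(100):
    budget.record(missed=(i < 2))  # 2 misses in 100 conversations, inside the budget

print(next_action(budget, failed_turns=0))
print(next_action(budget, failed_turns=2))
```

On small samples the observed rate is noisy, so a real version would probably wait for a minimum number of conversations before triggering a rollback. The point of the sketch is only that the number and the response are written down before the first ugly screenshot arrives.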
Instrument it like production
At Google, nothing shipped blind. A service had logs, dashboards, and alerts before it had users. A log is a flight recorder: a record of every input and output, so you can replay exactly what happened. Most AI deployments I review have nothing like it. The agent talks to customers all day, and nobody can replay a single conversation.
The minimum bar is short. Log everything in and out. Build a fallback, the safe default the system drops to when it breaks, so the customer gets a graceful handoff instead of silence. Wire an escalation path, a clear route to a real human, so hard cases get caught early instead of compounding.
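Those three pieces fit in a few lines. The sketch below is a rough Python illustration, assuming the agent is any callable that takes the customer's text and returns a reply. The names handle_turn, FALLBACK_REPLY, escalate, and the conversations.jsonl file are hypothetical, chosen only for this example.

```python
import json
import time
from typing import Callable

# Hypothetical safe default the customer hears when the agent breaks.
FALLBACK_REPLY = "I want to get this right, so I am connecting you with a member of our team."


def handle_turn(
    conversation_id: str,
    user_text: str,
    agent: Callable[[str], str],
    escalate: Callable[[str, str], None],
    log_path: str = "conversations.jsonl",
) -> str:
    """Run one turn: log input and output, fall back on failure, escalate to a human."""
    record = {"ts": time.time(), "conversation_id": conversation_id, "input": user_text}
    try:
        reply = agent(user_text)
        record["status"] = "ok"
    except Exception as exc:
        # The agent failed: give the customer a graceful handoff instead of silence.
        reply = FALLBACK_REPLY
        record["status"] = "fallback"
        record["error"] = repr(exc)
        escalate(conversation_id, repr(exc))  # route the case to a real human
    record["output"] = reply
    # Append-only JSON Lines file: one replayable record per turn.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return reply
```

This only catches failures that raise an error, such as a timeout. An agent that answers confidently and wrongly raises nothing, which is why the log matters: it is the only way to find those answers after the fact.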
Put the system’s performance on a dashboard someone checks every week, because a silent failure is the most expensive kind. An agent that quietly mishandles one call in ten can run for a month before a human notices. By then the damage has a month of compound interest.
Then ask the pager question. When the agent tells a customer something wrong, who gets woken up? Name the person. If the answer is nobody, you do not have a production system. You have a demo with traffic.
This is an architecture decision, not an afterthought. I broke down how I structure production voice agents, and the honest ratio surprises people. The prompt is a third of the build. The rest is the plumbing that catches failure before the customer ever feels it.
Boring beats clever
The habits that keep planet-scale systems alive are not exciting. A rollback plan: before you change anything, know exactly how you will undo it. A canary release: send the new version to a small slice of traffic first, watch it, then widen. A blameless postmortem: after a failure, write down the cause and the fix, and never punish the people involved, including whoever reported it.
I watched the same law hold while doing machine learning at BMW and marketing science at Fashion Nova. Scale finds every weakness, and it never gets tired. Cleverness does not save you at scale. Process does.
So I carried those habits straight into the agent systems I run at iExcel. A new prompt version reaches a small slice of conversations before it reaches all of them. Every change ships with its undo attached. Every failure becomes a written lesson the next build inherits.
The no-blame part is not kindness. It is engineering. People who fear punishment hide problems, and hidden problems compound until they pick their own moment to surface. Google built that rule into its culture for a reason. I keep it because it works.
Clever is fragile. Boring survives, and surviving is the entire assignment. When someone shows me a brilliant AI system with no rollback plan, I do not see brilliance. I see an outage that has not happened yet.
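The canary and rollback habits from this section can be sketched in a few lines of Python. This is one common way to do it, not a description of any particular system: the function names bucket and pick_prompt_version and the version labels are hypothetical. Each conversation ID is hashed into one of 100 buckets, and only the buckets below the canary percentage see the new prompt.

```python
import hashlib


def bucket(conversation_id: str) -> int:
    """Map a conversation ID to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_prompt_version(
    conversation_id: str,
    stable: str,
    canary: str,
    canary_percent: int,
) -> str:
    """Send roughly canary_percent of conversations to the new prompt version."""
    return canary if bucket(conversation_id) < canary_percent else stable


# Hypothetical rollout: start the new prompt at 5 percent of conversations.
version = pick_prompt_version("call-000123", stable="prompt-v7", canary="prompt-v8", canary_percent=5)
print(version)
```

Because the bucket depends only on the conversation ID, the same conversation always gets the same version, which keeps the logs comparable. The undo ships with the change: setting canary_percent to 0 sends every conversation back to the stable version.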
What to demand before anyone ships AI for you
You do not need to become an engineer to hold this line. You need six questions and the will to ask them. Use them on any vendor, any agency, and any internal team that wants to put AI in front of your customers. I use them on myself.
- A written error budget. How often it can be wrong, and what happens when it is.
- Replayable logs. Every conversation recorded and reviewable, not a black box.
- A fallback. Exactly what the customer experiences when the system fails.
- A named human. Who takes the call at night when it breaks. A name, not a team.
- A rollback plan. How you get back to yesterday in minutes, not days.
- A postmortem habit. A written lesson after every failure, with no blame attached.
Good answers to all six mean you found an operator. Vague answers mean you found a demo. And if they call the questions overkill, they have never carried the pager. Keep looking.
I wrote about why a fractional Chief of AI beats a full-time hire right now, and this essay is the other half of that argument. Strategy gets you started. Discipline keeps you running.
If you read that list and started checking boxes for your own company, we should talk. I’m the Fractional Chief of AI for businesses that know they’re behind, and I take you from watching AI happen to running on it in 90 days, with this discipline built in from day one. Book a strategy call.