Most AI Agent Problems Are Infrastructure Problems
-
Taylor Brooks - 05 Apr, 2026
I think a lot of teams are optimizing the wrong layer.
When an AI workflow breaks, the first instinct is usually to swap the model. Try OpenAI. Try Anthropic. Try a new prompt. Try a new framework. Maybe that helps. Usually it doesn’t fix the real problem.
Most AI agent problems are infrastructure problems.
The model is the visible part, so it gets all the attention. But in practice, the failures usually happen in the seams. A tool call times out. A background job runs too long. A session loses state. A retry happens in the wrong place and duplicates work. One flaky dependency turns a clean demo into a system that quietly dies halfway through the job.
That stuff is not sexy, but it’s the whole game.
The demo works, the system doesn’t
A lot of AI products look good in a five-minute demo because the happy path is easy to stage.
You give the model a clear instruction. It calls the right tool. The data comes back clean. The output looks smart. Everyone nods.
Then real usage starts.
Now inputs are messier. APIs are slower. Credentials expire. One tool returns malformed JSON. Another gives you a 429. A user asks for a task that takes 20 minutes instead of 20 seconds. Suddenly the question isn’t whether the model is smart. The question is whether the system can survive contact with reality.
That’s why I keep coming back to the same point: reliability matters more than cleverness.
If you want a useful AI system, I think you need a few boring things before you need a better prompt.
What actually matters
First, you need retries that aren’t stupid.
Not infinite retries. Not blind retries. Real retries with limits, backoff, and some awareness of what failed.
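As a rough sketch of what "not stupid" means, here's one way it could look in Python. The exception types and parameters are illustrative assumptions, not a prescription: cap the attempts, back off exponentially with jitter, and only retry failures that are actually transient.

```python
import random
import time

# Assumption for illustration: timeouts and connection drops are transient;
# anything else (bad auth, malformed request) is not worth retrying.
RETRYABLE = (TimeoutError, ConnectionError)

def call_with_retry(fn, *, max_attempts=4, base_delay=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_attempts:
                raise  # give up loudly instead of looping forever
            # exponential backoff with jitter so retries don't stampede
            delay = base_delay * (2 ** (attempt - 1)) * (0.5 + random.random())
            time.sleep(delay)
        # non-retryable errors propagate immediately: retrying them
        # would just duplicate work or hammer a broken dependency
```

The key detail is the distinction between retryable and non-retryable failures. A blind retry on a 401 doesn't fix anything; it just delays the real answer.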
Second, you need state.
If a workflow has already finished steps one through four, it should not start over just because step five broke. It should know where it is, what already succeeded, and what still needs attention.
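A minimal sketch of that idea, assuming a simple JSON checkpoint file (the step names and file path here are made up for illustration): record each step as it succeeds, and skip anything already recorded on the next run.

```python
import json
from pathlib import Path

def run_workflow(steps, checkpoint=Path("run_state.json")):
    # load the list of steps that already succeeded, if any
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    for name, fn in steps:
        if name in done:
            continue  # steps one through four don't rerun when five breaks
        fn()  # may raise; everything finished so far stays checkpointed
        done.append(name)
        checkpoint.write_text(json.dumps(done))  # persist after each success
    return done
```

Real systems would want atomic writes and per-step outputs, not just names, but the shape is the point: the workflow knows where it is, and a failure in step five costs you step five, not the whole run.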
Third, you need supervision.
Long-running work needs checkpoints. It needs status. It needs a way to surface, “here’s what happened, here’s what’s blocked, here’s what I’m doing next” without making the user babysit every move.
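That three-part status report can be as simple as a small data structure the agent fills in at each checkpoint. This is a hypothetical shape, not any particular framework's API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunStatus:
    completed: list = field(default_factory=list)   # here's what happened
    blocked: dict = field(default_factory=dict)     # here's what's blocked, and why
    next_step: Optional[str] = None                 # here's what I'm doing next

    def summary(self) -> str:
        # render a short, human-readable report the UI can surface
        lines = [f"done: {', '.join(self.completed) or 'nothing yet'}"]
        for step, reason in self.blocked.items():
            lines.append(f"blocked: {step} ({reason})")
        if self.next_step:
            lines.append(f"next: {self.next_step}")
        return "\n".join(lines)
```

The point isn't the class; it's that status is a first-class output of the system, not something the user has to infer from silence.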
Fourth, you need graceful degradation.
If the ideal path fails, the system should still have a second move. Maybe it waits. Maybe it falls back. Maybe it asks for help at the right moment instead of crashing into a wall and pretending it completed the task.
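One possible shape for that second move, sketched with made-up strategy names: walk an ordered list of fallbacks, remember why each one failed, and escalate to a human with that context instead of silently claiming success.

```python
def run_with_fallbacks(task, strategies, ask_for_help):
    errors = []
    for name, strategy in strategies:
        try:
            return strategy(task)  # first strategy that works wins
        except Exception as err:
            errors.append(f"{name}: {err}")  # remember why this path failed
    # no path worked: surface the failure honestly, with the full history,
    # instead of crashing into a wall and pretending the task completed
    return ask_for_help(task, errors)
```

The error history matters: "I need help, and here's everything I tried" is a very different user experience from a bare stack trace or, worse, a fabricated "done."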
None of this is glamorous. That’s exactly why it matters.
Model switching is not a strategy
I like good models. I’m happy to use better ones whenever they show up.
But “we’ll just switch models” is not an operating plan. It’s a coping mechanism.
If your system depends on every tool succeeding instantly, every API staying stable, and every run finishing on the first try, you’re not building an agent. You’re building a brittle chain of lucky events.
The teams that win here won’t just have access to strong models. Everyone will have that.
The teams that win will build the execution layer around the model. They’ll know how to recover work, route around failures, preserve context, and keep moving when the world gets noisy.
That’s the real moat.
It’s the same lesson I wrote about in Nobody Cares About Your AI. The flashy part gets attention. The useful part solves the problem.
The boring work is the product
I don’t think the future belongs to the teams with the most impressive demos.
I think it belongs to the teams that make AI feel dependable.
The ones that make a user trust that the work will finish. The ones that handle failure without turning the user into unpaid QA. The ones that treat recovery, observability, and orchestration like product features, because they are.
According to McKinsey, the economic upside of generative AI is massive. I buy that.
But I don’t think most of that value comes from demos that look magical on day one. I think it comes from systems that keep working on day one hundred.
That’s a much less glamorous story.
It’s also the one worth building.