Operating costs · Cost guide 2026

What does a running AI agent actually cost? Tokens, inference, the monthly bill

The build is one thing. What does the agent cost every month once it's live? We break the cost into tokens, inference, and infrastructure, show the formula, and walk through the three levers that keep the bill in check.

Most conversations about AI agents stop at the build price. That matters, but it's incomplete. An agent, unlike a website or an app, costs money every time it works. The question that actually decides ROI is: what will you see on the model bill each month once the agent is live?

Three parts of a running agent's cost

  1. Inference (tokens). Every call to the model costs money, billed per input and output token. Usually the largest and most variable line.
  2. Infrastructure. Hosting, vector store, queues, monitoring. Relatively fixed and predictable.
  3. Maintenance. An engineer's time for evaluation, prompt fixes, edge cases. For us this is a separate, explicit retainer.

This article focuses on the first, because it's the one that surprises people.

How to count token cost

The formula is simple:

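cost/month = number of queries × (input tokens × input price per token + output tokens × output price per token)

Input and output tokens are usually billed at different rates, with output tokens the more expensive of the two, so each gets its own price.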

The devil is in "input tokens." If every call resends the full conversation history, a large system prompt, and ten context documents, a single query can run to tens of thousands of input tokens before the model even answers.
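
To make the effect of input tokens concrete, here is a minimal sketch of the formula in Python. The function name, the token counts, and the prices (3 USD per million input tokens, 15 USD per million output tokens) are hypothetical placeholders, not any provider's actual rates. The sketch also assumes input and output tokens are priced separately, which you should check against your own provider's price list.

```python
def monthly_inference_cost(queries, input_tokens, output_tokens,
                           input_price_per_mtok, output_price_per_mtok):
    """Estimated monthly inference cost in USD.

    Prices are expressed per million tokens (an assumed convention).
    """
    tokens_cost = (input_tokens * input_price_per_mtok
                   + output_tokens * output_price_per_mtok)
    return queries * tokens_cost / 1_000_000


# Hypothetical prices: 3 USD / 1M input tokens, 15 USD / 1M output tokens.
INPUT_PRICE, OUTPUT_PRICE = 3.0, 15.0

# Same 3,000 queries and the same short answer; only the input context differs.
lean = monthly_inference_cost(3_000, 4_000, 300, INPUT_PRICE, OUTPUT_PRICE)
bloated = monthly_inference_cost(3_000, 40_000, 300, INPUT_PRICE, OUTPUT_PRICE)

print(f"lean context:    {lean:.2f} USD/month")     # 49.50
print(f"bloated context: {bloated:.2f} USD/month")  # 373.50
```

With these assumed numbers, sending ten times the input context makes the bill roughly 7.5 times larger, even though the answers are identical in length.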

A worked example

An email-triage agent: 3,000 queries a month, each with a few thousand input tokens of context and a short answer. Depending on the model chosen, the inference bill lands roughly in these bands:

| Volume | Queries/month | Inference cost/month (illustrative) |
| --- | --- | --- |
| Small process | ~1,000 | tens of USD |
| Medium process | ~3,000–10,000 | hundreds of USD |
| Large / multi-tool | 50,000+ | thousands of USD |

The difference between "a cheaper model where it suffices" and "the strongest model for everything" is often a multiple of this bill. That's why model choice isn't a detail.

Who pays for inference

For us, inference is a separate, usage-based cost, independent of the build and maintenance fee. We usually run it on your own API accounts or in your cloud, so you see the real bill with no markup from us. The exact billing model is set in the proposal. The logic is simple: a cost that scales with volume should be transparent, not hidden in a flat fee.

Three levers that keep the bill in check

  1. Model routing per task. Classification and extraction don't need the most expensive model. We reserve the strong model for where reasoning accuracy matters.
  2. Context discipline. We don't resend the full history on every call. Fewer input tokens means a smaller bill and less quality drift.
  3. Cache and reuse. We cache repeated queries and stable context instead of paying for them again.
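
The three levers can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not our production setup: the model tiers ("small", "strong"), their prices, the task labels, and the `call_model` callback are all hypothetical, and the cache shown is a plain exact-match lookup. Token counts are passed in by hand to keep the example self-contained; a real system would read them from the provider's usage data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_mtok: float   # USD per million input tokens (hypothetical)
    output_per_mtok: float  # USD per million output tokens (hypothetical)


PRICES = {
    "small": ModelPrice(0.5, 2.0),
    "strong": ModelPrice(5.0, 20.0),
}
SIMPLE_TASKS = {"classification", "extraction"}


def route(task_type):
    """Lever 1: send simple tasks to the cheaper model."""
    return "small" if task_type in SIMPLE_TASKS else "strong"


def trim_history(messages, keep_last=4):
    """Lever 2: send only the most recent turns, not the full history."""
    return messages[-keep_last:]


def query_cost(model, input_tokens, output_tokens):
    price = PRICES[model]
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


class CachedAgent:
    """Lever 3: pay for a repeated query only once."""

    def __init__(self, call_model):
        self._call_model = call_model  # hypothetical callback: (model, prompt) -> str
        self._cache = {}
        self.spent = 0.0

    def answer(self, task_type, prompt, input_tokens, output_tokens):
        key = (task_type, prompt)
        if key in self._cache:
            return self._cache[key]  # no model call, no cost
        model = route(task_type)
        result = self._call_model(model, prompt)
        self.spent += query_cost(model, input_tokens, output_tokens)
        self._cache[key] = result
        return result


# Stand-in for a real model call, so the sketch runs without any API.
calls = []

def fake_model(model, prompt):
    calls.append(model)
    return f"{model}:{prompt}"


agent = CachedAgent(fake_model)
agent.answer("classification", "Is this invoice urgent?", 2_000, 20)
agent.answer("classification", "Is this invoice urgent?", 2_000, 20)  # cache hit
print(calls, f"{agent.spent:.5f} USD")
```

The second identical query never reaches the model, so `calls` holds a single entry and the spend stays at the cost of one small-model call.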

We track the effect as Cost Per Query, one of the metrics we report in maintenance. It turns the conversation from "it'll be fine" into "it costs X per query, and here's how we bring it down."

The decision threshold

The most important number isn't technical, it's commercial: the agent's cost per case versus the cost of the human it replaces or relieves. If the agent costs more per case than the person it was meant to relieve, the audit verdict is "don't build." We'd rather say that before code than after the budget is spent on a pilot that never made sense.
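
As a rough sketch, the threshold check comes down to two divisions and a comparison. All figures below are made up for illustration. Counting infrastructure and maintenance alongside inference in the agent's cost per case is our assumption for this example; the function names are hypothetical too.

```python
def agent_cost_per_case(inference, infrastructure, maintenance, cases_per_month):
    """Monthly running costs (USD) spread over the cases handled that month."""
    return (inference + infrastructure + maintenance) / cases_per_month


def human_cost_per_case(hourly_rate, minutes_per_case):
    """Cost (USD) of a person handling one case by hand."""
    return hourly_rate * minutes_per_case / 60


def audit_verdict(agent_cost, human_cost):
    """'don't build' when the agent costs more per case than the person."""
    return "don't build" if agent_cost > human_cost else "candidate"


# Hypothetical inputs: 3,000 cases a month, 3 minutes of human time per case.
agent = agent_cost_per_case(inference=300, infrastructure=200,
                            maintenance=1_500, cases_per_month=3_000)
human = human_cost_per_case(hourly_rate=30, minutes_per_case=3)

print(f"agent: {agent:.2f} USD/case, human: {human:.2f} USD/case")
print(audit_verdict(agent, human))
```

The same check run at 300 cases a month instead of 3,000 flips the verdict, because the fixed costs are spread over far fewer cases. Volume matters as much as the model bill.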

How this works with us

We estimate inference cost in the audit on your real volume and data. No "it depends" without a number behind it.

Have a process you want a real cost-per-query on? Book a 30-minute call.