Cost Intelligence
Cost intelligence turns every governed AI request into a small cost ledger. Each trace records what model ran, how many tokens it used, how long it took, what it cost, and whether that cost made sense for the work being done.
This is not a billing processor and it does not charge your customers. During closed beta, treat Olyx cost data as an operational estimate based on the model rates you configure, similar to how you would use observability data to understand infrastructure spend before reconciling the final provider invoice.
Cost tracking is only accurate after you define token rates in the Model Registry. On every model call, Olyx multiplies input and output tokens by the rates you configured, then sums those figures across all steps in the trace. That gives you spend at the step, trace, model, project, and infrastructure levels without adding separate instrumentation to every call site.
The Cost Objects
If you are new to LLM cost tracking, start with these objects. They are the cost equivalent of a payment ledger: small, explicit records that can be inspected after the request finishes.
| Object | What it means | Where you see it |
|---|---|---|
| Model rate | The input and output token price for one model. You define it. | Model Registry |
| Step | One unit of work inside a trace, usually validation, routing, model execution, or tool handling. | steps[] |
| Run step | A step that actually calls a model and can carry token cost. | steps[].cost |
| Trace | One user-visible action, such as “summarize this PDF” or “answer this chat turn.” | summary.total_cost |
| Revenue | Optional amount you earned for the request, supplied by your app. | summary.revenue |
| Gross margin | revenue - total_cost; useful for unit economics, not invoicing. | summary.gross_margin |
| Grade | A letter signal that compares cost and latency to expected behavior. | summary.grades |
The important rule: cost belongs to traces, not pages or users by default. Attach metadata such as user_id,
feature, plan, or tenant_id when you create the trace if you want to roll costs up later.
Setup Flow
For a first integration, wire costs in this order:
- Register each model in the Model Registry.
- Add input and output token rates for each model you want to track.
- Create a trace for the user action.
- Execute one or more model calls against that trace.
- Complete the trace so Olyx can calculate grades.
- Read summary.total_cost, summary.grades, and summary.by_model from the completion response.
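// Create the trace first so every model call can be attributed to it.
const trace = await client.traces.create({
  metadata: {
    userId: "u_123",
    feature: "support_summarizer",
    plan: "team",
  },
  revenue: 0.20, // optional: what you earned for this request
});

// Execute the model call against the trace.
const result = await client.execute({
  traceId: trace.data.id,
  input: "Summarize this support ticket for the account manager.",
});

// Complete the trace so Olyx can total the step costs and calculate grades.
const completion = await client.traces.complete(trace.data.id);
console.log(completion.data.totalCost);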
The trace is the durable record. The model response is useful to your product; the trace is useful to your operators, finance lead, and future routing policy.
How Costs Are Calculated
In modern AI applications, a single user request rarely equals a single model call. Requests often fan out—triggering tool loops, fallback routing, or multi-agent evaluations.
To provide an accurate financial audit trail, cost is calculated dynamically per step (every individual model execution) and then aggregated at the trace level (the overall user action).
Step Cost = ((Input Tokens ÷ 1,000) × Input Rate) + ((Output Tokens ÷ 1,000) × Output Rate)
Trace Total = Sum of all step costs within the trace
This bottom-up calculation gives you a precise breakdown, even when a single trace spans multiple models with different pricing tiers.
What Counts as Cost
Not every step costs money. Safety checks and routing decisions usually cost 0 because they run inside Olyx.
Model execution steps carry cost because they consume provider tokens or private inference capacity.
| Step type | Usually has cost? | Why |
|---|---|---|
| check | No | PII and policy validation happen before model execution. |
| route | No | Olyx chooses a model tier and records the decision. |
| run | Yes | A provider or private model produced output. |
| tool_call | Usually no | The model requested a tool; your app still decides how to execute it. |
| tool_result | Usually no | Your app returned tool output back into the trace. |
If a trace has three run steps, the trace can have three separate costs. This is why Olyx reports both
steps[].cost and summary.total_cost.
Example: Multi-Step Trace
Imagine a user asks a complex question that requires the model to first call a database tool, and then summarize the result. This creates a
single trace with two distinct run steps on a model (rates: $0.005 per 1k input / $0.015 per 1k output).
- Step 1 (Tool Call): Uses 800 input tokens and 200 output tokens.
- Step 2 (Final Summary): Uses 400 input tokens and 100 output tokens.
Step 1: (800 / 1000 × $0.005) + (200 / 1000 × $0.015) = $0.0040 + $0.0030 = $0.0070
Step 2: (400 / 1000 × $0.005) + (100 / 1000 × $0.015) = $0.0020 + $0.0015 = $0.0035
Total = $0.0105
Both individual step costs are preserved in the API response under steps[].cost, while the $0.0105 sum lands
in the trace’s summary.total_cost.
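The arithmetic above can be reproduced in a few lines. This is a minimal sketch of the step-cost formula, not Olyx's implementation. The `step_cost` helper and the shape of `run_steps` are illustrative assumptions, and `Decimal` is used only to keep the small dollar amounts exact.

```python
from decimal import Decimal


def step_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost of one run step; rates are dollars per 1,000 tokens."""
    thousand = Decimal(1000)
    input_cost = (Decimal(input_tokens) / thousand) * input_rate
    output_cost = (Decimal(output_tokens) / thousand) * output_rate
    return input_cost + output_cost


# Rates from the example: $0.005 per 1k input, $0.015 per 1k output.
INPUT_RATE = Decimal("0.005")
OUTPUT_RATE = Decimal("0.015")

# Hypothetical in-memory shape for the two run steps in the example.
run_steps = [
    {"name": "tool call", "input_tokens": 800, "output_tokens": 200},
    {"name": "final summary", "input_tokens": 400, "output_tokens": 100},
]

step_costs = [
    step_cost(s["input_tokens"], s["output_tokens"], INPUT_RATE, OUTPUT_RATE)
    for s in run_steps
]
trace_total = sum(step_costs, Decimal(0))

for step, cost in zip(run_steps, step_costs):
    print(f"{step['name']}: ${cost}")
print(f"trace total: ${trace_total}")
```

The trace total is the plain sum of the step costs, which is why a trace that spans several models needs no special handling: each step is priced at its own model's rates before the sum.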
Why Step-Level Granularity Matters:
- Tool Calling Loops: If an agent gets stuck in a loop calling the same tool five times, step-level tracking isolates the exact cost of that loop, rather than hiding it in a blended trace total.
- Fallback Routing: If a request attempts a cheaper model, then falls back to a larger model, each successful run step is recorded separately. If a provider charges for partial failures, your provider invoice remains the final source of truth; Olyx records the cost data it receives or calculates from completed steps.
- Prompt Caching: Many providers offer discounts for cached input tokens. Step-level accounting makes those discounts visible when token usage or model rates reflect cached-input pricing.
Infrastructure Breakdown
On the dashboard, costs are split by infrastructure type so you can compare public API spend against private inference capacity. This helps answer a practical question: “Are we paying a provider for this request, or did it run on our own infrastructure?”
| Bucket | Examples | How Olyx classifies it |
|---|---|---|
| public_cloud | OpenAI, Anthropic, Mistral, Bedrock | Registered public provider model |
| private | Ollama, vLLM, LM Studio, internal OpenAI-compatible gateway | Registered internal/private model |
The split depends on how you register the model. If a private vLLM endpoint is registered as a public provider, the cost report will treat it like public spend. Keep the Model Registry accurate so cost reports stay useful.
Key Differentiators
- Public Cloud Monitoring: Tracks direct API spend and usage patterns across third-party providers.
- Private/VPC Tracking: Shows how much traffic is going through private inference paths, including agent-backed routes.
- Internal Chargebacks: For self-hosted models, you can define internal rates in the Model Registry. These rates can represent GPU amortization, team chargeback, or an estimated per-token operating cost.
- Benchmarking: Once public and private models both have rates, you can compare whether a private route is actually cheaper for a specific workload.
- Private Gateway Validation: For VPC-native deployments, use the outbound Olyx Agent path so private endpoints stay private and sensitive workloads fail closed when no private route is available.
Example Infrastructure Summary
{
"summary": {
"by_infrastructure": {
"public_cloud": 0.01872,
"private": 0.00410
}
}
}
This means the trace spent $0.01872 through public providers and $0.00410 through private infrastructure. It does
not mean Olyx charged either amount; it means the trace ledger attributed those operational costs to each bucket.
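To turn the two bucket totals into a public-versus-private share, a small helper is enough. This is a sketch that works on the summary JSON shown above; `infrastructure_share` is a hypothetical helper, not part of an Olyx SDK.

```python
def infrastructure_share(by_infrastructure):
    """Return each bucket's fraction of the trace's total cost."""
    total = sum(by_infrastructure.values())
    if total == 0:
        # A trace with no run steps has nothing to split.
        return {bucket: 0.0 for bucket in by_infrastructure}
    return {bucket: cost / total for bucket, cost in by_infrastructure.items()}


# Same values as the example infrastructure summary.
summary = {"by_infrastructure": {"public_cloud": 0.01872, "private": 0.00410}}

shares = infrastructure_share(summary["by_infrastructure"])
for bucket, share in shares.items():
    print(f"{bucket}: {share:.1%}")
```

For the example trace, roughly 82% of the attributed cost went through public providers and 18% through private infrastructure.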
Optimization Grades
After each trace completes, Olyx calculates three letter grades (A–F) for the execution. All three appear in
summary.grades; summary.optimization_grade is the overall grade.
Grades are meant to be operational signals, not moral judgments. An F does not mean your code crashed. It means the
trace was unusually expensive, unusually slow, or routed to a model that looks too heavy for the task.
Overall grade
The composite grade summarizes cost efficiency and latency compared to your project’s historical baseline. When there is no history yet, Olyx uses your configured baseline values when available. During the first few days of a project, treat the overall grade as directional until enough real traffic exists.
| Grade | What it means | What to do |
|---|---|---|
| A | Best-in-class cost and speed | Nothing — keep it up |
| B | Slightly above baseline | Monitor for trends |
| C | Noticeably over budget or slow | Consider a cheaper model |
| D | Significantly over threshold | Reconfigure routing |
| F | Critical — high cost or failure | Investigate immediately |
Grades calibrate against your Grading Baseline in Settings and then become more useful as real traces accumulate. Set a seed baseline on day one, then review it after you have enough production-like traffic.
Waste Grade
Waste grade answers one question:
Did I use a more expensive model than necessary?
Olyx calculates this by comparing the blended cost per 1k tokens that the trace actually incurred against the cost per 1k tokens of the cheapest model available in your project.
Waste is about model choice, not total dollars. A trace can cost only $0.002 and still get a poor waste grade if it
used a frontier model for a task that your configured small model could have handled. That makes waste grade useful
for high-volume endpoints, where tiny per-request mistakes compound quickly.
How to interpret it:
| Grade | What it means | What to do |
|---|---|---|
| A | Within 20% of the cheapest option | OPTIMAL — no action needed |
| B | Up to 2× more expensive | WATCH — review high-volume endpoints |
| C | Up to 4× more expensive | REVIEW — likely overusing mid-tier models |
| D | Up to 8× more expensive | MISCONFIGURED — routing needs attention |
| F | >8× more expensive | BLOCKED — frontier models used unnecessarily |
Example:
REQUEST -> "Summarize this paragraph"
MODEL USED -> gpt-4o
CHEAPEST AVAILABLE -> gpt-4o-mini
RESULT -> Waste Grade: F
Same output quality could have been achieved at a fraction of the cost.
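The thresholds in the table can be read as a ratio check. The sketch below is an assumption about how the bands fit together, not Olyx's grading code. In particular, treating each upper bound as inclusive is a guess, and the per-1k rates in the usage lines are illustrative placeholders rather than real provider prices.

```python
def waste_grade(actual_cost_per_1k, cheapest_cost_per_1k):
    """Map the cost ratio (actual / cheapest) onto the A-F bands in the table."""
    if cheapest_cost_per_1k <= 0:
        raise ValueError("cheapest_cost_per_1k must be positive")
    ratio = actual_cost_per_1k / cheapest_cost_per_1k
    # Upper bounds are assumed inclusive; the table does not say.
    if ratio <= 1.2:
        return "A"
    if ratio <= 2:
        return "B"
    if ratio <= 4:
        return "C"
    if ratio <= 8:
        return "D"
    return "F"


# Illustrative blended rates (dollars per 1k tokens), not real prices.
frontier_rate = 0.010
small_rate = 0.0006

print(waste_grade(frontier_rate, small_rate))  # ratio is about 16.7
print(waste_grade(small_rate, small_rate))     # ratio is exactly 1.0
```

Because the grade depends only on the ratio, a trace that costs a fraction of a cent can still land in F when a much cheaper configured model exists.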
Common Causes of High Waste
- No simple tier configured
- Fallbacks defaulting to expensive models
- Analyzer misclassifying simple tasks as complex
- Agent loops escalating to higher-tier models
How to Debug High Waste
Start with one bad trace, not a dashboard average:
- Open the trace and check summary.by_model.
- Find the expensive run step.
- Check the routing decision for the selected tier.
- Compare the selected model to the cheapest configured model that could handle the task.
- Adjust the model tier, prompt metadata, or fallback chain.
For example, if a summarization endpoint sends short support tickets to gpt-4o, add a cheaper model to the Simple
tier and make sure the endpoint metadata says intent: "summarization".
Latency Grade
Latency grade answers: How fast did this request complete, regardless of your setup?
This is an absolute measure, not relative to your project. It is based on model execution latency recorded in run steps. It is useful for spotting requests where the model path is too slow, even when the total cost is acceptable.
How to interpret it:
| Grade | Total latency | What it means |
|---|---|---|
| A | ≤ 800 ms | READY — real-time UX ready |
| B | ≤ 2,000 ms | ACCEPTABLE — works for most apps |
| C | ≤ 4,000 ms | NOTICEABLE — users may feel delay |
| D | ≤ 8,000 ms | POOR — degrades UX |
| F | > 8,000 ms | BLOCKED — broken or blocking flow |
MODEL -> gpt-4o
LATENCY -> 5.2s
RESULT -> Latency Grade: D
Even if cost is optimized, a D-grade latency will degrade user experience.
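The latency bands are absolute, so they reduce to a lookup. This is a minimal sketch of the table above, assuming latency is supplied in milliseconds; `latency_grade` and `LATENCY_BANDS` are hypothetical names, not Olyx APIs.

```python
# (upper bound in ms, grade), taken from the latency table.
LATENCY_BANDS = [(800, "A"), (2000, "B"), (4000, "C"), (8000, "D")]


def latency_grade(latency_ms):
    """Return the letter grade for a latency measured in milliseconds."""
    for upper_bound_ms, grade in LATENCY_BANDS:
        if latency_ms <= upper_bound_ms:
            return grade
    return "F"  # anything above 8,000 ms


print(latency_grade(5200))  # the 5.2 s example above falls in the D band
```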
Common causes of poor latency:
- Overuse of frontier models
- Long context windows
- Tool / agent chaining
- No streaming enabled
- Cold-start private models
How to Debug Poor Latency
Look at the slowest step first. A trace with one slow run step usually needs a faster model or streaming. A trace
with many medium-speed run steps usually needs fewer model turns, better tool design, or a tighter agent loop.
| Symptom | Likely cause | First fix |
|---|---|---|
| One slow run step | Heavy model or large context | Try a faster tier or reduce prompt size |
| Many run steps | Agent/tool loop | Add a max-iteration guard |
| Slow private model | Cold start or overloaded GPU | Warm the model or add capacity |
| Good model latency, poor UX | App/network work outside Olyx | Check your product-side timing separately |
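A max-iteration guard lives in your application code, around whatever function drives the agent loop. The sketch below shows one way to write it, with an optional cost ceiling alongside the iteration ceiling. Everything here is hypothetical: `call_model` stands in for your own model-calling function, and the result shape (`done`, `cost`, `output`) is an assumption made for illustration.

```python
class LoopBudgetExceeded(RuntimeError):
    """Raised when an agent loop hits its iteration or cost ceiling."""


def run_agent_loop(call_model, max_iterations=5, max_cost=0.05):
    """Call `call_model` until it reports done, or stop at a ceiling.

    `call_model` is a hypothetical callable that takes the 1-based iteration
    number and returns {"done": bool, "cost": float, "output": object}.
    """
    total_cost = 0.0
    for iteration in range(1, max_iterations + 1):
        result = call_model(iteration)
        total_cost += result["cost"]
        if result["done"]:
            return {
                "iterations": iteration,
                "total_cost": total_cost,
                "output": result["output"],
            }
        if total_cost > max_cost:
            raise LoopBudgetExceeded(
                f"cost cap exceeded after {iteration} iterations (${total_cost:.4f})"
            )
    raise LoopBudgetExceeded(
        f"no final answer after {max_iterations} iterations (${total_cost:.4f})"
    )


def stuck_model(iteration):
    # Simulates an agent that keeps requesting a tool and never finishes.
    return {"done": False, "cost": 0.004, "output": None}


try:
    run_agent_loop(stuck_model, max_iterations=5, max_cost=0.05)
except LoopBudgetExceeded as exc:
    print(f"stopped: {exc}")
```

Raising an exception, rather than returning a partial result, forces the caller to decide what the user sees when the loop is cut short.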
Putting It Together
These grades are meant to be read together:
| Waste Grade | Latency Grade | What it means | Suggested action |
|---|---|---|---|
| A | C | Cheap, but slow | TUNE — optimize model speed or enable streaming |
| F | A | Fast, but overpaying | ROUTING — add cheaper model tiers |
| B | B | Balanced, but not optimal | WATCH — review high-volume endpoints |
Trace Economics
Every trace detail response includes a fully attributed cost breakdown. Think of this as the cost object for one user-visible action.
{
"summary": {
"total_cost": 0.00545,
"optimization_grade": "B",
"grades": {
"overall": "B",
"waste": "A",
"latency": "B"
},
"by_model": {
"gpt-4o": 0.00545
},
"by_infrastructure": {
"public_cloud": 0.00545,
"private": 0.00000
}
}
}
Field Reference
| Field | Meaning | Use it for |
|---|---|---|
| summary.total_cost | Sum of all recorded step costs | Per-request cost |
| summary.optimization_grade | Composite grade | Fast triage |
| summary.grades.waste | Model-selection efficiency | Routing cleanup |
| summary.grades.latency | Model execution speed | UX and performance work |
| summary.by_model | Cost grouped by model identifier | Finding expensive models |
| summary.by_infrastructure | Cost grouped into public/private buckets | Public-vs-private spend |
| summary.revenue | Optional revenue supplied by your app | Unit economics |
| summary.gross_margin | Revenue minus cost | Margin analysis |
How to Use the Summary
- Identify expensive models.
"by_model": {
"gpt-4o": 0.00545
}
Route simpler requests to cheaper alternatives.
- Detect infrastructure concentration risk.
"by_infrastructure": {
"public_cloud": 0.00545,
"private": 0.00000
}
Add fallback or private models if all sensitive or high-volume traffic is concentrated in one bucket.
- Track optimization over time.
Use the list endpoint for a recent sample and the detail endpoint when you need step-level evidence:
GET /api/v1/traces?per_page=100
Authorization: Bearer ak_<key_id>.<secret>
GET /api/v1/traces/:id
Authorization: Bearer ak_<key_id>.<secret>
In closed beta, keep reporting jobs simple: fetch recent traces, ignore traces that are still pending, and group the
remaining rows by grade, model, feature, or tenant metadata.
Monitor
Track these three signals over time to know whether your routing configuration is improving:
| Signal | Healthy direction | Stale warning |
|---|---|---|
| Waste grade distribution | Fewer D/F grades week-over-week | Average grade unchanged after routing changes |
| Latency grade distribution | Shifting toward A/B over time | p95 latency rising despite no model-tier change |
| Cost per request | Flat or decreasing as volume scales | Rising cost-per-request without added complexity |
Fetch a recent window of traces and check the distribution yourself:
# Ruby — sample recent trace summaries
response = client.traces.list(per_page: 100)
completed = Array(response["data"]).select { |trace| trace["status"] == "completed" }
waste_dist = completed
.map { |trace| trace.dig("grades", "waste") }
.compact
.tally
# => { "A" => 42, "B" => 31, "C" => 18, "D" => 6, "F" => 3 }
# Python — sample recent trace summaries
from collections import Counter
traces = client.traces.list(per_page=100)
completed = [trace for trace in traces if trace.status == "completed"]
waste_grades = [
trace.raw.get("grades", {}).get("waste")
for trace in completed
if trace.raw.get("grades", {}).get("waste")
]
waste_dist = Counter(waste_grades)
# Counter({'A': 42, 'B': 31, 'C': 18, 'D': 6, 'F': 3})
A healthy project shows the A/B share growing as a proportion of total requests over weeks. If D/F grades dominate after you’ve configured routing tiers, the analyzer may be misclassifying request complexity — check your Grading Baseline in Settings and confirm your model tiers cover both simple and complex tasks.
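The cost-per-request signal from the table above can be sampled the same way. This sketch works on plain dictionaries rather than an SDK object. The row shape (`status`, `total_cost`, `metadata.feature`) is an assumption about what a list response contains, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def cost_per_request_by_feature(rows):
    """Average total_cost per completed trace, grouped by metadata.feature."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row.get("status") != "completed":
            continue  # skip traces that are still pending
        feature = (row.get("metadata") or {}).get("feature", "unknown")
        totals[feature] += row.get("total_cost", 0.0)
        counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Made-up sample rows; real rows would come from the trace list endpoint.
rows = [
    {"status": "completed", "total_cost": 0.003, "metadata": {"feature": "support_summarizer"}},
    {"status": "completed", "total_cost": 0.005, "metadata": {"feature": "support_summarizer"}},
    {"status": "pending", "total_cost": 0.0, "metadata": {"feature": "support_summarizer"}},
    {"status": "completed", "total_cost": 0.012, "metadata": {"feature": "chat"}},
]

averages = cost_per_request_by_feature(rows)
for feature, average in sorted(averages.items()):
    print(f"{feature}: ${average:.4f} per request")
```

Grouping by a metadata key only works if that key was attached when the trace was created, which is the reason to set feature or tenant metadata up front.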
Request Margin
If you charge users per request, Olyx can compute margin per trace. Pass the revenue you earned when you create the
trace. Olyx stores that value and calculates gross_margin when you inspect the completed trace.
Use whatever revenue unit matches your product:
| Product model | Revenue value to attach |
|---|---|
| Per-request billing | The exact amount charged for this request |
| Usage credits | The dollar value of credits consumed |
| Subscription plan | An allocated per-action value, or omit revenue until you have an allocation model |
| Internal tool | An internal chargeback value, if your team uses one |
Attach revenue when you create the trace:
POST /api/v1/traces
Authorization: Bearer ak_<key_id>.<secret>
Content-Type: application/json
{
"revenue": 0.50,
"metadata": {
"user_id": "user_123"
}
}
Response:
{
"id": "tr_abc123",
"status": "pending",
"created_at": "2026-05-14T12:00:00Z",
"metadata": {
"user_id": "user_123"
},
"revenue": 0.50
}
After the model call completes, complete the trace. The complete response includes the computed summary, so common request-margin flows do not need an extra read.
PATCH /api/v1/traces/tr_abc123/complete
Authorization: Bearer ak_<key_id>.<secret>
Response:
{
"id": "tr_abc123",
"status": "completed",
"total_cost": 0.00318,
"summary": {
"total_cost": 0.00318,
"revenue": 0.50,
"gross_margin": 0.49682,
"grades": {
"overall": "B",
"waste": "A",
"latency": "B"
}
}
}
Use GET /api/v1/traces/:id when you also need the full step list, graph, or routing decision.
Gross margin is intentionally simple:
Gross Margin = Revenue - Total Cost
It does not include salaries, support time, taxes, payment processing fees, or your provider’s final invoice adjustments. Use it as a request-level signal, not a complete finance ledger.
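If you want to check the margin on your side, or express it as a percentage against a target, the calculation is short. This is a sketch using the values from the completion response above. `margin_percent` is a derived figure added here for illustration, not a field Olyx returns.

```python
from decimal import Decimal


def gross_margin(revenue, total_cost):
    """Gross Margin = Revenue - Total Cost."""
    return revenue - total_cost


def margin_percent(revenue, total_cost):
    """Gross margin as a percentage of revenue, or None when revenue is zero."""
    if revenue == 0:
        return None
    return gross_margin(revenue, total_cost) / revenue * 100


# Values from the example completion response.
revenue = Decimal("0.50")
total_cost = Decimal("0.00318")

margin = gross_margin(revenue, total_cost)
percent = margin_percent(revenue, total_cost)
print(f"gross margin: ${margin}")
print(f"margin: {percent:.2f}%")
```

Returning None for zero revenue keeps traces without an attached revenue value from showing up as a misleading 0% or a division error in reports.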
Why This Matters
Cost intelligence shifts the focus from managing a budget to managing unit economics. This granularity changes your engineering and product strategy in three critical ways:
- Per-request profitability: When you pass revenue at trace creation, Olyx calculates the gross margin of every individual user interaction. You can identify exactly which user segments or specific features are subsidizing others.
- Dynamic model routing: You can implement logic that optimizes for margin. If a request is low-priority or has low revenue potential, the system can automatically route it to a more cost-effective model without manual intervention.
- Agent behavior control: Autonomous agents can inadvertently enter “thought loops” that consume thousands of tokens in seconds. Cost intelligence acts as a circuit breaker, detecting these patterns and terminating the trace before it hits a budget cap.
Common Patterns in Practice
| Pattern | Operational Signal | Decision Trigger |
|---|---|---|
| SaaS Pricing Validation | Margin falls below target (e.g., < 70%). | Adjust subscription tiers or switch to a more efficient model. |
| Detect Runaway Agents | Trace shows high cost frequency with low output variation. | Trigger a loop-detection event and alert the developer. |
| Optimize Tier Routing | Waste grade is poor (D or F) on simple tasks. | Reconfigure routing tiers so small tasks do not hit the most expensive models. |
The Mental Model
Olyx standardizes the lifecycle of an AI request into a linear economic flow:
INPUT -> tokens, intent, metadata
EXECUTION -> model, provider, infrastructure path
COST -> configured rates multiplied by usage
MARGIN -> optional revenue minus total cost
TRACE -> audit record for debugging and optimization
Instead of viewing LLM spend as an unpredictable overhead, it becomes a line item with a measurable Return on Investment (ROI). You are no longer guessing whether a specific model upgrade was “worth it”; the data shows the impact on your bottom line immediately.
The Key Shift
- Without Olyx: LLM costs are a black box that you discover at the end of the month when the provider invoice arrives.
- With Olyx: Every governed request has a trace, a cost, a latency profile, and a routing decision you can inspect.