Cost Intelligence

Cost intelligence turns every governed AI request into a small cost ledger. Each trace records what model ran, how many tokens it used, how long it took, what it cost, and whether that cost made sense for the work being done.

This is not a billing processor and it does not charge your customers. During closed beta, treat Olyx cost data as an operational estimate based on the model rates you configure, similar to how you would use observability data to understand infrastructure spend before reconciling the final provider invoice.

Cost tracking is only accurate after you define token rates in the Model Registry. On every model call, Olyx multiplies input and output tokens by the rates you configured, then sums those figures across all steps in the trace. That gives you spend at the step, trace, model, project, and infrastructure levels without adding separate instrumentation to every call site.

The Cost Objects

If you are new to LLM cost tracking, start with these objects. They are the cost equivalent of a payment ledger: small, explicit records that can be inspected after the request finishes.

Object	What it means	Where you see it
Model rate	The input and output token price for one model. You define it.	Model Registry
Step	One unit of work inside a trace, usually validation, routing, model execution, or tool handling.	`steps[]`
Run step	A step that actually calls a model and can carry token cost.	`steps[].cost`
Trace	One user-visible action, such as “summarize this PDF” or “answer this chat turn.”	`summary.total_cost`
Revenue	Optional amount you earned for the request, supplied by your app.	`summary.revenue`
Gross margin	`revenue - total_cost`; useful for unit economics, not invoicing.	`summary.gross_margin`
Grade	A letter signal that compares cost and latency to expected behavior.	`summary.grades`

The important rule: cost belongs to traces, not pages or users by default. Attach metadata such as user_id, feature, plan, or tenant_id when you create the trace if you want to roll costs up later.

Setup Flow

For a first integration, wire costs in this order:

Register each model in the Model Registry.
Add input and output token rates for each model you want to track.
Create a trace for the user action.
Execute one or more model calls against that trace.
Complete the trace so Olyx can calculate grades.
Read summary.total_cost, summary.grades, and summary.by_model from the completion response.

const trace = await client.traces.create({
  metadata: {
    userId: "u_123",
    feature: "support_summarizer",
    plan: "team",
  },
  revenue: 0.20,
});

const result = await client.execute({
  traceId: trace.data.id,
  input: "Summarize this support ticket for the account manager.",
});

const completion = await client.traces.complete(trace.data.id);

console.log(completion.data.totalCost);

The trace is the durable record. The model response is useful to your product; the trace is useful to your operators, finance lead, and future routing policy.

How Costs Are Calculated

In modern AI applications, a single user request rarely equals a single model call. Requests often fan out—triggering tool loops, fallback routing, or multi-agent evaluations.

To provide an accurate financial audit trail, cost is calculated dynamically per step (every individual model execution) and then aggregated at the trace level (the overall user action).

Step Cost = ((Input Tokens ÷ 1,000) × Input Rate) + ((Output Tokens ÷ 1,000) × Output Rate)

Trace Total = Sum of all step costs within the trace

This bottom-up calculation gives you a precise breakdown, even when a single trace spans multiple models with different pricing tiers.

What Counts as Cost

Not every step costs money. Safety checks and routing decisions usually cost 0 because they run inside Olyx. Model execution steps carry cost because they consume provider tokens or private inference capacity.

Step type	Usually has cost?	Why
`check`	No	PII and policy validation happen before model execution.
`route`	No	Olyx chooses a model tier and records the decision.
`run`	Yes	A provider or private model produced output.
`tool_call`	Usually no	The model requested a tool; your app still decides how to execute it.
`tool_result`	Usually no	Your app returned tool output back into the trace.

If a trace has three run steps, the trace can have three separate costs. This is why Olyx reports both steps[].cost and summary.total_cost.

Example: Multi-Step Trace

Imagine a user asks a complex question that requires the model to first call a database tool, and then summarize the result. This creates a single trace with two distinct run steps on a model (rates: $0.005 per 1k input / $0.015 per 1k output).

Step 1 (Tool Call): Uses 800 input tokens and 200 output tokens.
Step 2 (Final Summary): Uses 400 input tokens and 100 output tokens.

Step 1:  (800 / 1000 × $0.005) + (200 / 1000 × $0.015)  = $0.0040 + $0.0030 = $0.0070
Step 2:  (400 / 1000 × $0.005) + (100 / 1000 × $0.015)  = $0.0020 + $0.0015 = $0.0035
                                                                    Total   = $0.0105

Both individual step costs are preserved in the API response under steps[].cost, while the $0.0105 sum lands in the trace’s summary.total_cost.

Why Step-Level Granularity Matters:

Tool Calling Loops: If an agent gets stuck in a loop calling the same tool five times, step-level tracking isolates the exact cost of that loop, rather than hiding it in a blended trace total.
Fallback Routing: If a request attempts a cheaper model, then falls back to a larger model, each successful run step is recorded separately. If a provider charges for partial failures, your provider invoice remains the final source of truth; Olyx records the cost data it receives or calculates from completed steps.
Prompt Caching: Many providers offer discounts for cached input tokens. Step-level accounting makes those discounts visible when token usage or model rates reflect cached-input pricing.

Infrastructure Breakdown

On the dashboard, costs are split by infrastructure type so you can compare public API spend against private inference capacity. This helps answer a practical question: “Are we paying a provider for this request, or did it run on our own infrastructure?”

Bucket	Examples	How Olyx classifies it
`public_cloud`	OpenAI, Anthropic, Mistral, Bedrock	Registered public provider model
`private`	Ollama, vLLM, LM Studio, internal OpenAI-compatible gateway	Registered internal/private model

The split depends on how you register the model. If a private vLLM endpoint is registered as a public provider, the cost report will treat it like public spend. Keep the Model Registry accurate so cost reports stay useful.

Key Differentiators

Public Cloud Monitoring: Tracks direct API spend and usage patterns across third-party providers.
Private/VPC Tracking: Shows how much traffic is going through private inference paths, including agent-backed routes.
Internal Chargebacks: For self-hosted models, you can define internal rates in the Model Registry. These rates can represent GPU amortization, team chargeback, or an estimated per-token operating cost.
Benchmarking: Once public and private models both have rates, you can compare whether a private route is actually cheaper for a specific workload.
Private Gateway Validation: For VPC-native deployments, use the outbound Olyx Agent path so private endpoints stay private and sensitive workloads fail closed when no private route is available.

Example Infrastructure Summary

{
  "summary": {
    "by_infrastructure": {
      "public_cloud": 0.01872,
      "private": 0.00410
    }
  }
}

This means the trace spent $0.01872 through public providers and $0.00410 through private infrastructure. It does not mean Olyx charged either amount; it means the trace ledger attributed those operational costs to each bucket.

Optimization Grades

After each trace completes, Olyx calculates three letter grades (A–F) for the execution. All three appear in summary.grades; summary.optimization_grade is the overall grade.

Grades are meant to be operational signals, not moral judgments. An F does not mean your code crashed. It means the trace was unusually expensive, unusually slow, or routed to a model that looks too heavy for the task.

Overall grade

The composite grade summarizes cost efficiency and latency compared to your project’s historical baseline. When there is no history yet, Olyx uses your configured baseline values when available. During the first few days of a project, treat the overall grade as directional until enough real traffic exists.

Grade	What it means	What to do
A	Best-in-class cost and speed	Nothing — keep it up
B	Slightly above baseline	Monitor for trends
C	Noticeably over budget or slow	Consider a cheaper model
D	Significantly over threshold	Reconfigure routing
F	Critical — high cost or failure	Investigate immediately

Grades calibrate against your Grading Baseline in Settings and then become more useful as real traces accumulate. Set a seed baseline on day one, then review it after you have enough production-like traffic.

Waste Grade

Waste grade answers one question:

Did I use a more expensive model than necessary?

Olyx calculates this by comparing the actual blended cost-per-1k tokens used against the cheapest model available in your project.

Waste is about model choice, not total dollars. A trace can cost only $0.002 and still get a poor waste grade if it used a frontier model for a task that your configured small model could have handled. That makes waste grade useful for high-volume endpoints, where tiny per-request mistakes compound quickly.

How to interpret it:

Grade	What it means	What to do
A	Within 20% of the cheapest option	`OPTIMAL` — no action needed
B	Up to 2× more expensive	`WATCH` — review high-volume endpoints
C	Up to 4× more expensive	`REVIEW` — likely overusing mid-tier models
D	Up to 8× more expensive	`MISCONFIGURED` — routing needs attention
F	>8× more expensive	`BLOCKED` — frontier models used unnecessarily

Example:

REQUEST -> "Summarize this paragraph"
MODEL USED -> gpt-4o
CHEAPEST AVAILABLE -> gpt-4o-mini

RESULT -> Waste Grade: F

Same output quality could have been achieved at a fraction of the cost.

Common Causes of High Waste

No simple tier configured
Fallbacks defaulting to expensive models
Analyzer misclassifying simple tasks as complex
Agent loops escalating to higher-tier models

How to Debug High Waste

Start with one bad trace, not a dashboard average:

Open the trace and check summary.by_model.
Find the expensive run step.
Check the routing decision for the selected tier.
Compare the selected model to the cheapest configured model that could handle the task.
Adjust the model tier, prompt metadata, or fallback chain.

For example, if a summarization endpoint sends short support tickets to gpt-4o, add a cheaper model to the Simple tier and make sure the endpoint metadata says intent: "summarization".

Latency Grade

Latency grade answers: How fast did this request complete, regardless of your setup?

This is an absolute measure, not relative to your project. It is based on model execution latency recorded in run steps. It is useful for spotting requests where the model path is too slow, even when the total cost is acceptable.

How to interpret it:

Grade	Total latency	What it means
A	≤ 800 ms	`READY` — real-time UX ready
B	≤ 2,000 ms	`ACCEPTABLE` — works for most apps
C	≤ 4,000 ms	`NOTICEABLE` — users may feel delay
D	≤ 8,000 ms	`POOR` — degrades UX
F	> 8,000 ms	`BLOCKED` — broken or blocking flow

MODEL   -> gpt-4o
LATENCY -> 5.2s

RESULT  -> Latency Grade: D

Even if cost is optimized, a D-grade latency will degrade user experience.

Common causes of poor latency:

Overuse of frontier models
Long context windows
Tool / agent chaining
No streaming enabled
Cold-start private models

How to Debug Poor Latency

Look at the slowest step first. A trace with one slow run step usually needs a faster model or streaming. A trace with many medium-speed run steps usually needs fewer model turns, better tool design, or a tighter agent loop.

Symptom	Likely cause	First fix
One slow `run` step	Heavy model or large context	Try a faster tier or reduce prompt size
Many `run` steps	Agent/tool loop	Add a max-iteration guard
Slow private model	Cold start or overloaded GPU	Warm the model or add capacity
Good model latency, poor UX	App/network work outside Olyx	Check your product-side timing separately

Putting It Together

These grades are meant to be read together:

Waste Grade	Latency Grade	What it means	Suggested action
A	C	Cheap, but slow	`TUNE` — optimize model speed or enable streaming
F	A	Fast, but overpaying	`ROUTING` — add cheaper model tiers
B	B	Balanced, but not optimal	`WATCH` — review high-volume endpoints

Trace Economics

Every trace detail response includes a fully attributed cost breakdown. Think of this as the cost object for one user-visible action.

{
  "summary": {
    "total_cost": 0.00545,
    "optimization_grade": "B",
    "grades": {
      "overall":  "B",
      "waste":    "A",
      "latency":  "B"
    },
    "by_model": {
      "gpt-4o": 0.00545
    },
    "by_infrastructure": {
      "public_cloud": 0.00545,
      "private": 0.00000
    }
  }
}

Field Reference

Field	Meaning	Use it for
`summary.total_cost`	Sum of all recorded step costs	Per-request cost
`summary.optimization_grade`	Composite grade	Fast triage
`summary.grades.waste`	Model-selection efficiency	Routing cleanup
`summary.grades.latency`	Model execution speed	UX and performance work
`summary.by_model`	Cost grouped by model identifier	Finding expensive models
`summary.by_infrastructure`	Cost grouped into public/private buckets	Public-vs-private spend
`summary.revenue`	Optional revenue supplied by your app	Unit economics
`summary.gross_margin`	Revenue minus cost	Margin analysis

How to Use the Summary

Identify expensive models.

"by_model": {
  "gpt-4o": 0.00545
}

Route simpler requests to cheaper alternatives.

Detect infrastructure concentration risk.

"by_infrastructure": {
  "public_cloud": 0.00545,
  "private": 0.00000
}

Add fallback or private models if all sensitive or high-volume traffic is concentrated in one bucket.

Track optimization over time.

Use the list endpoint for a recent sample and the detail endpoint when you need step-level evidence:

GET /api/v1/traces?per_page=100
Authorization: Bearer ak_<key_id>.<secret>

GET /api/v1/traces/:id
Authorization: Bearer ak_<key_id>.<secret>

In closed beta, keep reporting jobs simple: fetch recent traces, ignore traces that are still pending, and group the remaining rows by grade, model, feature, or tenant metadata.

Monitor

Track these three signals over time to know whether your routing configuration is improving:

Signal	Healthy direction	Stale warning
Waste grade distribution	Fewer D/F grades week-over-week	Average grade unchanged after routing changes
Latency grade distribution	Shifting toward A/B over time	p95 latency rising despite no model-tier change
Cost per request	Flat or decreasing as volume scales	Rising cost-per-request without added complexity

Fetch a recent window of traces and check the distribution yourself:

# Ruby — sample recent trace summaries
response = client.traces.list(per_page: 100)
completed = Array(response["data"]).select { |trace| trace["status"] == "completed" }

waste_dist = completed
  .map { |trace| trace.dig("grades", "waste") }
  .compact
  .tally
# => { "A" => 42, "B" => 31, "C" => 18, "D" => 6, "F" => 3 }

# Python — sample recent trace summaries
from collections import Counter

traces = client.traces.list(per_page=100)
completed = [trace for trace in traces if trace.status == "completed"]

waste_grades = [
    trace.raw.get("grades", {}).get("waste")
    for trace in completed
    if trace.raw.get("grades", {}).get("waste")
]
waste_dist = Counter(waste_grades)
# Counter({'A': 42, 'B': 31, 'C': 18, 'D': 6, 'F': 3})

A healthy project shows the A/B share growing as a proportion of total requests over weeks. If D/F grades dominate after you’ve configured routing tiers, the analyzer may be misclassifying request complexity — check your Grading Baseline in Settings and confirm your model tiers cover both simple and complex tasks.

Request Margin

If you charge users per request, Olyx can compute margin per trace. Pass the revenue you earned when you create the trace. Olyx stores that value and calculates gross_margin when you inspect the completed trace.

Use whatever revenue unit matches your product:

Product model	Revenue value to attach
Per-request billing	The exact amount charged for this request
Usage credits	The dollar value of credits consumed
Subscription plan	An allocated per-action value, or omit revenue until you have an allocation model
Internal tool	An internal chargeback value, if your team uses one

Attach revenue at execution time:

POST /api/v1/traces
Authorization: Bearer ak_<key_id>.<secret>
Content-Type: application/json

{
  "revenue": 0.50,
  "metadata": {
    "user_id": "user_123"
  }
}

Response:

{
  "id": "tr_abc123",
  "status": "pending",
  "created_at": "2026-05-14T12:00:00Z",
  "metadata": {
    "user_id": "user_123"
  },
  "revenue": 0.50
}

After the model call completes, complete the trace. The complete response includes the computed summary, so common request-margin flows do not need an extra read.

PATCH /api/v1/traces/tr_abc123/complete
Authorization: Bearer ak_<key_id>.<secret>

{
  "id": "tr_abc123",
  "status": "completed",
  "total_cost": 0.00318,
  "summary": {
    "total_cost":   0.00318,
    "revenue":      0.50,
    "gross_margin": 0.49682,
    "grades": {
      "overall": "B",
      "waste": "A",
      "latency": "B"
    }
  }
}

Use GET /api/v1/traces/:id when you also need the full step list, graph, or routing decision.

Gross margin is intentionally simple:

Gross Margin = Revenue - Total Cost

It does not include salaries, support time, taxes, payment processing fees, or your provider’s final invoice adjustments. Use it as a request-level signal, not a complete finance ledger.

Why This Matters

Cost intelligence shifts the focus from managing a budget to managing unit economics. This granularity changes your engineering and product strategy in three critical ways:

Per-request profitability: By passing revenue in your trace metadata, Olyx calculates the gross margin of every individual user interaction. You can identify exactly which user segments or specific features are subsidizing others.
Dynamic model routing: You can implement logic that optimizes for margin. If a request is low-priority or has low revenue potential, the system can automatically route it to a more cost-effective model without manual intervention.
Agent behavior control: Autonomous agents can inadvertently enter “thought loops” that consume thousands of tokens in seconds. Cost intelligence acts as a circuit breaker, detecting these patterns and terminating the trace before it hits a budget cap.

Common Patterns in Practice

Pattern	Operational Signal	Decision Trigger
SaaS Pricing Validation	Margin falls below target (e.g., < 70%).	Adjust subscription tiers or switch to a more efficient model.
Detect Runaway Agents	Trace shows high cost frequency with low output variation.	Trigger a loop-detection event and alert the developer.
Optimize Tier Routing	”Waste Grade” is high on simple tasks.	Reconfigure routing tiers so small tasks do not hit the most expensive models.

The Mental Model

Olyx standardizes the lifecycle of an AI request into a linear economic flow:

INPUT        -> tokens, intent, metadata
EXECUTION    -> model, provider, infrastructure path
COST         -> configured rates multiplied by usage
MARGIN       -> optional revenue minus total cost
TRACE        -> audit record for debugging and optimization

Instead of viewing LLM spend as an unpredictable overhead, it becomes a line item with a measurable Return on Investment (ROI). You are no longer guessing whether a specific model upgrade was “worth it”; the data shows the impact on your bottom line immediately.

The Key Shift

Without Olyx: LLM costs are a black box that you discover at the end of the month when the provider invoice arrives.
With Olyx: Every governed request has a trace, a cost, a latency profile, and a routing decision you can inspect.