Cost Intelligence

Cost intelligence turns every governed AI request into a small cost ledger. Each trace records what model ran, how many tokens it used, how long it took, what it cost, and whether that cost made sense for the work being done.

This is not a billing processor and it does not charge your customers. During closed beta, treat Olyx cost data as an operational estimate based on the model rates you configure, similar to how you would use observability data to understand infrastructure spend before reconciling the final provider invoice.

Cost tracking is only accurate after you define token rates in the Model Registry. On every model call, Olyx multiplies input and output tokens by the rates you configured, then sums those figures across all steps in the trace. That gives you spend at the step, trace, model, project, and infrastructure levels without adding separate instrumentation to every call site.

The Cost Objects

If you are new to LLM cost tracking, start with these objects. They are the cost equivalent of a payment ledger: small, explicit records that can be inspected after the request finishes.

ObjectWhat it meansWhere you see it
Model rateThe input and output token price for one model. You define it.Model Registry
StepOne unit of work inside a trace, usually validation, routing, model execution, or tool handling.steps[]
Run stepA step that actually calls a model and can carry token cost.steps[].cost
TraceOne user-visible action, such as “summarize this PDF” or “answer this chat turn.”summary.total_cost
RevenueOptional amount you earned for the request, supplied by your app.summary.revenue
Gross marginrevenue - total_cost; useful for unit economics, not invoicing.summary.gross_margin
GradeA letter signal that compares cost and latency to expected behavior.summary.grades

The important rule: cost belongs to traces, not pages or users by default. Attach metadata such as user_id, feature, plan, or tenant_id when you create the trace if you want to roll costs up later.

Setup Flow

For a first integration, wire costs in this order:

  1. Register each model in the Model Registry.
  2. Add input and output token rates for each model you want to track.
  3. Create a trace for the user action.
  4. Execute one or more model calls against that trace.
  5. Complete the trace so Olyx can calculate grades.
  6. Read summary.total_cost, summary.grades, and summary.by_model from the completion response.
const trace = await client.traces.create({
  metadata: {
    userId: "u_123",
    feature: "support_summarizer",
    plan: "team",
  },
  revenue: 0.20,
});

const result = await client.execute({
  traceId: trace.data.id,
  input: "Summarize this support ticket for the account manager.",
});

const completion = await client.traces.complete(trace.data.id);

console.log(completion.data.totalCost);

The trace is the durable record. The model response is useful to your product; the trace is useful to your operators, finance lead, and future routing policy.


How Costs Are Calculated

In modern AI applications, a single user request rarely equals a single model call. Requests often fan out—triggering tool loops, fallback routing, or multi-agent evaluations.

To provide an accurate financial audit trail, cost is calculated dynamically per step (every individual model execution) and then aggregated at the trace level (the overall user action).

Step Cost = ((Input Tokens ÷ 1,000) × Input Rate) + ((Output Tokens ÷ 1,000) × Output Rate)

Trace Total = Sum of all step costs within the trace

This bottom-up calculation gives you a precise breakdown, even when a single trace spans multiple models with different pricing tiers.

What Counts as Cost

Not every step costs money. Safety checks and routing decisions usually cost 0 because they run inside Olyx. Model execution steps carry cost because they consume provider tokens or private inference capacity.

Step typeUsually has cost?Why
checkNoPII and policy validation happen before model execution.
routeNoOlyx chooses a model tier and records the decision.
runYesA provider or private model produced output.
tool_callUsually noThe model requested a tool; your app still decides how to execute it.
tool_resultUsually noYour app returned tool output back into the trace.

If a trace has three run steps, the trace can have three separate costs. This is why Olyx reports both steps[].cost and summary.total_cost.

Example: Multi-Step Trace

Imagine a user asks a complex question that requires the model to first call a database tool, and then summarize the result. This creates a single trace with two distinct run steps on a model (rates: $0.005 per 1k input / $0.015 per 1k output).

  • Step 1 (Tool Call): Uses 800 input tokens and 200 output tokens.
  • Step 2 (Final Summary): Uses 400 input tokens and 100 output tokens.
Step 1:  (800 / 1000 × $0.005) + (200 / 1000 × $0.015)  = $0.0040 + $0.0030 = $0.0070
Step 2:  (400 / 1000 × $0.005) + (100 / 1000 × $0.015)  = $0.0020 + $0.0015 = $0.0035
                                                                    Total   = $0.0105

Both individual step costs are preserved in the API response under steps[].cost, while the $0.0105 sum lands in the trace’s summary.total_cost.

Why Step-Level Granularity Matters:

  • Tool Calling Loops: If an agent gets stuck in a loop calling the same tool five times, step-level tracking isolates the exact cost of that loop, rather than hiding it in a blended trace total.

  • Fallback Routing: If a request attempts a cheaper model, then falls back to a larger model, each successful run step is recorded separately. If a provider charges for partial failures, your provider invoice remains the final source of truth; Olyx records the cost data it receives or calculates from completed steps.

  • Prompt Caching: Many providers offer discounts for cached input tokens. Step-level accounting makes those discounts visible when token usage or model rates reflect cached-input pricing.


Infrastructure Breakdown

On the dashboard, costs are split by infrastructure type so you can compare public API spend against private inference capacity. This helps answer a practical question: “Are we paying a provider for this request, or did it run on our own infrastructure?”

BucketExamplesHow Olyx classifies it
public_cloudOpenAI, Anthropic, Mistral, BedrockRegistered public provider model
privateOllama, vLLM, LM Studio, internal OpenAI-compatible gatewayRegistered internal/private model

The split depends on how you register the model. If a private vLLM endpoint is registered as a public provider, the cost report will treat it like public spend. Keep the Model Registry accurate so cost reports stay useful.

Key Differentiators

  • Public Cloud Monitoring: Tracks direct API spend and usage patterns across third-party providers.

  • Private/VPC Tracking: Shows how much traffic is going through private inference paths, including agent-backed routes.

  • Internal Chargebacks: For self-hosted models, you can define internal rates in the Model Registry. These rates can represent GPU amortization, team chargeback, or an estimated per-token operating cost.

  • Benchmarking: Once public and private models both have rates, you can compare whether a private route is actually cheaper for a specific workload.

  • Private Gateway Validation: For VPC-native deployments, use the outbound Olyx Agent path so private endpoints stay private and sensitive workloads fail closed when no private route is available.

Example Infrastructure Summary

{
  "summary": {
    "by_infrastructure": {
      "public_cloud": 0.01872,
      "private": 0.00410
    }
  }
}

This means the trace spent $0.01872 through public providers and $0.00410 through private infrastructure. It does not mean Olyx charged either amount; it means the trace ledger attributed those operational costs to each bucket.


Optimization Grades

After each trace completes, Olyx calculates three letter grades (A–F) for the execution. All three appear in summary.grades; summary.optimization_grade is the overall grade.

Grades are meant to be operational signals, not moral judgments. An F does not mean your code crashed. It means the trace was unusually expensive, unusually slow, or routed to a model that looks too heavy for the task.

Overall grade

The composite grade summarizes cost efficiency and latency compared to your project’s historical baseline. When there is no history yet, Olyx uses your configured baseline values when available. During the first few days of a project, treat the overall grade as directional until enough real traffic exists.

GradeWhat it meansWhat to do
ABest-in-class cost and speedNothing — keep it up
BSlightly above baselineMonitor for trends
CNoticeably over budget or slowConsider a cheaper model
DSignificantly over thresholdReconfigure routing
FCritical — high cost or failureInvestigate immediately

Grades calibrate against your Grading Baseline in Settings and then become more useful as real traces accumulate. Set a seed baseline on day one, then review it after you have enough production-like traffic.

Waste Grade

Waste grade answers one question:

Did I use a more expensive model than necessary?

Olyx calculates this by comparing the actual blended cost-per-1k tokens used against the cheapest model available in your project.

Waste is about model choice, not total dollars. A trace can cost only $0.002 and still get a poor waste grade if it used a frontier model for a task that your configured small model could have handled. That makes waste grade useful for high-volume endpoints, where tiny per-request mistakes compound quickly.

How to interpret it:

GradeWhat it meansWhat to do
AWithin 20% of the cheapest optionOPTIMAL — no action needed
BUp to 2× more expensiveWATCH — review high-volume endpoints
CUp to 4× more expensiveREVIEW — likely overusing mid-tier models
DUp to 8× more expensiveMISCONFIGURED — routing needs attention
F>8× more expensiveBLOCKED — frontier models used unnecessarily

Example:

REQUEST -> "Summarize this paragraph"
MODEL USED -> gpt-4o
CHEAPEST AVAILABLE -> gpt-4o-mini

RESULT -> Waste Grade: F

Same output quality could have been achieved at a fraction of the cost.

Common Causes of High Waste

  • No simple tier configured
  • Fallbacks defaulting to expensive models
  • Analyzer misclassifying simple tasks as complex
  • Agent loops escalating to higher-tier models

How to Debug High Waste

Start with one bad trace, not a dashboard average:

  1. Open the trace and check summary.by_model.
  2. Find the expensive run step.
  3. Check the routing decision for the selected tier.
  4. Compare the selected model to the cheapest configured model that could handle the task.
  5. Adjust the model tier, prompt metadata, or fallback chain.

For example, if a summarization endpoint sends short support tickets to gpt-4o, add a cheaper model to the Simple tier and make sure the endpoint metadata says intent: "summarization".

Latency Grade

Latency grade answers: How fast did this request complete, regardless of your setup?

This is an absolute measure, not relative to your project. It is based on model execution latency recorded in run steps. It is useful for spotting requests where the model path is too slow, even when the total cost is acceptable.

How to interpret it:

GradeTotal latencyWhat it means
A≤ 800 msREADY — real-time UX ready
B≤ 2,000 msACCEPTABLE — works for most apps
C≤ 4,000 msNOTICEABLE — users may feel delay
D≤ 8,000 msPOOR — degrades UX
F> 8,000 msBLOCKED — broken or blocking flow
MODEL   -> gpt-4o
LATENCY -> 5.2s

RESULT  -> Latency Grade: D

Even if cost is optimized, a D-grade latency will degrade user experience.

Common causes of poor latency:

  • Overuse of frontier models
  • Long context windows
  • Tool / agent chaining
  • No streaming enabled
  • Cold-start private models

How to Debug Poor Latency

Look at the slowest step first. A trace with one slow run step usually needs a faster model or streaming. A trace with many medium-speed run steps usually needs fewer model turns, better tool design, or a tighter agent loop.

SymptomLikely causeFirst fix
One slow run stepHeavy model or large contextTry a faster tier or reduce prompt size
Many run stepsAgent/tool loopAdd a max-iteration guard
Slow private modelCold start or overloaded GPUWarm the model or add capacity
Good model latency, poor UXApp/network work outside OlyxCheck your product-side timing separately

Putting It Together

These grades are meant to be read together:

Waste GradeLatency GradeWhat it meansSuggested action
ACCheap, but slowTUNE — optimize model speed or enable streaming
FAFast, but overpayingROUTING — add cheaper model tiers
BBBalanced, but not optimalWATCH — review high-volume endpoints

Trace Economics

Every trace detail response includes a fully attributed cost breakdown. Think of this as the cost object for one user-visible action.

{
  "summary": {
    "total_cost": 0.00545,
    "optimization_grade": "B",
    "grades": {
      "overall":  "B",
      "waste":    "A",
      "latency":  "B"
    },
    "by_model": {
      "gpt-4o": 0.00545
    },
    "by_infrastructure": {
      "public_cloud": 0.00545,
      "private": 0.00000
    }
  }
}

Field Reference

FieldMeaningUse it for
summary.total_costSum of all recorded step costsPer-request cost
summary.optimization_gradeComposite gradeFast triage
summary.grades.wasteModel-selection efficiencyRouting cleanup
summary.grades.latencyModel execution speedUX and performance work
summary.by_modelCost grouped by model identifierFinding expensive models
summary.by_infrastructureCost grouped into public/private bucketsPublic-vs-private spend
summary.revenueOptional revenue supplied by your appUnit economics
summary.gross_marginRevenue minus costMargin analysis

How to Use the Summary

  1. Identify expensive models.
"by_model": {
  "gpt-4o": 0.00545
}

Route simpler requests to cheaper alternatives.

  1. Detect infrastructure concentration risk.
"by_infrastructure": {
  "public_cloud": 0.00545,
  "private": 0.00000
}

Add fallback or private models if all sensitive or high-volume traffic is concentrated in one bucket.

  1. Track optimization over time.

Use the list endpoint for a recent sample and the detail endpoint when you need step-level evidence:

GET /api/v1/traces?per_page=100
Authorization: Bearer ak_<key_id>.<secret>
GET /api/v1/traces/:id
Authorization: Bearer ak_<key_id>.<secret>

In closed beta, keep reporting jobs simple: fetch recent traces, ignore traces that are still pending, and group the remaining rows by grade, model, feature, or tenant metadata.

Monitor

Track these three signals over time to know whether your routing configuration is improving:

SignalHealthy directionStale warning
Waste grade distributionFewer D/F grades week-over-weekAverage grade unchanged after routing changes
Latency grade distributionShifting toward A/B over timep95 latency rising despite no model-tier change
Cost per requestFlat or decreasing as volume scalesRising cost-per-request without added complexity

Fetch a recent window of traces and check the distribution yourself:

# Ruby — sample recent trace summaries
response = client.traces.list(per_page: 100)
completed = Array(response["data"]).select { |trace| trace["status"] == "completed" }

waste_dist = completed
  .map { |trace| trace.dig("grades", "waste") }
  .compact
  .tally
# => { "A" => 42, "B" => 31, "C" => 18, "D" => 6, "F" => 3 }
# Python — sample recent trace summaries
from collections import Counter

traces = client.traces.list(per_page=100)
completed = [trace for trace in traces if trace.status == "completed"]

waste_grades = [
    trace.raw.get("grades", {}).get("waste")
    for trace in completed
    if trace.raw.get("grades", {}).get("waste")
]
waste_dist = Counter(waste_grades)
# Counter({'A': 42, 'B': 31, 'C': 18, 'D': 6, 'F': 3})

A healthy project shows the A/B share growing as a proportion of total requests over weeks. If D/F grades dominate after you’ve configured routing tiers, the analyzer may be misclassifying request complexity — check your Grading Baseline in Settings and confirm your model tiers cover both simple and complex tasks.

Request Margin

If you charge users per request, Olyx can compute margin per trace. Pass the revenue you earned when you create the trace. Olyx stores that value and calculates gross_margin when you inspect the completed trace.

Use whatever revenue unit matches your product:

Product modelRevenue value to attach
Per-request billingThe exact amount charged for this request
Usage creditsThe dollar value of credits consumed
Subscription planAn allocated per-action value, or omit revenue until you have an allocation model
Internal toolAn internal chargeback value, if your team uses one

Attach revenue at execution time:

POST /api/v1/traces
Authorization: Bearer ak_<key_id>.<secret>
Content-Type: application/json

{
  "revenue": 0.50,
  "metadata": {
    "user_id": "user_123"
  }
}

Response:

{
  "id": "tr_abc123",
  "status": "pending",
  "created_at": "2026-05-14T12:00:00Z",
  "metadata": {
    "user_id": "user_123"
  },
  "revenue": 0.50
}

After the model call completes, complete the trace. The complete response includes the computed summary, so common request-margin flows do not need an extra read.

PATCH /api/v1/traces/tr_abc123/complete
Authorization: Bearer ak_<key_id>.<secret>
{
  "id": "tr_abc123",
  "status": "completed",
  "total_cost": 0.00318,
  "summary": {
    "total_cost":   0.00318,
    "revenue":      0.50,
    "gross_margin": 0.49682,
    "grades": {
      "overall": "B",
      "waste": "A",
      "latency": "B"
    }
  }
}

Use GET /api/v1/traces/:id when you also need the full step list, graph, or routing decision.

Gross margin is intentionally simple:

Gross Margin = Revenue - Total Cost

It does not include salaries, support time, taxes, payment processing fees, or your provider’s final invoice adjustments. Use it as a request-level signal, not a complete finance ledger.


Why This Matters

Cost intelligence shifts the focus from managing a budget to managing unit economics. This granularity changes your engineering and product strategy in three critical ways:

  • Per-request profitability: By passing revenue in your trace metadata, Olyx calculates the gross margin of every individual user interaction. You can identify exactly which user segments or specific features are subsidizing others.

  • Dynamic model routing: You can implement logic that optimizes for margin. If a request is low-priority or has low revenue potential, the system can automatically route it to a more cost-effective model without manual intervention.

  • Agent behavior control: Autonomous agents can inadvertently enter “thought loops” that consume thousands of tokens in seconds. Cost intelligence acts as a circuit breaker, detecting these patterns and terminating the trace before it hits a budget cap.


Common Patterns in Practice

PatternOperational SignalDecision Trigger
SaaS Pricing ValidationMargin falls below target (e.g., < 70%).Adjust subscription tiers or switch to a more efficient model.
Detect Runaway AgentsTrace shows high cost frequency with low output variation.Trigger a loop-detection event and alert the developer.
Optimize Tier Routing”Waste Grade” is high on simple tasks.Reconfigure routing tiers so small tasks do not hit the most expensive models.

The Mental Model

Olyx standardizes the lifecycle of an AI request into a linear economic flow:

INPUT        -> tokens, intent, metadata
EXECUTION    -> model, provider, infrastructure path
COST         -> configured rates multiplied by usage
MARGIN       -> optional revenue minus total cost
TRACE        -> audit record for debugging and optimization

Instead of viewing LLM spend as an unpredictable overhead, it becomes a line item with a measurable Return on Investment (ROI). You are no longer guessing whether a specific model upgrade was “worth it”; the data shows the impact on your bottom line immediately.

The Key Shift

  • Without Olyx: LLM costs are a black box that you discover at the end of the month when the provider invoice arrives.

  • With Olyx: Every governed request has a trace, a cost, a latency profile, and a routing decision you can inspect.

Was this page helpful?