Replays

A replay creates a new trace from an existing trace so you can compare an alternative model path against historical traffic. The source trace is never modified, and the replay links back through metadata.source_trace_id.

Use replays when you want to answer operational questions before changing production routing:

| Question | Replay helps you inspect |
| --- | --- |
| Could a cheaper model handle this workload? | Cost, latency, grades, and models used on the replay trace |
| Did a routing change improve behavior? | Source vs. replay summaries with the same source input |
| Which model should handle this class of task? | Multi-model comparison across registered model identifiers |
| Did a provider or model migration change outcomes? | New trace output and step behavior against the old source trace |
| Why did a trace become expensive? | Run-step cost, fanout behavior, and replayed model choices |

During closed beta, treat replay output as an operational comparison tool, not as a contractual quality benchmark or billing record. Multi-model comparisons simulate from the original run steps and configured model rates. Single-model replay can try live execution when the gateway allows it and can fall back to simulation if needed.

What Replays Do

Think of a replay as a source trace plus a new replay trace.

| Behavior | Source trace | Replay trace |
| --- | --- | --- |
| Trace identity | Original production trace | New trace with status: "replay" |
| Input | Historical input already captured on the source trace | Reused from the source trace |
| Safety check | Original recorded check result | Re-applied with the current guardrail engine before later steps continue |
| Model choice | Whatever ran originally | Overridden by force_model, simulated across compare_models, or copied when no override is provided |
| Cost and latency | Original recorded values | Recomputed or simulated for the replay path, subject to max_cost |
| Original data | Not modified | Linked by metadata.source_trace_id |

flowchart LR
  SOURCE[01 SOURCE TRACE]
  JOB[02 REPLAY JOB]
  GUARD[03 CURRENT GUARDRAILS]
  MODEL[04 MODEL PATH]
  COMPARE[05 COMPARISON]
  SOURCE -->|read input steps cost latency| JOB
  JOB -->|new replay trace| GUARD
  GUARD -->|allowed request| MODEL
  MODEL -->|forced copied or compared| COMPARE
  COMPARE -->|cost latency grades models| RESULT[REPLAY RESULT]

Replays are not a substitute for live production traffic or compliance re-certification. They help answer whether the alternative path is worth testing further.

Starting a Replay

Start a replay with POST /api/v1/replay. The minimum useful request contains the source trace_id; add force_model to replay with one replacement model, and max_cost to block run-like steps whose estimated cost exceeds a USD threshold.

POST /api/v1/replay
Authorization: Bearer ak_<key_id>.<secret>
Content-Type: application/json

{
  "trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "force_model": "gpt-4o-mini",
  "max_cost": 0.005
}
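If you are not using an SDK, the request above can be assembled with the Python standard library. This is a minimal sketch: BASE_URL is a placeholder (the host is not given on this page) and build_replay_request is a hypothetical helper name. It only builds the request object; it does not send it.

```python
import json
import urllib.request

# Assumption: placeholder host. Replace with the base URL of your deployment.
BASE_URL = "https://olyx.example.invalid"


def build_replay_request(api_key, trace_id, force_model=None, max_cost=None):
    """Build (but do not send) a POST /api/v1/replay request."""
    payload = {"trace_id": trace_id}
    # Only include the overrides the caller actually set.
    if force_model is not None:
        payload["force_model"] = force_model
    if max_cost is not None:
        payload["max_cost"] = max_cost
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/replay",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_replay_request(
    api_key="ak_<key_id>.<secret>",
    trace_id="550e8400-e29b-41d4-a716-446655440000",
    force_model="gpt-4o-mini",
    max_cost=0.005,
)
# Send with urllib.request.urlopen(request) once BASE_URL points at a real host.
```

Leaving unset overrides out of the payload keeps the request body identical to the example above when only force_model and max_cost are supplied.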

The endpoint is asynchronous. When no cached result exists, it queues a job and returns 202 Accepted; the important value is job_id, which your client stores temporarily and sends to the polling endpoint.

{
  "job_id": "a3f9c1d8e72b9f85dd40c123",
  "source_trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "overrides": {
    "force_model": "gpt-4o-mini",
    "max_cost": 0.005
  }
}

If the same trace and same overrides were completed recently, the endpoint can return 200 OK with the cached comparison immediately. Clients can treat that response as the final result and skip polling.

{
  "status": "completed",
  "source_trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "replay_trace_id": "replay_4a2c9f",
  "overrides": {
    "force_model": "gpt-4o-mini"
  },
  "comparison": {
    "source": {
      "total_cost": 0.012,
      "optimization_grade": "B",
      "grades": {
        "overall": "B",
        "waste": "C",
        "latency": "B"
      },
      "total_latency_ms": 1240.0,
      "models_used": ["gpt-4o"]
    },
    "replay": {
      "total_cost": 0.003,
      "optimization_grade": "A",
      "grades": {
        "overall": "A",
        "waste": "A",
        "latency": "A"
      },
      "total_latency_ms": 870.0,
      "models_used": ["gpt-4o-mini"]
    }
  }
}
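A raw HTTP client therefore has two success paths after the create call. The sketch below shows one way to branch on them; next_step is a hypothetical helper, and it assumes you already have the HTTP status code and the parsed JSON body.

```python
def next_step(http_status, body):
    """Decide what a client does after POST /api/v1/replay.

    Returns ("poll", job_id) for a queued job, or ("done", comparison)
    for a cached result that is already completed.
    """
    if http_status == 202:
        return ("poll", body["job_id"])
    if http_status == 200 and body.get("status") == "completed":
        return ("done", body["comparison"])
    raise ValueError(f"unexpected replay response: {http_status}")


queued = {"job_id": "a3f9c1d8e72b9f85dd40c123", "status": "queued"}
action, value = next_step(202, queued)  # ("poll", "a3f9c1d8e72b9f85dd40c123")
```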

Override Options

Use this table as the request contract for replay experiments. Start with one override, verify the comparison, then add cost guards or fanout behavior when you need tighter control.

| Field | Type | Description |
| --- | --- | --- |
| force_model | string | Replay run-like steps with one target model identifier. Use this for a side-by-side comparison. |
| compare_models | string array | Create one replay path per model and return a multi-model comparison. |
| force_models | string array | Override the model list for fanout-style steps. |
| max_cost | number | Block replay run-like steps whose estimated cost exceeds this USD value. |

force_model and compare_models are mutually exclusive. Use force_model when you want one replay trace you can inspect step by step. Use compare_models when you want a comparison table across several candidates.
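You can catch the most obvious payload mistakes before sending a request. The following is a client-side sketch only: check_overrides is a hypothetical helper that mirrors the field names and types in the table and the one mutual-exclusion rule stated above. The server remains the authority and may enforce rules not covered here.

```python
KNOWN_OVERRIDES = {"force_model", "compare_models", "force_models", "max_cost"}


def check_overrides(overrides):
    """Return a list of problems found in a replay override payload."""
    problems = [
        f"unknown override: {key}" for key in overrides if key not in KNOWN_OVERRIDES
    ]
    if "force_model" in overrides and "compare_models" in overrides:
        problems.append("force_model and compare_models are mutually exclusive")
    for field in ("compare_models", "force_models"):
        value = overrides.get(field)
        is_string_list = isinstance(value, list) and all(
            isinstance(item, str) for item in value
        )
        if value is not None and not is_string_list:
            problems.append(f"{field} must be an array of strings")
    max_cost = overrides.get("max_cost")
    # bool is a subclass of int in Python, so reject it explicitly.
    if max_cost is not None and (
        isinstance(max_cost, bool) or not isinstance(max_cost, (int, float))
    ):
        problems.append("max_cost must be a number")
    return problems


problems = check_overrides(
    {"force_model": "gpt-4o-mini", "compare_models": ["gpt-3.5-turbo"]}
)
```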

Error Responses

Replay errors are intentionally narrow: they either mean the source trace cannot be used or the override payload needs correction before a job can be queued.

| Code | Error | Cause |
| --- | --- | --- |
| 404 | Resource not found | trace_id does not exist or belongs to another customer. |
| 422 | invalid_override | Overrides are inconsistent, unknown, or invalid. |

For validation errors, the message field should be clear enough for developer tooling or logs to show what needs to change before the replay can be queued.

{
  "error": "invalid_override",
  "code": "invalid_override",
  "message": "force_model and compare_models are mutually exclusive"
}
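In tooling, the two create-time errors usually call for different next steps. The sketch below turns a status code and parsed body into a short hint for logs; describe_replay_error is a hypothetical helper, and the wording of the hints is illustrative rather than anything the API returns.

```python
def describe_replay_error(http_status, body):
    """Map a failed POST /api/v1/replay response to a hint for logs or tooling."""
    if http_status == 404:
        return "Source trace not found for this customer; check trace_id."
    if http_status == 422:
        # Prefer the human-readable message, fall back to the error code.
        detail = body.get("message") or body.get("error") or "invalid override"
        return f"Fix the override payload before retrying: {detail}"
    return f"Unexpected replay error (HTTP {http_status})"


hint = describe_replay_error(
    422,
    {
        "error": "invalid_override",
        "code": "invalid_override",
        "message": "force_model and compare_models are mutually exclusive",
    },
)
```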

Polling Job Status

Poll the returned job_id until the replay completes or fails.

flowchart LR
  CREATE[POST REPLAY]
  QUEUED[202 QUEUED]
  RUNNING[POLL RUNNING]
  DONE[COMPLETED COMPARISON]
  FAILED[FAILED ERROR]
  CACHED[200 CACHED RESULT]
  CREATE --> QUEUED
  QUEUED --> RUNNING
  RUNNING --> DONE
  RUNNING --> FAILED
  CREATE --> CACHED

Use the job_id from the create response in this request. Polling is read-only: it does not enqueue another replay and it does not change the source or replay trace.

GET /api/v1/replay/a3f9c1d8e72b9f85dd40c123
Authorization: Bearer ak_<key_id>.<secret>

Queued and running jobs return a small status object. Keep polling while the status is queued or running, with a short delay between attempts; the SDK examples on this page wait 1.5 seconds.

{
  "status": "running"
}

A completed single-model replay includes both the source summary and replay summary. comparison.source is the baseline from the original trace, while comparison.replay is the newly generated replay path.

{
  "status": "completed",
  "source_trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "replay_trace_id": "replay_4a2c9f",
  "overrides": {
    "force_model": "gpt-4o-mini"
  },
  "comparison": {
    "source": {
      "total_cost": 0.012,
      "optimization_grade": "B",
      "grades": {
        "overall": "B",
        "waste": "C",
        "latency": "B"
      },
      "total_latency_ms": 1240.0,
      "models_used": ["gpt-4o"]
    },
    "replay": {
      "total_cost": 0.003,
      "optimization_grade": "A",
      "grades": {
        "overall": "A",
        "waste": "A",
        "latency": "A"
      },
      "total_latency_ms": 870.0,
      "models_used": ["gpt-4o-mini"]
    }
  }
}

A completed multi-model comparison returns comparison.replays instead of a single comparison.replay object. Each item in that array represents one candidate model, which keeps the comparison table normalized.

{
  "status": "completed",
  "source_trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "overrides": {
    "compare_models": ["gpt-4o-mini", "gpt-3.5-turbo", "claude-haiku-4-5-20251001"]
  },
  "comparison": {
    "source": {
      "total_cost": 0.012,
      "optimization_grade": "B",
      "grades": {
        "overall": "B",
        "waste": "C",
        "latency": "B"
      },
      "total_latency_ms": 1240.0,
      "models_used": ["gpt-4o"]
    },
    "replays": [
      {
        "model": "gpt-4o-mini",
        "total_cost": 0.003,
        "optimization_grade": "A",
        "grades": {
          "overall": "A",
          "waste": "A",
          "latency": "A"
        },
        "total_latency_ms": 870.0,
        "models_used": ["gpt-4o-mini"]
      },
      {
        "model": "gpt-3.5-turbo",
        "total_cost": 0.001,
        "optimization_grade": "A",
        "grades": {
          "overall": "A",
          "waste": "A",
          "latency": "B"
        },
        "total_latency_ms": 1040.0,
        "models_used": ["gpt-3.5-turbo"]
      }
    ]
  }
}
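Because the two completed shapes differ only in replay versus replays, client code can normalize them into one list of rows. A minimal sketch, where replay_rows is a hypothetical helper operating on the parsed JSON result:

```python
def replay_rows(result):
    """Return replay summaries as a list for both response shapes.

    Multi-model results expose comparison.replays (a list); single-model
    results expose comparison.replay (one object, with no model field).
    """
    comparison = result.get("comparison") or {}
    if "replays" in comparison:
        return list(comparison["replays"])
    if "replay" in comparison:
        return [comparison["replay"]]
    return []


single = {"comparison": {"source": {}, "replay": {"total_cost": 0.003}}}
multi = {"comparison": {"source": {}, "replays": [{"model": "gpt-4o-mini"}]}}
rows = replay_rows(single) + replay_rows(multi)
```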

Failures are returned from the polling endpoint. The job has reached a terminal state, so clients should stop polling and surface the error value.

{
  "status": "failed",
  "error": "Source trace not found"
}

An unknown job id returns 404. This usually means the job id was typed incorrectly, expired from the short-lived job store, or belongs to a different project context.

{
  "error": "Job not found"
}

Single-Model Replay

Single-model replay is the normal workflow when you want to inspect the replay trace itself.

The TypeScript SDK wraps the create and polling endpoints. client.replays.create may return a completed cache hit immediately; otherwise client.replays.wait polls until the job is completed or failed.

import Olyx from "@olyx-labs/olyx";

const client = new Olyx({ apiKey: process.env.OLYX_API_KEY! });

const job = await client.replays.create({
  traceId: "550e8400-e29b-41d4-a716-446655440000",
  forceModel: "gpt-4o-mini",
  maxCost: 0.005,
});

const result = job.data.status === "completed"
  ? job
  : await client.replays.wait(job.data.jobId!);

console.log(result.data.comparison?.replay?.totalCost);
console.log(result.data.comparison?.replay?.optimizationGrade);

The Python SDK follows the same resource shape: create the replay job, use a completed cache hit immediately, or wait on the returned job_id when the backend queues work.

import os
import olyx

client = olyx.Client(api_key=os.environ["OLYX_API_KEY"])

job = client.replays.create(
    trace_id="550e8400-e29b-41d4-a716-446655440000",
    force_model="gpt-4o-mini",
    max_cost=0.005,
)

result = job if job.status == "completed" else client.replays.wait(job.job_id)

print(result.comparison["replay"]["total_cost"])
print(result.comparison["replay"]["optimization_grade"])

The Ruby SDK follows the same flow. Create the job, check whether it already completed, then wait on job_id only when the server queued asynchronous work.

job = client.replays.create(
  trace_id:    "550e8400-e29b-41d4-a716-446655440000",
  force_model: "gpt-4o-mini",
  max_cost:    0.005
)

result = job["status"] == "completed" ? job : client.replays.wait(job.job_id)

puts result.dig("comparison", "replay", "total_cost")
puts result.dig("comparison", "replay", "optimization_grade")

Multi-Model Comparison

Use compare_models when you want to benchmark the same trace against several registered models in one job. This is best for model migration work because the result gives you a normalized comparison across cost, latency, grades, and models used.

flowchart LR
  SOURCE[SOURCE TRACE]
  FANOUT[COMPARE MODELS]
  M1[GPT 4O MINI]
  M2[GPT 3 5 TURBO]
  M3[CLAUDE HAIKU]
  TABLE[COMPARISON TABLE]
  DECISION[ROUTING DECISION]
  SOURCE --> FANOUT
  FANOUT --> M1
  FANOUT --> M2
  FANOUT --> M3
  M1 --> TABLE
  M2 --> TABLE
  M3 --> TABLE
  TABLE -->|cost latency grades| DECISION

The raw HTTP request uses the same endpoint as single-model replay, but switches from force_model to compare_models. Use registered model identifiers here; the replay engine compares each candidate against the same source trace so the metrics line up.

POST /api/v1/replay
Authorization: Bearer ak_<key_id>.<secret>
Content-Type: application/json

{
  "trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "compare_models": ["gpt-4o-mini", "gpt-3.5-turbo", "claude-haiku-4-5-20251001"],
  "max_cost": 0.01
}

In SDK code, the comparison job is still asynchronous: create may return a queued job that you wait on, or a completed cache hit. After the job finishes, iterate over comparison.replays to render rows, choose a candidate for deeper inspection, or send the result into an internal migration review.

const job = await client.replays.create({
  traceId: "550e8400-e29b-41d4-a716-446655440000",
  compareModels: ["gpt-4o-mini", "gpt-3.5-turbo", "claude-haiku-4-5-20251001"],
  maxCost: 0.01,
});

const result = job.data.status === "completed"
  ? job
  : await client.replays.wait(job.data.jobId!);

for (const replay of result.data.comparison?.replays ?? []) {
  console.log(`${replay.model}: $${replay.totalCost} (${replay.optimizationGrade})`);
}

Python uses the same snake_case compare_models field as Ruby and the raw HTTP request; the TypeScript SDK spells it compareModels. When the job completes, iterate over comparison.replays; each row is one candidate model scored against the same source trace.

import os
import olyx

client = olyx.Client(api_key=os.environ["OLYX_API_KEY"])

job = client.replays.create(
    trace_id="550e8400-e29b-41d4-a716-446655440000",
    compare_models=[
        "gpt-4o-mini",
        "gpt-3.5-turbo",
        "claude-haiku-4-5-20251001",
    ],
    max_cost=0.01,
)

result = job if job.status == "completed" else client.replays.wait(job.job_id)

for replay in result.comparison["replays"]:
    print(f"{replay['model']}: ${replay['total_cost']} ({replay['optimization_grade']})")

Ruby returns the same data shape using snake_case request fields and JSON-style response keys.

job = client.replays.create(
  trace_id:       "550e8400-e29b-41d4-a716-446655440000",
  compare_models: ["gpt-4o-mini", "gpt-3.5-turbo", "claude-haiku-4-5-20251001"],
  max_cost:       0.01
)

result = job["status"] == "completed" ? job : client.replays.wait(job.job_id)

result.dig("comparison", "replays").each do |replay|
  puts "#{replay["model"]}: $#{replay["total_cost"]} (#{replay["optimization_grade"]})"
end

The Replay dashboard page renders multi-model comparisons with per-metric deltas and a suggested best candidate. Treat the recommendation as a starting point: inspect output quality and safety behavior before routing production traffic to the new model.
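If you want similar numbers outside the dashboard, per-metric deltas are simple to compute from the response. The sketch below is not the dashboard's logic, which is not documented here: metric_deltas and suggest_candidate are hypothetical helpers, and "lowest cost, then lowest latency" is an assumed ranking rule. The sample values are copied from the multi-model response shown earlier.

```python
def metric_deltas(source, replay):
    """Replay minus source per metric; negative means cheaper or faster."""
    return {
        "model": replay.get("model"),
        "cost_delta": round(replay["total_cost"] - source["total_cost"], 6),
        "latency_delta_ms": round(
            replay["total_latency_ms"] - source["total_latency_ms"], 3
        ),
    }


def suggest_candidate(replays):
    """Pick the lowest-cost row, breaking ties on latency (an assumed rule)."""
    return min(replays, key=lambda row: (row["total_cost"], row["total_latency_ms"]))


source = {"total_cost": 0.012, "total_latency_ms": 1240.0}
replays = [
    {"model": "gpt-4o-mini", "total_cost": 0.003, "total_latency_ms": 870.0},
    {"model": "gpt-3.5-turbo", "total_cost": 0.001, "total_latency_ms": 1040.0},
]
deltas = [metric_deltas(source, row) for row in replays]
best = suggest_candidate(replays)
```

A cost-first rule picks gpt-3.5-turbo here even though gpt-4o-mini is faster, which is one reason to treat any automated suggestion as a starting point.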

Comparison Fields

Use these fields for dashboards, migration reports, and automated candidate ranking. The same field names appear in single-model and multi-model responses, except for model, which only appears on rows inside comparison.replays.

| Field | Description |
| --- | --- |
| total_cost | Estimated USD cost across replayed run-like steps. |
| optimization_grade | Composite letter grade for cost and latency efficiency. |
| grades | Detailed grade object, usually including overall, waste, and latency. |
| total_latency_ms | Sum of replayed run-step latency. |
| models_used | Actual model identifiers recorded on the replay trace. |
| model | Present only in comparison.replays[]; identifies the candidate represented by that row. |
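For reports or ranking code, it can help to parse each summary into a typed record. The dataclass below is a sketch built from the fields in the table; ReplaySummary is a hypothetical name and not part of any SDK. It treats grades and models_used as optional, which is a defensive assumption, since the table describes grades as "usually" containing certain keys.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ReplaySummary:
    """One comparison summary: the source baseline or a replay row."""

    total_cost: float
    optimization_grade: str
    grades: Dict[str, str]
    total_latency_ms: float
    models_used: Tuple[str, ...]
    model: Optional[str] = None  # only set on comparison.replays[] rows

    @classmethod
    def from_dict(cls, data):
        return cls(
            total_cost=float(data["total_cost"]),
            optimization_grade=data["optimization_grade"],
            grades=dict(data.get("grades", {})),
            total_latency_ms=float(data["total_latency_ms"]),
            models_used=tuple(data.get("models_used", [])),
            model=data.get("model"),
        )


row = ReplaySummary.from_dict(
    {
        "model": "gpt-4o-mini",
        "total_cost": 0.003,
        "optimization_grade": "A",
        "grades": {"overall": "A", "waste": "A", "latency": "A"},
        "total_latency_ms": 870.0,
        "models_used": ["gpt-4o-mini"],
    }
)
```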

Manual Polling

Use manual polling when you want tighter control over timeout, UI progress state, or cancellation behavior. This is the lower-level version of wait.

const job = await client.replays.create({
  traceId: "550e8400-e29b-41d4-a716-446655440000",
  forceModel: "gpt-4o-mini",
});

if (job.data.status === "queued" && job.data.jobId) {
  const jobId = job.data.jobId;
  let current = await client.replays.poll(jobId);

  // Keep polling until the job reaches a terminal state.
  while (current.data.status === "queued" || current.data.status === "running") {
    await new Promise((resolve) => setTimeout(resolve, 1500));
    current = await client.replays.poll(jobId);
  }

  if (current.data.status === "failed") {
    throw new Error(current.data.error ?? "Replay failed");
  }

  console.log(current.data.comparison);
} else if (job.data.status === "completed") {
  // Cache hit: create already returned the completed comparison.
  console.log(job.data.comparison);
}

The Ruby example is intentionally explicit: it keeps polling until completion, raises on failure, and sleeps between requests so the client does not hammer the API.

job = client.replays.create(
  trace_id:    "550e8400-e29b-41d4-a716-446655440000",
  force_model: "gpt-4o-mini"
)

unless job["status"] == "completed"
  loop do
    current = client.replays.poll(job.job_id)

    if current["status"] == "completed"
      puts current["comparison"].inspect
      break
    end

    # Stop on the terminal failure state and surface the server's error.
    raise(current["error"] || "Replay failed") if current["status"] == "failed"

    sleep 1.5 # pause so the client does not hammer the API
  end
end
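Neither loop above enforces a deadline. If you need one, a bounded polling loop might look like the following sketch. wait_for_replay is a hypothetical helper: fetch_status stands in for whatever performs GET /api/v1/replay/{job_id} in your client and returns the parsed body, and the 60-second default timeout is an arbitrary choice.

```python
import time


def wait_for_replay(fetch_status, job_id, timeout_s=60.0, interval_s=1.5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the job completes, fails, or the deadline passes.

    fetch_status(job_id) must return the parsed polling response as a dict.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        current = fetch_status(job_id)
        status = current.get("status")
        if status == "completed":
            return current
        if status == "failed":
            raise RuntimeError(current.get("error") or "Replay failed")
        if status not in ("queued", "running"):
            raise RuntimeError(f"unexpected replay status: {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"replay {job_id} still {status} after {timeout_s}s")
        sleep(interval_s)


# Usage with a fake fetcher, so the example runs without network access.
responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "completed", "comparison": {}},
])
result = wait_for_replay(lambda job_id: next(responses), "job-1", sleep=lambda s: None)
```

Raising on an unrecognized status, rather than looping, keeps the client from polling forever if the status vocabulary changes.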

Cost Caps

max_cost blocks replay run-like steps whose estimated cost exceeds your threshold. Use it while exploring cheaper model alternatives or benchmarking a large source trace. It is still a replay guard, not a production billing control: use project spend limits, model registry rates, and provider-side controls for hard production budget enforcement.

flowchart TD
  STEP[RUN LIKE STEP]
  ESTIMATE[ESTIMATE CANDIDATE COST]
  CHECK{WITHIN MAX COST}
  CONTINUE[CONTINUE REPLAY]
  BLOCK[RECORD BLOCKED STEP]
  RESULT[INCLUDE IN REPLAY TRACE]
  STEP --> ESTIMATE
  ESTIMATE --> CHECK
  CHECK -->|yes| CONTINUE
  CHECK -->|no| BLOCK
  CONTINUE --> RESULT
  BLOCK --> RESULT

This request tests a fanout-style replay path under a small budget. force_models supplies the candidate set and max_cost gives the replay engine permission to block any candidate step whose estimate is too high.

POST /api/v1/replay
Authorization: Bearer ak_<key_id>.<secret>
Content-Type: application/json

{
  "trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "force_models": ["gpt-4o-mini", "gpt-4o"],
  "max_cost": 0.002
}

Steps blocked by the cost guard are recorded in the replay trace. A blocked step is still observable, even though the replay did not execute or simulate that candidate all the way through.

{
  "type": "blocked",
  "output": {
    "reason": "max_cost_exceeded",
    "model": "gpt-4o",
    "estimated_cost": 0.0031,
    "limit": 0.002
  }
}

If every fanout candidate is pruned by the budget guard, the job can fail and return the error through the polling endpoint. That failure is useful: it tells you the candidate set is too expensive for the guard you chose.
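The guard's decision can be modeled in a few lines, which is handy for reasoning about a threshold before you submit a replay. This is an illustrative sketch, not the replay engine's implementation: apply_cost_guard is a hypothetical helper, the estimates are inputs you would have to supply yourself, and the gpt-4o-mini figure is made up. Only the gpt-4o estimate and the limit come from the blocked step shown above.

```python
def apply_cost_guard(estimates, max_cost):
    """Split candidates into allowed models and blocked-step records.

    estimates maps a model identifier to its estimated USD cost for one
    run-like step. A step is blocked only when its estimate exceeds max_cost.
    """
    allowed, blocked = [], []
    for model, estimated_cost in estimates.items():
        if estimated_cost > max_cost:
            blocked.append({
                "type": "blocked",
                "output": {
                    "reason": "max_cost_exceeded",
                    "model": model,
                    "estimated_cost": estimated_cost,
                    "limit": max_cost,
                },
            })
        else:
            allowed.append(model)
    return allowed, blocked


allowed, blocked = apply_cost_guard(
    {"gpt-4o-mini": 0.0004, "gpt-4o": 0.0031},  # gpt-4o-mini value is hypothetical
    max_cost=0.002,
)
all_pruned = not allowed  # True would correspond to the failure case described above
```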

Caching

Completed replay results are cached for one hour using the source trace id and override payload. Submitting the same replay within that window can return the completed result immediately.

Caching is useful for dashboards and docs examples because a user can refresh without enqueueing duplicate replay work. Change any override value to request a fresh comparison.
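To see why changing any override produces a fresh comparison, it helps to picture the cache key as a digest of the trace id plus the override payload. How the service actually derives its key is not documented here; the sketch below is one plausible scheme, with replay_cache_key as a hypothetical helper. In particular, treating key order as irrelevant is an assumption.

```python
import hashlib
import json


def replay_cache_key(trace_id, overrides):
    """Derive a stable key from the trace id and a canonicalized override payload."""
    canonical = json.dumps(
        {"trace_id": trace_id, "overrides": overrides},
        sort_keys=True,            # assumption: key order does not matter
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


trace = "550e8400-e29b-41d4-a716-446655440000"
key_a = replay_cache_key(trace, {"force_model": "gpt-4o-mini", "max_cost": 0.005})
key_b = replay_cache_key(trace, {"max_cost": 0.005, "force_model": "gpt-4o-mini"})
key_c = replay_cache_key(trace, {"force_model": "gpt-4o-mini", "max_cost": 0.006})
# key_a == key_b (same payload), key_a != key_c (max_cost changed)
```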

Replays vs. New Traces

A replay and a new trace both produce an operational record, but they answer different questions. A new trace captures work your application is doing now. A replay starts from a trace you already captured and asks how that same work would behave under a different model path, budget guard, or routing configuration.

The practical rule is: create a new trace for every new user-visible action; create a replay when you are evaluating an old action under a proposed change. That keeps production observability separate from migration analysis.

| Dimension | Replay | New Trace |
| --- | --- | --- |
| Primary use | Compare a historical trace against an alternative model path. | Record new production traffic. |
| Input | Reused from an existing trace. | Sent by your application now. |
| Routing | Copied, overridden, or simulated depending on overrides. | Uses current project routing. |
| Source trace | Never modified; linked through metadata.source_trace_id. | Not applicable; this is the original record. |
| Output inspection | Useful for model migration, regression analysis, and cost experiments. | Useful for live product behavior and debugging. |
| Cost reporting | Operational estimate for comparison. | Operational trace cost for current traffic. |

Use replays to choose what to test next. After you change production routing, use new traces to observe what actually happened for live requests.

For the deeper contract details, see Override Options, Error Responses, Comparison Fields, Manual Polling, Cost Caps, and Caching.
