Replays
A replay creates a new trace from an existing trace so you can compare an alternative model path against historical
traffic. The source trace is never modified, and the replay links back through metadata.source_trace_id.
Use replays when you want to answer operational questions before changing production routing:
| Question | Replay helps you inspect |
|---|---|
| Could a cheaper model handle this workload? | Cost, latency, grades, and models used on the replay trace |
| Did a routing change improve behavior? | Source vs. replay summaries with the same source input |
| Which model should handle this class of task? | Multi-model comparison across registered model identifiers |
| Did a provider or model migration change outcomes? | New trace output and step behavior against the old source trace |
| Why did a trace become expensive? | Run-step cost, fanout behavior, and replayed model choices |
During closed beta, treat replay output as an operational comparison tool, not as a contractual quality benchmark or billing record. Multi-model comparisons simulate from the original run steps and configured model rates. Single-model replay can try live execution when the gateway allows it and can fall back to simulation if needed.
What Replays Do
Think of a replay as a source trace plus a new replay trace.
| Behavior | Source trace | Replay trace |
|---|---|---|
| Trace identity | Original production trace | New trace with status: "replay" |
| Input | Historical input already captured on the source trace | Reused from the source trace |
| Safety check | Original recorded check result | Re-applied with the current guardrail engine before later steps continue |
| Model choice | Whatever ran originally | Overridden by force_model, simulated across compare_models, or copied when no override is provided |
| Cost and latency | Original recorded values | Recomputed or simulated for the replay path, subject to max_cost |
| Original data | Not modified | Linked by metadata.source_trace_id |
Replays are not a substitute for live production traffic or compliance re-certification. They help answer whether the alternative path is worth testing further.
Starting a Replay
Start a replay with POST /api/v1/replay. The minimum useful request contains the source trace_id; add
force_model when you want one replacement model and max_cost when you want to stop expensive run-like steps.
POST /api/v1/replay
Authorization: Bearer ak_<key_id>.<secret>
Content-Type: application/json
{
"trace_id": "550e8400-e29b-41d4-a716-446655440000",
"force_model": "gpt-4o-mini",
"max_cost": 0.005
}
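For clients that do not use an SDK, the request above can be assembled with the Python standard library. This is a minimal sketch: `BASE_URL` is a hypothetical placeholder host and `build_replay_request` is an illustrative helper name, neither of which comes from the API itself. The function only builds the request and leaves sending to the caller.

```python
import json
import urllib.request

# Hypothetical base URL (assumption); substitute the host for your deployment.
BASE_URL = "https://api.example.com"


def build_replay_request(api_key: str, trace_id: str, **overrides) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/replay request."""
    payload = {"trace_id": trace_id}
    # Skip overrides left as None so the body only carries fields that were set.
    payload.update({key: value for key, value in overrides.items() if value is not None})
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/replay",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_replay_request(
    "ak_<key_id>.<secret>",
    "550e8400-e29b-41d4-a716-446655440000",
    force_model="gpt-4o-mini",
    max_cost=0.005,
)
# Sending is left to the caller, for example urllib.request.urlopen(request).
```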
The endpoint is asynchronous. When no cached result exists, it queues a job and returns 202 Accepted; the important
value is job_id, which your client stores temporarily and sends to the polling endpoint.
{
"job_id": "a3f9c1d8e72b9f85dd40c123",
"source_trace_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "queued",
"overrides": {
"force_model": "gpt-4o-mini",
"max_cost": 0.005
}
}
If the same trace and same overrides were completed recently, the endpoint can return 200 OK with the cached
comparison immediately. Clients can treat that response as the final result and skip polling.
{
"status": "completed",
"source_trace_id": "550e8400-e29b-41d4-a716-446655440000",
"replay_trace_id": "replay_4a2c9f",
"overrides": {
"force_model": "gpt-4o-mini"
},
"comparison": {
"source": {
"total_cost": 0.012,
"optimization_grade": "B",
"grades": {
"overall": "B",
"waste": "C",
"latency": "B"
},
"total_latency_ms": 1240.0,
"models_used": ["gpt-4o"]
},
"replay": {
"total_cost": 0.003,
"optimization_grade": "A",
"grades": {
"overall": "A",
"waste": "A",
"latency": "A"
},
"total_latency_ms": 870.0,
"models_used": ["gpt-4o-mini"]
}
}
}
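A client working with raw response bodies has to branch on the two create outcomes described above. The sketch below works on the decoded JSON body only; `next_action` is an illustrative helper name, not part of any SDK.

```python
from typing import Optional, Tuple


def next_action(create_response: dict) -> Tuple[str, Optional[str]]:
    """Decide what to do with the body returned by POST /api/v1/replay.

    Returns ("done", None) for a cached, completed comparison and
    ("poll", job_id) when the server queued a job.
    """
    status = create_response.get("status")
    if status == "completed":
        # Cache hit: the body already holds the final comparison.
        return ("done", None)
    job_id = create_response.get("job_id")
    if job_id is None:
        raise ValueError(f"unexpected create response with status {status!r}")
    return ("poll", job_id)
```

Branching on the body's status field rather than on the HTTP status code alone keeps the same helper usable for both the 200 and the 202 response.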
Override Options
Use this table as the request contract for replay experiments. Start with one override, verify the comparison, then add cost guards or fanout behavior when you need tighter control.
| Field | Type | Description |
|---|---|---|
| force_model | string | Replay run-like steps with one target model identifier. Use this for a side-by-side comparison. |
| compare_models | string array | Create one replay path per model and return a multi-model comparison. |
| force_models | string array | Override the model list for fanout-style steps. |
| max_cost | number | Block replay run-like steps whose estimated cost exceeds this USD value. |
force_model and compare_models are mutually exclusive. Use force_model when you want one replay trace you can
inspect step by step. Use compare_models when you want a comparison table across several candidates.
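Checking an override payload locally can catch the most common mistakes before a request is sent. The sketch below mirrors the table and the mutual-exclusivity rule above; `validate_overrides` is an illustrative helper, and the server remains the authority on what counts as a valid override.

```python
from typing import List

KNOWN_OVERRIDES = {"force_model", "compare_models", "force_models", "max_cost"}


def _is_string_list(value) -> bool:
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def validate_overrides(overrides: dict) -> List[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    unknown = sorted(set(overrides) - KNOWN_OVERRIDES)
    if unknown:
        problems.append("unknown override(s): " + ", ".join(unknown))
    if "force_model" in overrides and "compare_models" in overrides:
        problems.append("force_model and compare_models are mutually exclusive")
    if "force_model" in overrides and not isinstance(overrides["force_model"], str):
        problems.append("force_model must be a string")
    for field in ("compare_models", "force_models"):
        if field in overrides and not _is_string_list(overrides[field]):
            problems.append(f"{field} must be an array of strings")
    if "max_cost" in overrides:
        max_cost = overrides["max_cost"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_cost, bool) or not isinstance(max_cost, (int, float)):
            problems.append("max_cost must be a number")
    return problems
```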
Error Responses
Replay errors are intentionally narrow. They mean either that the source trace cannot be used or that the override payload needs correction before a job can be queued.
| Code | Error | Cause |
|---|---|---|
| 404 | Resource not found | trace_id does not exist or belongs to another customer. |
| 422 | invalid_override | Overrides are inconsistent, unknown, or invalid. |
For validation errors, the message field should be clear enough for developer tooling or logs to show what needs to
change before the replay can be queued.
{
"error": "invalid_override",
"code": "invalid_override",
"message": "force_model and compare_models are mutually exclusive"
}
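One way to surface these two cases in client code is to map the status code and decoded body to distinct exception types. The class and function names below are illustrative, not part of an SDK, and the sketch assumes the body has already been parsed into a dict.

```python
class ReplayError(Exception):
    """Base class for replay request failures."""


class TraceNotFound(ReplayError):
    """The source trace does not exist or belongs to another customer (404)."""


class InvalidOverride(ReplayError):
    """The override payload was rejected (422)."""


def raise_for_replay_error(status_code: int, body: dict) -> None:
    """Raise a typed error for a failed create call; return None on 200 or 202."""
    if status_code in (200, 202):
        return
    # Prefer the human-readable message, then fall back to the error field.
    message = body.get("message") or body.get("error") or "unknown error"
    if status_code == 404:
        raise TraceNotFound(message)
    if status_code == 422:
        raise InvalidOverride(message)
    raise ReplayError(f"unexpected status {status_code}: {message}")
```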
Polling Job Status
Poll the job_id returned by the create request until the replay completes or fails. Polling is read-only: it does
not enqueue another replay and it does not change the source or replay trace.
GET /api/v1/replay/a3f9c1d8e72b9f85dd40c123
Authorization: Bearer ak_<key_id>.<secret>
Queued and running jobs return a small status object. Keep polling while the status is queued or running, usually
with a short delay between attempts.
{
"status": "running"
}
A completed single-model replay includes both the source summary and replay summary. comparison.source is the
baseline from the original trace, while comparison.replay is the newly generated replay path.
{
"status": "completed",
"source_trace_id": "550e8400-e29b-41d4-a716-446655440000",
"replay_trace_id": "replay_4a2c9f",
"overrides": {
"force_model": "gpt-4o-mini"
},
"comparison": {
"source": {
"total_cost": 0.012,
"optimization_grade": "B",
"grades": {
"overall": "B",
"waste": "C",
"latency": "B"
},
"total_latency_ms": 1240.0,
"models_used": ["gpt-4o"]
},
"replay": {
"total_cost": 0.003,
"optimization_grade": "A",
"grades": {
"overall": "A",
"waste": "A",
"latency": "A"
},
"total_latency_ms": 870.0,
"models_used": ["gpt-4o-mini"]
}
}
}
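The two summaries can be reduced to a few deltas for a report or a log line. This is a small sketch over the comparison object shown above; `summarize_delta` and its output keys are illustrative names.

```python
from typing import Optional


def summarize_delta(comparison: dict) -> dict:
    """Compute replay-minus-source deltas for cost and latency."""
    source = comparison["source"]
    replay = comparison["replay"]
    cost_delta = replay["total_cost"] - source["total_cost"]
    latency_delta_ms = replay["total_latency_ms"] - source["total_latency_ms"]
    # Percent change is undefined when the source cost is zero.
    cost_change_pct: Optional[float] = None
    if source["total_cost"]:
        cost_change_pct = cost_delta / source["total_cost"] * 100
    return {
        "cost_delta": cost_delta,
        "cost_change_pct": cost_change_pct,
        "latency_delta_ms": latency_delta_ms,
        "grade_change": (source["optimization_grade"], replay["optimization_grade"]),
    }
```

Negative deltas mean the replay path was cheaper or faster than the source trace. With the example response above, the replay costs roughly 75% less and is 370 ms faster.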
A completed multi-model comparison returns comparison.replays instead of a single comparison.replay object. Each
item in that array represents one candidate model, which keeps the comparison table normalized.
{
"status": "completed",
"source_trace_id": "550e8400-e29b-41d4-a716-446655440000",
"overrides": {
"compare_models": ["gpt-4o-mini", "gpt-3.5-turbo", "claude-haiku-4-5-20251001"]
},
"comparison": {
"source": {
"total_cost": 0.012,
"optimization_grade": "B",
"grades": {
"overall": "B",
"waste": "C",
"latency": "B"
},
"total_latency_ms": 1240.0,
"models_used": ["gpt-4o"]
},
"replays": [
{
"model": "gpt-4o-mini",
"total_cost": 0.003,
"optimization_grade": "A",
"grades": {
"overall": "A",
"waste": "A",
"latency": "A"
},
"total_latency_ms": 870.0,
"models_used": ["gpt-4o-mini"]
},
{
"model": "gpt-3.5-turbo",
"total_cost": 0.001,
"optimization_grade": "A",
"grades": {
"overall": "A",
"waste": "A",
"latency": "B"
},
"total_latency_ms": 1040.0,
"models_used": ["gpt-3.5-turbo"]
}
]
}
}
Failures are returned from the polling endpoint. The job has reached a terminal state, so clients should stop polling
and surface the error value.
{
"status": "failed",
"error": "Source trace not found"
}
An unknown job id returns 404. This usually means the job id was typed incorrectly, expired from the short-lived job
store, or belongs to a different project context.
{
"error": "Job not found"
}
Single-Model Replay
Single-model replay is the normal workflow when you want to inspect the replay trace itself.
The TypeScript SDK wraps the create and polling endpoints. client.replays.create may return a completed cache hit
immediately; otherwise client.replays.wait polls until the job is completed or failed.
import Olyx from "@olyx-labs/olyx";
const client = new Olyx({ apiKey: process.env.OLYX_API_KEY! });
const job = await client.replays.create({
traceId: "550e8400-e29b-41d4-a716-446655440000",
forceModel: "gpt-4o-mini",
maxCost: 0.005,
});
const result = job.data.status === "completed"
? job
: await client.replays.wait(job.data.jobId!);
console.log(result.data.comparison?.replay?.totalCost);
console.log(result.data.comparison?.replay?.optimizationGrade);
The Python SDK follows the same resource shape: create the replay job, use a completed cache hit immediately, or wait
on the returned job_id when the backend queues work.
import os
import olyx
client = olyx.Client(api_key=os.environ["OLYX_API_KEY"])
job = client.replays.create(
trace_id="550e8400-e29b-41d4-a716-446655440000",
force_model="gpt-4o-mini",
max_cost=0.005,
)
result = job if job.status == "completed" else client.replays.wait(job.job_id)
print(result.comparison["replay"]["total_cost"])
print(result.comparison["replay"]["optimization_grade"])
The Ruby SDK follows the same flow. Create the job, check whether it already completed, then wait on job_id only
when the server queued asynchronous work.
job = client.replays.create(
trace_id: "550e8400-e29b-41d4-a716-446655440000",
force_model: "gpt-4o-mini",
max_cost: 0.005
)
result = job["status"] == "completed" ? job : client.replays.wait(job.job_id)
puts result.dig("comparison", "replay", "total_cost")
puts result.dig("comparison", "replay", "optimization_grade")
Multi-Model Comparison
Use compare_models when you want to benchmark the same trace against several registered models in one job. This is
best for model migration work because the result gives you a normalized comparison across cost, latency, grades, and
models used.
The raw HTTP request uses the same endpoint as single-model replay, but switches from force_model to
compare_models. Use registered model identifiers here; the replay engine compares each candidate against the same
source trace so the metrics line up.
POST /api/v1/replay
Authorization: Bearer ak_<key_id>.<secret>
Content-Type: application/json
{
"trace_id": "550e8400-e29b-41d4-a716-446655440000",
"compare_models": ["gpt-4o-mini", "gpt-3.5-turbo", "claude-haiku-4-5-20251001"],
"max_cost": 0.01
}
In SDK code, the comparison result is still asynchronous. After the job finishes, iterate over comparison.replays to
render rows, choose a candidate for deeper inspection, or send the result into an internal migration review.
const job = await client.replays.create({
traceId: "550e8400-e29b-41d4-a716-446655440000",
compareModels: ["gpt-4o-mini", "gpt-3.5-turbo", "claude-haiku-4-5-20251001"],
maxCost: 0.01,
});
const result = job.data.status === "completed"
? job
: await client.replays.wait(job.data.jobId!);
for (const replay of result.data.comparison?.replays ?? []) {
console.log(`${replay.model}: $${replay.totalCost} (${replay.optimizationGrade})`);
}
Python uses the same snake_case compare_models field as Ruby; TypeScript spells it compareModels. When the job completes, iterate over
comparison.replays; each row is one candidate model scored against the same source trace.
import os
import olyx
client = olyx.Client(api_key=os.environ["OLYX_API_KEY"])
job = client.replays.create(
trace_id="550e8400-e29b-41d4-a716-446655440000",
compare_models=[
"gpt-4o-mini",
"gpt-3.5-turbo",
"claude-haiku-4-5-20251001",
],
max_cost=0.01,
)
result = job if job.status == "completed" else client.replays.wait(job.job_id)
for replay in result.comparison["replays"]:
    print(f"{replay['model']}: ${replay['total_cost']} ({replay['optimization_grade']})")
Ruby returns the same data shape using snake_case request fields and JSON-style response keys.
job = client.replays.create(
trace_id: "550e8400-e29b-41d4-a716-446655440000",
compare_models: ["gpt-4o-mini", "gpt-3.5-turbo", "claude-haiku-4-5-20251001"],
max_cost: 0.01
)
result = job["status"] == "completed" ? job : client.replays.wait(job.job_id)
result.dig("comparison", "replays").each do |replay|
puts "#{replay["model"]}: $#{replay["total_cost"]} (#{replay["optimization_grade"]})"
end
The Replay dashboard page renders multi-model comparisons with per-metric deltas and a suggested best candidate. Treat the recommendation as a starting point: inspect output quality and safety behavior before routing production traffic to the new model.
Comparison Fields
Use these fields for dashboards, migration reports, and automated candidate ranking. The same field names appear in
single-model and multi-model responses, except for model, which only appears on rows inside comparison.replays.
| Field | Description |
|---|---|
| total_cost | Estimated USD cost across replayed run-like steps. |
| optimization_grade | Composite letter grade for cost and latency efficiency. |
| grades | Detailed grade object, usually including overall, waste, and latency. |
| total_latency_ms | Sum of replayed run-step latency. |
| models_used | Actual model identifiers recorded on the replay trace. |
| model | Present only in comparison.replays[]; identifies the candidate represented by that row. |
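For automated candidate ranking, the rows in comparison.replays can be sorted on these fields. The ordering below (grade first, then cost, then latency) is one possible policy rather than the dashboard's recommendation logic. The grade scale in `GRADE_RANK` is an assumption: only A, B, and C appear in the examples here, and unrecognized grades sort last.

```python
from typing import List

# Assumed letter scale; only A, B, and C appear in the documented examples.
GRADE_RANK = {"A": 0, "B": 1, "C": 2, "D": 3, "F": 4}


def rank_candidates(replays: List[dict]) -> List[dict]:
    """Sort comparison.replays rows: best grade, then lowest cost, then lowest latency."""
    def sort_key(row: dict):
        grade = GRADE_RANK.get(row.get("optimization_grade"), len(GRADE_RANK))
        return (grade, row["total_cost"], row["total_latency_ms"])

    # sorted() returns a new list and leaves the response object untouched.
    return sorted(replays, key=sort_key)
```

A ranking like this only narrows the candidate list. Output quality and safety behavior still need inspection on the replay trace itself.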
Manual Polling
Use manual polling when you want tighter control over timeout, UI progress state, or cancellation behavior. This is the
lower-level version of wait.
const job = await client.replays.create({
traceId: "550e8400-e29b-41d4-a716-446655440000",
forceModel: "gpt-4o-mini",
});
if (job.data.status === "queued" && job.data.jobId) {
  let current = await client.replays.poll(job.data.jobId);
  while (current.data.status === "queued" || current.data.status === "running") {
    await new Promise((resolve) => setTimeout(resolve, 1500));
    current = await client.replays.poll(job.data.jobId);
  }
  if (current.data.status === "failed") {
    throw new Error(current.data.error ?? "Replay failed");
  }
  console.log(current.data.comparison);
}
The Ruby example is intentionally explicit: it keeps polling until completion, raises on failure, and sleeps between requests so the client does not hammer the API.
job = client.replays.create(
trace_id: "550e8400-e29b-41d4-a716-446655440000",
force_model: "gpt-4o-mini"
)
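unless job["status"] == "completed"
  loop do
    current = client.replays.poll(job.job_id)
    case current["status"]
    when "completed"
      # Terminal state: print the comparison and stop polling.
      puts current["comparison"].inspect
      break
    when "failed"
      # Fall back to a generic message if the server sent no error text.
      raise current["error"] || "Replay failed"
    end
    sleep 1.5
  end
end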
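The same loop can be written in Python with an explicit timeout. In this sketch `poll` is any callable that takes a job id and returns the decoded status body. It is injected as a parameter so the example makes no assumption about the Python SDK's polling method, which is not shown in this section. The `delay_s` and `timeout_s` defaults are illustrative values.

```python
import time
from typing import Callable


def wait_for_replay(
    poll: Callable[[str], dict],
    job_id: str,
    delay_s: float = 1.5,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the job is completed or failed, or until timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        current = poll(job_id)
        status = current.get("status")
        if status == "completed":
            return current
        if status == "failed":
            # Terminal state: stop polling and surface the error value.
            raise RuntimeError(current.get("error") or "Replay failed")
        if status not in ("queued", "running"):
            raise RuntimeError(f"unexpected status: {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"replay job {job_id} did not finish in {timeout_s}s")
        sleep(delay_s)
```

Passing `sleep` and `clock` as parameters makes the loop easy to test without real delays, and gives a UI layer a place to hook in progress updates or cancellation.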
Cost Caps
max_cost blocks replay run-like steps whose estimated cost exceeds your threshold. Use it while exploring cheaper
model alternatives or benchmarking a large source trace. It is still a replay guard, not a production billing control:
use project spend limits, model registry rates, and provider-side controls for hard production budget enforcement.
This request tests a fanout-style replay path under a small budget. force_models supplies the candidate set and
max_cost gives the replay engine permission to block any candidate step whose estimate is too high.
POST /api/v1/replay
Authorization: Bearer ak_<key_id>.<secret>
Content-Type: application/json
{
"trace_id": "550e8400-e29b-41d4-a716-446655440000",
"force_models": ["gpt-4o-mini", "gpt-4o"],
"max_cost": 0.002
}
Steps blocked by the cost guard are recorded in the replay trace. A blocked step is still observable, even though the replay did not execute or simulate that candidate all the way through.
{
"type": "blocked",
"output": {
"reason": "max_cost_exceeded",
"model": "gpt-4o",
"estimated_cost": 0.0031,
"limit": 0.002
}
}
If every fanout candidate is pruned by the budget guard, the job can fail and return the error through the polling endpoint. That failure is useful: it tells you the candidate set is too expensive for the guard you chose.
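When inspecting a replay trace, blocked steps can be pulled out to see which candidates the guard pruned. The sketch assumes the trace's steps are available as a list of dicts shaped like the blocked step above; how that list is fetched is not covered here, and the helper names are illustrative.

```python
from typing import Iterable, List


def blocked_steps(steps: Iterable[dict]) -> List[dict]:
    """Return the output object of every step blocked by the max_cost guard."""
    return [
        step["output"]
        for step in steps
        if step.get("type") == "blocked"
        and (step.get("output") or {}).get("reason") == "max_cost_exceeded"
    ]


def all_candidates_blocked(steps: Iterable[dict], candidates: List[str]) -> bool:
    """True when every model in candidates has a blocked step in the trace."""
    blocked_models = {output["model"] for output in blocked_steps(steps)}
    return bool(candidates) and set(candidates) <= blocked_models
```

If `all_candidates_blocked` returns True for the force_models list you submitted, raising max_cost or choosing cheaper candidates is the likely next step.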
Caching
Completed replay results are cached for one hour, keyed by the source trace id and the override payload. Submitting the same replay within that window can return the completed result immediately.
Caching is useful for dashboards and docs examples because a user can refresh without enqueueing duplicate replay work. Change any override value to request a fresh comparison.
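A client that wants to avoid submitting duplicate requests can compute its own key from the same two inputs. The server's actual key derivation is not documented here, so this is a local approximation for client-side deduplication only; `replay_cache_key` is an illustrative helper.

```python
import hashlib
import json


def replay_cache_key(trace_id: str, overrides: dict) -> str:
    """Derive a stable local key from the trace id and override payload."""
    # sort_keys makes the key independent of the order in which overrides were set.
    canonical = json.dumps(
        {"trace_id": trace_id, "overrides": overrides},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

In this sketch array order still matters, so the same compare_models list in a different order produces a different key. Whether the server treats reordered lists as the same request is not specified.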
Replays vs. New Traces
A replay and a new trace both produce an operational record, but they answer different questions. A new trace captures work your application is doing now. A replay starts from a trace you already captured and asks how that same work would behave under a different model path, budget guard, or routing configuration.
The practical rule is: create a new trace for every new user-visible action; create a replay when you are evaluating an old action under a proposed change. That keeps production observability separate from migration analysis.
| Dimension | Replay | New Trace |
|---|---|---|
| Primary use | Compare a historical trace against an alternative model path. | Record new production traffic. |
| Input | Reused from an existing trace. | Sent by your application now. |
| Routing | Copied, overridden, or simulated depending on overrides. | Uses current project routing. |
| Source trace | Never modified; linked through metadata.source_trace_id. | Not applicable; this is the original record. |
| Output inspection | Useful for model migration, regression analysis, and cost experiments. | Useful for live product behavior and debugging. |
| Cost reporting | Operational estimate for comparison. | Operational trace cost for current traffic. |
Use replays to choose what to test next. After you change production routing, use new traces to observe what actually happened for live requests.
For the deeper contract details, see Override Options, Error Responses, Comparison Fields, Manual Polling, Cost Caps, and Caching.