Performance

Olyx is designed to keep gateway work small compared with model latency, but closed-beta performance numbers should be treated as measurement guidance, not contractual guarantees. Your observed latency depends on region placement, database connectivity, provider latency, tool execution, and whether a request uses an outbound private agent route.

The practical goal is simple: measure the full path your users experience, then use traces to separate model time, tool time, and Olyx-recorded workflow time.

Measurement Model

Think about latency in segments. Olyx can make the request observable, but it cannot remove provider latency, client network distance, or time your application spends inside tools.

flowchart LR
  CLIENT[CLIENT APP]
  GATEWAY[OLYX GATEWAY]
  PROVIDER[MODEL PROVIDER]
  TOOL[OPTIONAL TOOL OR MCP SERVER]
  TRACE[TRACE SUMMARY]
  CLIENT --> GATEWAY
  GATEWAY --> PROVIDER
  PROVIDER --> GATEWAY
  GATEWAY --> CLIENT
  CLIENT -. tool call .-> TOOL
  TOOL -. tool result .-> CLIENT
  GATEWAY --> TRACE

Use the table below to decide where to look before optimizing. The first question is not “is Olyx slow?” It is “which segment owns the extra time?”

Segment	What it includes
Client transit	Browser, server, or worker network time before the request reaches Olyx
Gateway work	Authentication, trace lookup, routing, core safety checks, and trace-step recording
Provider time	The selected model provider’s queueing, generation, and response time
Tool time	MCP server calls, internal API calls, database reads, or agent tool execution between turns
Trace completion	Summary calculation, grade calculation, and async analysis after the core work is recorded

Olyx does not currently expose a standalone gateway_overhead_ms field in every trace summary. During closed beta, measure gateway overhead with controlled comparisons: same region, same model, same prompt shape, and one variable changed at a time.
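A controlled comparison can be reduced to two lists of wall-clock samples and a difference of medians. The sketch below assumes the one variable you change is whether the call goes through Olyx or straight to the provider; `estimate_overhead_ms` is a hypothetical helper, not part of any SDK, and the sample values are made up for illustration rather than measured Olyx results.

```python
import statistics


def estimate_overhead_ms(direct_ms, gateway_ms):
    """Estimate added latency as the difference of medians between two runs.

    direct_ms: wall-clock samples for calls sent straight to the provider.
    gateway_ms: wall-clock samples for the same calls sent through the gateway.
    """
    if not direct_ms or not gateway_ms:
        raise ValueError("both sample lists must be non-empty")
    return statistics.median(gateway_ms) - statistics.median(direct_ms)


# Made-up numbers for illustration only.
direct = [812.0, 790.5, 845.2, 801.3, 799.9]
via_gateway = [830.1, 815.4, 861.0, 822.7, 818.6]

overhead = estimate_overhead_ms(direct, via_gateway)
print(f"estimated overhead: {overhead:.1f} ms")
```

Medians are used instead of means so that a single slow provider response does not dominate the estimate. Collect both lists in the same time window, since provider latency drifts over the day.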

Gateway Work Per Request

Every governed execution does a small amount of work before the selected model path runs. These steps are expected to be stable in a warm reference environment, but they are still affected by database and cache health.

Step	Closed-beta expectation
API key authentication	Fast lookup when keys and project state are warm
Trace lookup	Project-scoped trace resolution before execution starts
Core checks	In-process request checks before provider dispatch
Routing decision	Model selection from configured routing tiers and project state
Trace-step recording	Core step written on the request path; heavier summaries run after completion

Cost calculation, optimization grades, and workflow summaries run when the trace is completed. That keeps the primary execution path focused on the work needed to run the request and preserve the audit trail.

Reading Trace Latency

For application code, measure the SDK path you actually ship. Create a trace, execute inside it, complete it, then read the summary fields. Completion is important because summary fields such as total_latency_ms, chain_depth, and tool_overhead_ms are most useful once the workflow is closed.

TypeScript

import Olyx from "@olyx-labs/olyx";

const client = new Olyx({ apiKey: process.env.OLYX_API_KEY! });

const trace = await client.traces.create({
  metadata: { feature: "perf_probe", userId: "u_123" },
});

try {
  await client.execute({
    traceId: trace.data.id,
    input: "Return a concise status summary.",
  });
} finally {
  await client.traces.complete(trace.data.id);
}

const details = await client.traces.find(trace.data.id);
const summary = details.data.summary;

console.log({
  totalLatencyMs: summary?.totalLatencyMs,
  chainDepth: summary?.chainDepth,
  toolOverheadMs: summary?.toolOverheadMs,
});

Python

import os
import olyx

client = olyx.Olyx(
    api_key=os.environ["OLYX_API_KEY"],
    mock=False,
)

trace = client.traces.create(
    metadata={"feature": "perf_probe", "user_id": "u_123"}
)

try:
    client.execute(
        trace_id=trace.id,
        input="Return a concise status summary.",
    )
finally:
    client.traces.complete(trace.id)

details = client.traces.find(trace.id)
summary = details.summary or {}

print({
    "total_latency_ms": summary.get("total_latency_ms"),
    "chain_depth": summary.get("chain_depth"),
    "tool_overhead_ms": summary.get("tool_overhead_ms"),
})

Ruby

client = Olyx.new

trace = client.traces.create(
  metadata: { feature: "perf_probe", user_id: "u_123" }
)

begin
  client.execute(
    trace_id: trace.id,
    input: "Return a concise status summary."
  )
ensure
  client.traces.complete(trace.id)
end

details = client.traces.find(trace.id)
summary = details.summary || {}

puts({
  total_latency_ms: summary["total_latency_ms"],
  chain_depth: summary["chain_depth"],
  tool_overhead_ms: summary["tool_overhead_ms"]
})

Use total_latency_ms for the full recorded trace path. Use individual trace steps when you need to see which model run or tool call added the delay.
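Once you have the steps of a trace in hand, ranking them by latency is a quick way to find the segment that owns the extra time. The sketch below is illustrative only: it assumes each step is a dict with a `type` and a `latency_ms` field, which is a hypothetical shape and may not match what the SDK returns, and it does not show how the steps are fetched.

```python
def slowest_steps(steps, top=3):
    """Return the `top` steps with the largest latency, slowest first."""
    # Skip steps that carry no timing (assumed to be reported as None).
    timed = [s for s in steps if s.get("latency_ms") is not None]
    return sorted(timed, key=lambda s: s["latency_ms"], reverse=True)[:top]


# Hypothetical step records; field names are assumptions for this sketch.
steps = [
    {"type": "model_run", "latency_ms": 640},
    {"type": "tool_call", "latency_ms": 1250},
    {"type": "tool_result", "latency_ms": None},
    {"type": "model_run", "latency_ms": 410},
]

for step in slowest_steps(steps, top=2):
    print(step["type"], step["latency_ms"])
```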

MCP Workloads

MCP and tool-call workflows are multi-turn. They can be fast, but they have a different shape from a single prompt and response because each tool request adds application work and usually another model continuation.

sequenceDiagram
  participant App as App
  participant Olyx as Olyx
  participant Model as Model
  participant Tool as Tool
  App->>Olyx: execute with tools
  Olyx->>Model: model call
  Model-->>Olyx: tool call
  Olyx-->>App: tool calls pending
  App->>Tool: execute tool
  Tool-->>App: result
  App->>Olyx: continue with tool result
  Olyx->>Model: continuation
  Model-->>Olyx: final answer
  Olyx-->>App: output

For multi-turn tool chains, inspect chain_depth and tool_overhead_ms on completed traces.

Signal	How to read it
High `chain_depth` + high `tool_overhead_ms`	Tool execution is likely the bottleneck
High `chain_depth` + low `tool_overhead_ms`	Many fast tool calls; consider batching if the model loops
High `stall_probability`	Review termination conditions and repeated tool arguments

A stall_probability above 0.5 is a review signal, not a production incident by itself. Check the tool_call and tool_result steps for repeated calls with identical arguments.
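The check for repeated calls with identical arguments can be automated with a small counter. This is a minimal sketch under assumptions: the step dicts, and the `type`, `name`, and `arguments` field names, are hypothetical stand-ins for whatever shape your trace steps actually have, and `repeated_tool_calls` is not an SDK function.

```python
import json
from collections import Counter


def repeated_tool_calls(steps, min_repeats=2):
    """Count tool_call steps that share a tool name and identical arguments."""
    counts = Counter()
    for step in steps:
        if step.get("type") != "tool_call":
            continue
        # Serialize with sorted keys so key order does not hide duplicates.
        args = json.dumps(step.get("arguments", {}), sort_keys=True)
        counts[(step.get("name"), args)] += 1
    return {sig: n for sig, n in counts.items() if n >= min_repeats}


# Hypothetical steps; the first and third calls differ only in key order.
steps = [
    {"type": "tool_call", "name": "search", "arguments": {"q": "status", "limit": 5}},
    {"type": "tool_result", "name": "search"},
    {"type": "tool_call", "name": "search", "arguments": {"limit": 5, "q": "status"}},
    {"type": "tool_call", "name": "search", "arguments": {"q": "billing", "limit": 5}},
]

repeats = repeated_tool_calls(steps)
for (name, args), count in repeats.items():
    print(f"{name} called {count} times with {args}")
```

A non-empty result is a prompt to look at termination conditions, not proof of a loop: some workflows legitimately poll the same tool.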

Streaming

Olyx streams model output through the gateway path instead of waiting for the full response body. During closed beta, treat first-token latency as:

client transit + gateway work + provider first-token latency + response transit

If the user experience depends on perceived responsiveness, measure first-token time separately from full completion time. A page can feel fast when tokens start quickly even if the full answer takes longer.
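One way to capture both numbers is to wrap whatever chunk iterator your client exposes and record the time of the first chunk and of the last. The sketch below does not call the Olyx SDK; `time_stream` is a hypothetical helper and `fake_stream` is a stand-in generator that you would replace with your real streaming response.

```python
import time


def time_stream(chunks):
    """Consume an iterable of text chunks.

    Returns (first_chunk_ms, total_ms, text). first_chunk_ms is None
    when the stream yields nothing.
    """
    started = time.perf_counter()
    first_ms = None
    parts = []
    for chunk in chunks:
        if first_ms is None:
            first_ms = (time.perf_counter() - started) * 1000
        parts.append(chunk)
    total_ms = (time.perf_counter() - started) * 1000
    return first_ms, total_ms, "".join(parts)


def fake_stream():
    """Stand-in for a streaming response; each chunk arrives after a short delay."""
    for word in ["All ", "systems ", "nominal."]:
        time.sleep(0.02)
        yield word


first_ms, total_ms, text = time_stream(fake_stream())
print(f"first chunk: {first_ms:.0f} ms, full response: {total_ms:.0f} ms")
```

Measured from the client, the first-chunk time covers all four terms of the sum above, so compare it across regions and providers rather than reading it as gateway time alone.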

Private Agent Routes

Private routes use an outbound-only Olyx Agent for selected beta deployments. This path is useful when a workload needs to reach private models or internal tools, but its latency has a different shape from that of a direct public-provider call.

flowchart LR
  APP[APP]
  OLYX[OLYX]
  AGENT[OUTBOUND AGENT]
  PRIVATE[PRIVATE MODEL OR TOOL]
  TRACE[TRACE]
  APP --> OLYX
  OLYX --> AGENT
  AGENT --> PRIVATE
  PRIVATE --> AGENT
  AGENT --> OLYX
  OLYX --> APP
  OLYX --> TRACE

For private routes, measure the public and private paths separately. The agent path includes agent connection health, queue depth, local network time, and private model startup behavior.

Segment	What affects it
Olyx to agent dispatch	Agent connection health, queue depth, and region distance
Agent to private model/tool	Local network path, private service load, and cold starts
Agent result return	Response size and outbound network path

For sensitive workloads, correctness and network posture may matter more than matching public-provider latency. The closed-beta expectation is visibility, not a hard latency promise.
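Keeping the two paths separate is mostly a matter of tagging each sample with its route before you aggregate. The following sketch assumes you already collect `(route, latency_ms)` pairs yourself; the `public` and `private` labels, the helper name, and the numbers are illustrative assumptions, not values reported by Olyx.

```python
import statistics
from collections import defaultdict


def summarize_by_route(samples):
    """Group (route, latency_ms) pairs and report runs, median, and max per route."""
    grouped = defaultdict(list)
    for route, latency_ms in samples:
        grouped[route].append(latency_ms)
    return {
        route: {
            "runs": len(values),
            "median_ms": statistics.median(values),
            "max_ms": max(values),
        }
        for route, values in grouped.items()
    }


# Made-up samples for illustration only.
samples = [
    ("public", 800),
    ("private", 1400),
    ("public", 900),
    ("private", 1250),
    ("private", 2100),
]

report = summarize_by_route(samples)
for route, stats in report.items():
    print(route, stats)
```

Reporting the maximum next to the median makes private-model cold starts visible, which an average over both routes would hide.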

Load Testing

Load test the same path your application will use in production. During closed beta, use a dedicated test project and dedicated API keys. Do not run synthetic load against production keys or customer traffic until beta limits are confirmed.

Create one test key per worker in the dashboard or through the API (see the API reference), then pass those keys to the load runner. The example below uses the Python SDK so the test exercises the same execute and trace-completion path as application code.

import concurrent.futures
import os
import statistics
import time
import olyx

API_KEYS = [
    key.strip()
    for key in os.environ["OLYX_LOAD_TEST_KEYS"].split(",")
    if key.strip()
]

BASE_URL = os.environ.get("OLYX_GATEWAY_URL", "https://gateway.olyx.ai")


def run_once(index: int):
    key = API_KEYS[index % len(API_KEYS)]
    client = olyx.Olyx(api_key=key, base_url=BASE_URL, mock=False)

    trace = client.traces.create(
        metadata={"source": "sdk_load_test", "worker": index}
    )

    started = time.perf_counter()
    try:
        client.execute(trace_id=trace.id, input="Return the word ok.")
    finally:
        client.traces.complete(trace.id)
    # Stop the clock here so wall_ms covers execute and completion only,
    # not the summary read that follows.
    wall_ms = round((time.perf_counter() - started) * 1000, 2)

    details = client.traces.find(trace.id)
    summary = details.summary or {}

    return {
        "wall_ms": wall_ms,
        "total_latency_ms": summary.get("total_latency_ms"),
        "chain_depth": summary.get("chain_depth"),
        "tool_overhead_ms": summary.get("tool_overhead_ms"),
    }


with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(run_once, range(50)))

wall_times = [row["wall_ms"] for row in results]
print({
    "runs": len(results),
    "avg_wall_ms": round(statistics.mean(wall_times), 2),
    # Nearest-rank p95: rank = ceil(0.95 * n), computed with integer math.
    "p95_wall_ms": sorted(wall_times)[(95 * len(wall_times) + 99) // 100 - 1],
})

This script is intentionally small. It is good for a closed-beta smoke load test, not a substitute for a full production capacity plan.

What to Watch

After a load test, inspect completed traces rather than relying on a single aggregate number. A healthy run has stable latency, no unexpected gateway errors, and no circuit-breaker trips from accidental single-key bursts.

Metric	Investigate when
Total trace latency	p95 grows without any application, provider, or region change
Tool overhead	Tool-heavy traces dominate end-to-end latency
Chain depth	Agents repeatedly call tools when a single call should be enough
Unexpected 5xxs	Errors are coming from the gateway path rather than a provider or private tool
Loop detection	Synthetic load is concentrated on one key or the workflow is looping

If you intentionally run bursts from a single key, use a test project and reset the project settings afterward. For normal load tests, one key per worker gives cleaner results and avoids confusing the loop detector.
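To apply the total trace latency check from the table above, compare the p95 of a new run against a stored baseline. This is a minimal sketch: `p95` and `p95_regressed` are hypothetical helpers, and the 10% tolerance is an arbitrary assumption you should tune to the noise in your own environment.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # rank = ceil(0.95 * n) as a 1-based rank, using integer math.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_regressed(baseline, current, tolerance=0.10):
    """True when the current p95 exceeds the baseline p95 by more than `tolerance`."""
    return p95(current) > p95(baseline) * (1 + tolerance)


# Made-up latencies in milliseconds, for illustration only.
baseline = list(range(100, 120))
current = [value + 30 for value in baseline]

print("baseline p95:", p95(baseline))
print("current p95:", p95(current))
print("regressed:", p95_regressed(baseline, current))
```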