Radiator Server Documentation — v10.33.1

Service Level Objective

How to configure service-level-objective blocks to enable automatic circuit breaking and degradation for backend servers.

A service-level-objective block configures automatic health monitoring for a backend server. When a server violates the configured thresholds, Radiator marks it as degraded and routes requests to healthier servers. Once the server recovers, traffic returns automatically.

Quick start example

When server-selection is configured, Radiator automatically applies a default service-level-objective to every server that does not have an explicit block:

backends {
    postgres "USERS" {
        server-selection fallback;

        server "primary" {
            host "pg-primary.example.com";
            database "radiator";
            username "radiator";
            connections 10;
        }

        server "replica" {
            host "pg-replica.example.com";
            database "radiator";
            username "radiator";
            connections 10;
        }

        query "FIND_USER" { ... }
    }
}

Both servers above automatically receive:

service-level-objective {
    failure-rate 3/5;
    initial-backoff-period 3s;
    max-backoff-period 30s;
    recovery-probe-count 2;
}

Add an explicit service-level-objective block to a server to override these defaults. An explicit block only needs to specify the fields that differ; omitted fields keep their default values.
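For example, to require three consecutive successful probes on the "replica" server from the quick start while keeping the default failure rate and backoff, only that one field needs to appear (a sketch; the block is nested inside the server block as described above):

server "replica" {
    host "pg-replica.example.com";
    database "radiator";
    username "radiator";
    connections 10;

    service-level-objective {
        recovery-probe-count 3;
    }
}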

Servers without server-selection have no automatic SLO. Add an explicit service-level-objective block if health monitoring is needed for a single-server backend.
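A sketch of a single-server backend with an explicit block (the backend, server, and query names are illustrative; the SLO values shown are the defaults and can be tuned):

backends {
    postgres "BILLING" {
        server "primary" {
            host "pg-billing.example.com";
            database "radiator";
            username "radiator";
            connections 10;

            service-level-objective {
                failure-rate 3/5;
                initial-backoff-period 3s;
                max-backoff-period 30s;
                recovery-probe-count 2;
            }
        }

        query "FIND_INVOICE" { ... }
    }
}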

Parameters

failure-rate

failure-rate <failures>/<window_size>;

Defines a sliding window failure rate. Radiator tracks the last window_size request outcomes (successes and errors) for this server. If the number of failures reaches failures, the server is marked degraded.

  • failures must be between 1 and window_size (inclusive).
  • Timeouts and connection errors both count as failures. Pool-exhaustion events (all connections busy) do not count as failures — they are tracked separately via the PoolExhausted counter and do not affect SLO health.
  • The window fills incrementally. No violation triggers until the window has window_size total outcomes.
  • failure-rate 1/5 means any single failure in the last 5 requests degrades the server.
  • failure-rate 5/5 means all 5 of the last 5 requests must fail before degrading.

Default: failure-rate 3/5. A server is marked degraded when 3 or more of the last 5 requests fail.
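For example, a more tolerant window that only degrades a server when a quarter of recent requests fail might look like (values illustrative):

service-level-objective {
    failure-rate 5/20;
}

With this setting the server is marked degraded only once 5 of the last 20 outcomes are failures; all other fields keep their default values.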

max-backoff-period

max-backoff-period <duration>;

Controls time-based exponential backoff for degraded servers. When a server is degraded, Radiator waits an increasing duration before retrying:

  1. Wait initial-backoff-period, then retry
  2. Wait double that duration, then retry
  3. Continue doubling, capping the wait at max-backoff-period

Accepts time units: 30s, 1m, 2h.

Set to 0s to disable backoff entirely — the degraded server is retried on every incoming request instead of waiting. See Disabling backoff.

Default: 30s. Backoff doubles from initial-backoff-period up to 30 seconds.

initial-backoff-period

initial-backoff-period <duration>;

Sets the starting wait time before the first retry of a degraded server. Subsequent retries double this value up to max-backoff-period.

Accepts time units: 1s, 500ms, 5s.

Default: 3s. The first backoff waits 3 seconds, then doubles.

recovery-probe-count

recovery-probe-count <count>;

Sets the number of consecutive successful probe requests required before a degraded server is considered recovered. Must be at least 1.

Default: 2. See How degradation works for the full recovery and backoff behaviour.

How degradation works

When a server's failure rate is reached, Radiator marks it as degraded and stops routing normal traffic to it. While degraded, the server is still retried periodically using real incoming requests that get routed to it: these are the probes. If other servers are available, most traffic continues to flow to them; the degraded server only sees the occasional probe until it recovers.

Backoff and recovery

Once degraded, Radiator waits initial-backoff-period before routing the first probe to the server. If the probe fails, the wait doubles for the next attempt, up to max-backoff-period. If the probe succeeds, the next probe is sent after initial-backoff-period again (not the doubled value). Once recovery-probe-count consecutive probes succeed, the server returns to normal and all backoff state resets.

Backoff timeline example

With initial-backoff-period 3s, max-backoff-period 30s, and recovery-probe-count 2:

Event                      Wait for next probe
------------------------   --------------------------------
Server degrades            3s
Probe fails                6s
Probe fails                12s
Probe fails                24s
Probe fails                30s (capped at max-backoff-period)
Probe succeeds (1 of 2)    3s (held at initial-backoff-period)
Probe succeeds (2 of 2)    recovered; no further probes

The stored backoff period is not reset until full recovery — if a probe succeeds and the next probe fails, the wait resumes from the last doubled value (30s in this example).

The wait only doubles when the failure count in the sliding window meets or exceeds failures. A probe failure that does not push the window into violation still resets the consecutive-success count but does not double the wait.
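The backoff arithmetic above can be sketched in Python. This is an illustration of the doubling, cap, and hold-at-initial rules, not Radiator's implementation; for simplicity it assumes every failed probe keeps the sliding window in violation, so each failure doubles the stored backoff:

```python
def probe_waits(outcomes, initial=3, maximum=30, probes_needed=2):
    """Return the wait (in seconds) before each probe for a degraded server.

    outcomes: probe results after degradation, True = success.
    Assumes every failed probe keeps the sliding window in violation.
    """
    waits = [initial]      # first probe waits initial-backoff-period
    backoff = initial      # stored backoff, kept until full recovery
    streak = 0             # consecutive successful probes
    for ok in outcomes:
        if ok:
            streak += 1
            if streak >= probes_needed:
                break              # recovered; all backoff state resets
            waits.append(initial)  # held at initial after a success
        else:
            streak = 0
            backoff = min(backoff * 2, maximum)  # double, capped at maximum
            waits.append(backoff)
    return waits

# Reproduces the timeline table above:
print(probe_waits([False, False, False, False, True, True]))
# → [3, 6, 12, 24, 30, 3]
```

A success followed by a failure resumes from the stored backoff rather than restarting the doubling, matching the note above: probe_waits([False]*4 + [True, False, True, True]) yields [3, 6, 12, 24, 30, 3, 30, 3].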

Monitoring

Each server emits counters when SLO violations occur. These counters appear in the management API under the server's namespace (e.g., backend/Postgres/USERS/primary/):

Counter                          Description
------------------------------   ----------------------------------------------------------
SLOFailureThresholdViolations    Incremented each time the server enters the degraded state
                                 due to a failure threshold violation.
SLORecovered                     Incremented each time the server recovers from the degraded
                                 state after a successful retry.
SLOStillFailing                  Incremented each time a degraded server is retried but
                                 still violates the SLO.

SLOFailureThresholdViolations only increments on the transition to degraded state. SLORecovered increments when a retry succeeds. SLOStillFailing increments when a retry of a degraded server results in another SLO violation.

Interaction with server selection policies

The SLO system works with all server selection policies:

  • fallback: Degraded primary is skipped; secondary becomes active. Primary recovers automatically when retries succeed.
  • round-robin: Degraded servers are excluded from the rotation. Traffic redistributes across healthy servers.
  • least-connections: Degraded servers are excluded from the pool sort. Healthy servers absorb the load.

When server-selection is configured, every server without an explicit service-level-objective block automatically receives the default SLO (failure-rate 3/5, initial-backoff-period 3s, max-backoff-period 30s, recovery-probe-count 2). For single-server backends (no server-selection), add an explicit service-level-objective block to enable health monitoring — without one, the server receives traffic unconditionally.
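Putting this together, a sketch of a round-robin backend where one server overrides only the failure rate (names and hosts are illustrative). The first server carries no explicit block and therefore receives the default SLO automatically:

backends {
    postgres "USERS" {
        server-selection round-robin;

        server "a" {
            host "pg-a.example.com";
            database "radiator";
            username "radiator";
            connections 10;
        }

        server "b" {
            host "pg-b.example.com";
            database "radiator";
            username "radiator";
            connections 10;

            service-level-objective {
                failure-rate 1/5;
            }
        }

        query "FIND_USER" { ... }
    }
}

Here server "b" degrades on any single failure in its last 5 requests, while server "a" uses the default 3/5 threshold; both use the default backoff and recovery settings.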

Disabling backoff

To retry a degraded server on every incoming request instead of waiting between retries, set max-backoff-period to 0s:

service-level-objective {
    failure-rate 3/5;
    max-backoff-period 0s;
}

The server is still marked degraded when the failure rate is reached, but every request will attempt to use the degraded server instead of waiting. This can be useful when the backend has its own health checks or when immediate retry is preferred over exponential backoff.

Omitting max-backoff-period does not disable backoff — it uses the default value of 30s. You must explicitly set 0s to disable it.

Disabling SLO

For a single-server backend (no server-selection), omitting the service-level-objective block is enough — without an explicit block, no SLO is applied and the server receives traffic unconditionally.

For a server inside a server-selection backend, the default SLO is injected automatically and there is no per-server opt-out via configuration. To avoid SLO on a specific server, restructure the backend to not use server-selection.