Scaling Guide

Sharding Strategy

Meridian routes inference traffic across hundreds of model deployments. A deliberate sharding strategy keeps tail latency low, isolates noisy tenants, and lets a single gateway absorb spikes without hot-spotting any one upstream. This guide walks through how we partition workload across regions, accounts, and deployments.

1. Tenant-Aware Hash Sharding

Each request carries a tenant key derived from the API token. We hash it with a stable function (xxhash64) and map it to a shard bucket. Buckets are reassigned only on topology changes, which keeps cache affinity high and avoids reshuffling traffic when a single deployment is rotated.

  • Stable mapping survives gateway restarts.
  • Hot tenants get pinned to dedicated buckets.
  • Bucket weights favor high-priority customers.
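
The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not Meridian's implementation: the bucket counts, the pin table, and the function names are all hypothetical, and hashlib.blake2b stands in for xxhash64, which is not part of the Python standard library. Python's built-in hash() would not work here because it is salted per process for strings, so its output changes across restarts.

```python
import hashlib

# Hypothetical sizes: the guide does not state how many buckets exist.
SHARED_BUCKETS = 60      # buckets 0..59 are shared by ordinary tenants
DEDICATED_BUCKETS = 4    # buckets 60..63 are reserved for pinned tenants

# Hypothetical pin table mapping hot tenants to dedicated buckets.
PINNED_BUCKETS = {"tenant-hot-001": 60}


def stable_hash64(key: str) -> int:
    """Return a 64-bit hash that is identical across processes and restarts.

    Stand-in for xxhash64 (an assumption made to keep the sketch stdlib-only).
    """
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bucket_for(tenant_key: str) -> int:
    """Map a tenant key to a shard bucket, honoring pins first."""
    if tenant_key in PINNED_BUCKETS:
        return PINNED_BUCKETS[tenant_key]
    # Unpinned tenants only ever land in the shared range, so a
    # dedicated bucket never receives traffic from other tenants.
    return stable_hash64(tenant_key) % SHARED_BUCKETS


if __name__ == "__main__":
    for tenant in ("tenant-a", "tenant-b", "tenant-hot-001"):
        print(tenant, "->", bucket_for(tenant))
```

Because the function depends only on the tenant key and fixed constants, a restarted gateway computes the same bucket for the same tenant. In this sketch the bucket count stays fixed, and a topology change would alter which deployment serves a bucket rather than which bucket a tenant hashes to.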

2. Region & Capacity Pools

Shards are grouped into capacity pools by region. The router picks the closest healthy pool and falls through to neighboring pools when that pool's error budget is exhausted. Pool health is tracked over a rolling window of latency, 5xx ratio, and outstanding queue depth.

pools:
  swedencentral:
    weight: 1.0
    deployments: [gpt-4.1, gpt-4o, llama4-mav]
  eastus2:
    weight: 0.7
    deployments: [gpt-4.1, deepseek-v3-2, kimi-k2-6]
  failover:
    weight: 0.1
    deployments: [gpt-4o-mini, llama3-70b]
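
As a rough sketch of the health check and fall-through described above, the Python below keeps a rolling window per pool and walks a preference list until it finds a healthy pool. The class and function names, the window size, and every threshold are assumptions made for illustration; the guide does not specify them. The pool names come from the config above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional, Tuple


@dataclass
class PoolHealth:
    """Rolling window of recent request outcomes for one pool."""

    window: int = 100  # hypothetical: number of samples retained
    queue_depth: int = 0  # outstanding requests, updated by the caller
    samples: Deque[Tuple[float, bool]] = field(default_factory=deque)

    def record(self, latency_ms: float, is_5xx: bool) -> None:
        """Add one observation and evict the oldest beyond the window."""
        self.samples.append((latency_ms, is_5xx))
        if len(self.samples) > self.window:
            self.samples.popleft()

    def error_ratio(self) -> float:
        if not self.samples:
            return 0.0
        return sum(1 for _, err in self.samples if err) / len(self.samples)

    def mean_latency_ms(self) -> float:
        if not self.samples:
            return 0.0
        return sum(lat for lat, _ in self.samples) / len(self.samples)

    def is_healthy(
        self,
        max_error_ratio: float = 0.05,   # hypothetical threshold
        max_latency_ms: float = 2000.0,  # hypothetical threshold
        max_queue_depth: int = 50,       # hypothetical threshold
    ) -> bool:
        return (
            self.error_ratio() <= max_error_ratio
            and self.mean_latency_ms() <= max_latency_ms
            and self.queue_depth <= max_queue_depth
        )


def pick_pool(
    preference: List[str], health: Dict[str, PoolHealth]
) -> Optional[str]:
    """Return the first healthy pool in preference order, or None."""
    for name in preference:
        if health[name].is_healthy():
            return name
    return None


if __name__ == "__main__":
    order = ["swedencentral", "eastus2", "failover"]  # closest first
    health = {name: PoolHealth() for name in order}
    for _ in range(10):
        health["swedencentral"].record(latency_ms=120.0, is_5xx=True)
    print(pick_pool(order, health))
```

The preference list is assumed to be ordered by proximity to the caller. A production router would likely also apply the pool weights from the config, which this sketch leaves out.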

3. Rebalancing & Drain

When a deployment is rotated, we drain it by lowering its bucket weight rather than yanking it out. New requests bleed away to siblings while in-flight calls finish on the original. Once the drain window closes, the bucket is reassigned and the deployment is recycled for upgrades.

  • Drain windows default to 90 seconds.
  • Weight changes propagate via the control plane.
  • No client-visible disconnects during rotation.
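
One way to picture the drain is as a weight that ramps down over the drain window while new requests are chosen by weighted random selection. The sketch below assumes a linear ramp, which the guide does not specify, and uses hypothetical function names. The 90-second default matches the drain window stated above.

```python
import random
from typing import Dict


def drain_weight(
    initial_weight: float, elapsed_s: float, drain_window_s: float = 90.0
) -> float:
    """Ramp a weight linearly from its initial value to zero.

    The linear shape is an assumption; any monotonically
    decreasing schedule would serve the same purpose.
    """
    if elapsed_s <= 0:
        return initial_weight
    if elapsed_s >= drain_window_s:
        return 0.0
    return initial_weight * (1.0 - elapsed_s / drain_window_s)


def pick_deployment(weights: Dict[str, float]) -> str:
    """Choose a deployment for a NEW request, proportional to weight.

    Deployments at weight zero receive no new requests. In-flight
    calls are not touched here, so they finish where they started.
    """
    candidates = {name: w for name, w in weights.items() if w > 0}
    if not candidates:
        raise RuntimeError("no deployment with positive weight")
    names = list(candidates)
    return random.choices(
        names, weights=[candidates[n] for n in names], k=1
    )[0]


if __name__ == "__main__":
    # Halfway through the window, the draining deployment keeps half
    # of its original weight; its siblings absorb the difference.
    weights = {
        "draining": drain_weight(1.0, elapsed_s=45.0),
        "sibling-a": 1.0,
        "sibling-b": 1.0,
    }
    print(weights["draining"], pick_deployment(weights))
```

Lowering the weight gradually, instead of removing the deployment at once, means siblings pick up load in steps rather than in one jump.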

Need help mapping your workload? Reach the Meridian team via the docs index.