Modelplane Modelplane docs

How Modelplane works

Modelplane runs as a control plane on its own cluster, the control cluster, above the inference clusters that actually serve models. It’s built on Crossplane: platform teams and developers describe what they want as Kubernetes resources, and Modelplane continuously reconciles the fleet to match, composing the clusters, scheduling replicas, and exposing endpoints. This page is the full tour. It covers the architecture and resources, then walks through what happens when you deploy a model.

Modelplane API

Modelplane’s API is two sets of resources, one per team, with everything in between filled in for you. Platform teams describe the fleet, ML teams describe a model, and Modelplane composes the rest.

The hierarchy mirrors Kubernetes core one scope up: ModelDeploymentModelReplicaModelServiceModelEndpoint parallels DeploymentPodServiceEndpoint, across a fleet instead of within a single cluster.

What the control plane reconciles

Once the resources exist, Modelplane keeps the fleet matching them. Five concerns run continuously:

  1. Provisioning. From an InferenceCluster, Modelplane creates a full cluster and its GPU node pools, or brings in a cluster you already run on any Kubernetes, and installs the serving stack on each.
  2. Scheduling. A two-level scheduler places work: it pins each ModelReplica to a cluster and pool whose hardware meets the model’s requirements, then the cluster’s own scheduler binds the GPUs to the serving pods through DRA.
  3. Autoscaling. Replicas are the scaling axis. Scaling a ModelDeployment’s spec.replicas adds or removes whole serving instances through the standard Kubernetes scale subresource, so kubectl scale or a KEDA ScaledObject work out of the box.
  4. Routing. A ModelService exposes one OpenAI-compatible endpoint through the gateway and load-balances across the deployment’s ModelEndpoints, wherever their replicas run. ModelEndpoints can also point at external inference services.
  5. Caching. A ModelCache stages model weights on cluster storage once, so serving pods read them locally instead of re-downloading on every start.

Universal compatibility

Modelplane is deliberately unopinionated about the engine. A ModelDeployment describes the shape of a deployment, how many pods, on how many nodes, with which devices, and nothing about how the engine runs internally. The engine flags you write carry parallelism (tensor, pipeline, data, expert), quantization, and KV transfer; Modelplane never injects them.

This is what lets one API serve any container-based engine and any topology without special cases. Modelplane composes the engine onto the right cluster resource and injects almost nothing, just the address a multi-node leader is reachable at, so a worker can join it. New engines and new parallelism strategies work without a change to Modelplane. The community publishes recipes (worked, copyable manifests) to bridge the gap that flexibility leaves, rather than hard-coding choices into the API.

Fleet scheduler

For each replica, the scheduler picks a (cluster, pool) in two steps:

  1. Filter clusters by clusterSelector.matchLabels against the standard Kubernetes labels on each InferenceCluster, the organizational metadata: tier, region, provider, compliance posture.
  2. Filter pools by matching each device request in the deployment’s nodeSelector.devices against the pool’s InferenceClass. A request is based on DRA: a count and CEL selectors over a device’s attributes and capacity, like “a GPU with at least 141Gi of memory.” A pool fits when it has the devices the model asks for and enough free nodes to hold a replica.

Capacity is accounted at the node level across the fleet, so Modelplane never overcommits a pool. Replicas are pinned to their cluster once placed and stay there across reconciles; if a cluster is deleted, the scheduler re-places its replicas elsewhere. How it schedules covers the placement rules and their limits in full.

Deploying a model

Creating a ModelDeployment kicks off the loop end to end. The scheduler discovers the ready clusters (filtered by your label selector if you set one), matches each engine’s device requests against their pools, and pins each replica to a cluster that fits. Modelplane composes a ModelReplica on each chosen cluster, turns it into the right serving workload there, creates a ModelEndpoint per replica, and your ModelService routes traffic across them through one stable endpoint on the gateway. Scale the deployment up or down and the same loop re-converges.

Serving topologies

A single-node deployment composes to a Kubernetes Deployment fronted by a service. When a model is too large for one node, an engine becomes a gang: a Leader member and one or more Worker members that Modelplane composes into a LeaderWorkerSet, serving the model together across nodes. Gang deployments should stage their weights through a ModelCache, so the pods share one copy instead of each pulling the same model.

Disaggregated serving splits prefill and decode into separate engines (serving.mode: PrefillDecode) that run on the same cluster and hand off the KV cache between them. Modelplane wires up the cluster-edge routing that pairs each request’s prefill and decode; the engines carry the KV-transfer flags. Both are described in full in the model deployment docs.

Next steps