ModelDeployment Custom Resource
Deploy a model to the fleet, from a single pod to disaggregated prefill and decode.
Concept guide: Deploy a Model
#Metadata
#Example
Manifest
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  replicas: 2
  engines:
    - name: qwen3-8b
      members:
        - role: Standalone
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - --model=Qwen/Qwen3-8B
#Spec
Optional label selector to filter InferenceClusters. If omitted, all ready clusters are candidates.
How many identical copies of this engine to run per ModelReplica. The number is fixed and sized once per deployment; scaling happens by adding ModelReplicas (spec.replicas), never by changing the copy count. Maps to the replica count of the composed Deployment or LeaderWorkerSet. Defaults to 1.
Identifies the engine within the deployment. Becomes part of the composed workload names, so it must be a DNS label.
The engine’s phase in a PrefillDecode deployment: either Prefill or Decode. Set this only when serving.mode is PrefillDecode, where exactly one engine takes each phase.
Reference to a ModelCache in the same namespace. Optional for single-node engines; required for any engine that spans multiple nodes (a Leader with one or more Workers), since every pod in the gang mounts it.
ModelCache resource name in the same namespace.
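The ModelCache requirement above can be expressed as a small check. This is a minimal sketch, not the controller's validation logic: members are plain dicts with a role key taken from the manifest, and the model_cache_name parameter is a hypothetical stand-in for the referenced ModelCache name.

```python
def spans_multiple_nodes(members):
    """True if the engine is a Leader with one or more Workers."""
    # A member with no role defaults to Standalone, as documented.
    roles = [m.get("role", "Standalone") for m in members]
    return "Leader" in roles and "Worker" in roles


def check_model_cache(members, model_cache_name=None):
    """Return an error string if a required ModelCache is missing, else None.

    model_cache_name is a hypothetical parameter for this sketch.
    """
    if spans_multiple_nodes(members) and not model_cache_name:
        return "multi-node engine requires a ModelCache reference"
    return None
```

A single-node engine passes with or without a cache; a Leader with Workers passes only when a cache name is given.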
How many ModelReplicas to fan out to. Each replica is a complete serving instance scheduled to one InferenceCluster.
How the deployment is served from the cluster edge to its engines. Unified (the default when omitted) fronts the engines with a Service. PrefillDecode serves prefill and decode from the two engines that mark those phases, with inference-aware routing that sequences prefill, then decode.
Unified serves prefill and decode on one engine. PrefillDecode splits them across two engines, each marking its phase as Prefill or Decode.
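The relationship between the serving mode and the engine phases can be sketched as a validation function. This is an illustration under stated assumptions, not the API server's admission logic: the function name and its inputs are hypothetical, and rejecting a phase under Unified is one reading of "set only when serving.mode is PrefillDecode".

```python
from collections import Counter


def validate_serving(mode, engine_phases):
    """Return a list of problems; an empty list means the phases fit the mode.

    mode: "Unified" or "PrefillDecode"; None is treated as Unified (the default).
    engine_phases: one entry per engine, "Prefill", "Decode", or None if unset.
    """
    mode = mode or "Unified"
    counts = Counter(p for p in engine_phases if p is not None)
    problems = []
    if mode == "Unified":
        # Assumption: a phase on any engine is an error under Unified.
        if counts:
            problems.append("phase is set but serving mode is Unified")
    elif mode == "PrefillDecode":
        # Exactly one engine must take each phase.
        for phase in ("Prefill", "Decode"):
            if counts[phase] != 1:
                problems.append(
                    f"expected exactly one {phase} engine, found {counts[phase]}"
                )
    else:
        problems.append(f"unknown serving mode: {mode}")
    return problems
```

For example, validate_serving("PrefillDecode", ["Prefill", "Decode"]) returns an empty list, while two Prefill engines produce one problem per phase.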
#Status
#Subresources
#EngineMember object
One member of an engine’s gang — a role (Standalone, Leader, or Worker) with its hardware and pod template.
The per-node device request for this member’s pods: the devices each pod needs from its node. The scheduler matches the request against a candidate pool’s InferenceClass devices (surfaced on InferenceCluster status.gpuPools) and places the member on a pool that satisfies it. It prefers one pool for the whole engine and splits members across pools only when no single pool satisfies them all. Requests for claim: DRA devices also become DeviceRequests in the ResourceClaim through which the member’s pods bind GPUs. For a GPU request, count is the number of GPUs per node. If the nodeSelector is omitted, the member claims no devices and schedules onto its engine’s pool, as a coordinator-only leader does. At least one member per engine must carry a nodeSelector, and at least one member’s requests must resolve to a claimable (claim: DRA) device. An engine that matches only synthetic devices leaves its pods nothing to claim, so the scheduler treats such a pool as ineligible and the deployment reports InsufficientCapacity.
The member’s role in the engine. Standalone is a lone pod; a Leader coordinates and serves while its Workers join it. Defaults to Standalone.
Pod template for this member’s engine pods. A curated subset of PodTemplateSpec.
Settings for a Worker member. Valid only on a member whose role is Worker.
How many nodes this member spans, which determines how big the engine is. Each node runs one worker pod, so the engine’s gang spans 1 pod (the Leader) plus this many worker pods. Defaults to 1.
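The sizing rules above combine into simple arithmetic: a gang is one pod per Standalone or Leader member plus one pod per worker node, multiplied by the engine's copies and by the number of ModelReplicas. The sketch below is an illustration of that reading, not output from the controller. The dict keys "nodes" and "copies" are hypothetical placeholders, since this reference does not show the actual field names.

```python
def gang_size(members):
    """Pods in one copy of an engine."""
    total = 0
    for m in members:
        role = m.get("role", "Standalone")  # documented default
        if role == "Worker":
            # Hypothetical key "nodes": one worker pod per node, default 1.
            total += m.get("nodes", 1)
        else:
            # A Standalone or Leader member is a single pod.
            total += 1
    return total


def total_pods(replicas, engines):
    """Engine pods across all ModelReplicas of a deployment."""
    # Hypothetical key "copies": identical copies per ModelReplica, default 1.
    per_replica = sum(e.get("copies", 1) * gang_size(e["members"]) for e in engines)
    return replicas * per_replica
```

With the manifest in the Example section (two replicas, one Standalone member), total_pods returns 2. A Leader with a Worker spanning 3 nodes gives a gang of 4 pods; at 3 copies and 2 replicas that is 24 pods.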
#NodeSelector object
The per-node device request for this member’s pods: the devices each pod needs from its node. The scheduler matches the request against a candidate pool’s InferenceClass devices (surfaced on InferenceCluster status.gpuPools) and places the member on a pool that satisfies it. It prefers one pool for the whole engine and splits members across pools only when no single pool satisfies them all. Requests for claim: DRA devices also become DeviceRequests in the ResourceClaim through which the member’s pods bind GPUs. For a GPU request, count is the number of GPUs per node. If the nodeSelector is omitted, the member claims no devices and schedules onto its engine’s pool, as a coordinator-only leader does. At least one member per engine must carry a nodeSelector, and at least one member’s requests must resolve to a claimable (claim: DRA) device. An engine that matches only synthetic devices leaves its pods nothing to claim, so the scheduler treats such a pool as ineligible and the deployment reports InsufficientCapacity.
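The placement behavior described above can be approximated in a few lines. This is a heavily simplified sketch, not the scheduler's implementation: it matches requests to pool devices by name and count only (the real match uses CEL selectors), the pool and member dict shapes are hypothetical, and any claim value other than "DRA" stands in for a synthetic device. Returning None models the InsufficientCapacity outcome.

```python
def satisfies(pool, member):
    """True if every device request of the member fits on one node of the pool."""
    for req in member.get("devices", []):
        dev = pool["devices"].get(req["name"])
        if dev is None or dev["count"] < req["count"]:
            return False
    return True  # a member with no requests fits any pool


def claims_something(pool, member):
    """True if one of the member's requests resolves to a claim: DRA device."""
    return any(
        pool["devices"].get(req["name"], {}).get("claim") == "DRA"
        for req in member.get("devices", [])
    )


def place_engine(members, pools):
    """Map each member index to a pool name, or return None (InsufficientCapacity)."""
    if not any(m.get("devices") for m in members):
        raise ValueError("at least one member must carry a device request")

    def claimable(placement):
        return any(
            claims_something(pools[placement[i]], m) for i, m in enumerate(members)
        )

    # Preferred: one pool that satisfies the whole engine.
    for name, pool in pools.items():
        placement = {i: name for i in range(len(members))}
        if all(satisfies(pool, m) for m in members) and claimable(placement):
            return placement

    # Fallback: split members across pools, first satisfying pool per member.
    placement = {}
    for i, m in enumerate(members):
        match = next((n for n, p in pools.items() if satisfies(p, m)), None)
        if match is None:
            return None
        placement[i] = match
    return placement if claimable(placement) else None
```

In this model a coordinator-only Leader with no requests lands on whichever pool satisfies its Workers, and an engine whose only matches are synthetic devices gets no placement.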
How many matching devices a node must have. For a GPU request this is the per-node GPU count.
Name of this request. Mirrors a DRA DeviceRequest name; carried through to the ResourceClaim.
A DRA CEL expression evaluated against one device. Reads device.driver, device.attributes["
#PodTemplate object
Pod template for this member’s engine pods. A curated subset of PodTemplateSpec.
Metadata applied to the member’s pods. Useful for labels and annotations that control cluster-level features like service mesh injection.
Pod spec for this member’s engine pods.
Reference a pod field via the downward API, e.g. status.podIP, metadata.name, or metadata.namespace.
Container image.
Container name. The container named “engine” is the inference engine.