Modelplane docs

ModelDeployment Custom Resource

Deploy a model to the fleet, from a single pod to disaggregated prefill and decode.

Concept guide: Deploy a Model →

#Metadata

API version: modelplane.ai/v1alpha1
Kind: ModelDeployment
Scope: Namespaced
Short names: md

#Example

Manifest
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  replicas: 2
  engines:
    - name: qwen3-8b
      members:
        - role: Standalone
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - --model=Qwen/Qwen3-8B

#Spec

# clusterSelector optional object

Optional label selector to filter InferenceClusters. If omitted, all ready clusters are candidates.

# clusterSelector.matchLabels optional map[string]string
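
As a sketch, a deployment restricted to clusters carrying a particular label might use the fragment below. The label key and value are hypothetical; substitute whatever labels your InferenceClusters actually carry.

```yaml
spec:
  clusterSelector:
    matchLabels:
      region: us-east   # hypothetical label on the target InferenceClusters
```

Leaving clusterSelector out entirely keeps every ready cluster in the candidate set.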
# engines required object[] 1–8 items
# engines[].copies optional integer 1–64 default: 1

How many identical copies of this engine to run per ModelReplica. A fixed number, sized once per deployment; scaling happens by adding ModelReplicas (spec.replicas), never by varying copies. Maps to the composed Deployment’s or LeaderWorkerSet’s replica count. Defaults to 1.
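
To illustrate how copies combines with spec.replicas, the fragment below is a sketch that reuses the engine name from the example manifest and omits the member's nodeSelector and template for brevity. It asks for three ModelReplicas with two copies of the engine in each, so 3 × 2 = 6 engine copies run across the fleet.

```yaml
spec:
  replicas: 3            # three ModelReplicas, each scheduled to one InferenceCluster
  engines:
    - name: qwen3-8b
      copies: 2          # two identical copies of this engine inside every ModelReplica
      members:
        - role: Standalone
          # nodeSelector and template omitted; see the example manifest
```

To serve more traffic, raise replicas rather than copies.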

# engines[].members required EngineMember[] → 1–2 items
# engines[].name required string 1–63 chars

Identifies the engine within the deployment. Becomes part of the composed workload names, so it must be a DNS label.

# engines[].phase optional enum: Prefill | Decode

The engine’s phase in a PrefillDecode deployment, Prefill or Decode. Set only when serving.mode is PrefillDecode, where exactly one engine takes each phase.

# modelCacheRef optional object

Reference to a ModelCache in the same namespace. Optional for single-node engines; required for any engine that spans multiple nodes (a Leader with one or more Workers), since every pod in the gang mounts it.

# modelCacheRef.name required string ≥ 1 chars

ModelCache resource name in the same namespace.

# replicas required integer 1–10

How many ModelReplicas to fan out to. Each replica is a complete serving instance scheduled to one InferenceCluster.

# serving optional object

How the deployment is served from the cluster edge to its engines. Unified (the default, also used when serving is omitted) fronts the engines with a Service. PrefillDecode serves prefill and decode from the two engines that mark those phases, with inference-aware routing that sequences prefill, then decode.

# serving.mode required enum: Unified | PrefillDecode default: Unified

Unified serves prefill and decode on one engine. PrefillDecode splits them across two engines, each marking its phase as Prefill or Decode.
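
A disaggregated deployment could be sketched as follows. This is an illustration built from the fields described above, not a tested configuration: the engine names prefill and decode are hypothetical, the device request is copied from the example manifest, and any engine flags needed to transfer KV cache between the two phases are deliberately left out.

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b-pd            # hypothetical name
  namespace: ml-team
spec:
  replicas: 1
  serving:
    mode: PrefillDecode        # exactly one engine must take each phase
  engines:
    - name: prefill            # hypothetical engine name (must be a DNS label)
      phase: Prefill
      members:
        - role: Standalone
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - --model=Qwen/Qwen3-8B
    - name: decode             # hypothetical engine name
      phase: Decode
      members:
        - role: Standalone
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - --model=Qwen/Qwen3-8B
```

In Unified mode the phase fields would be dropped, since phase is set only when serving.mode is PrefillDecode.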

#Status

# replicas optional object
# replicas.ready optional integer
# replicas.total optional integer
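
For orientation, a status block on a deployment might look like the sketch below. The values are illustrative only, and reading them as "two replicas observed, one of them ready" is an assumption based on the field names, since the fields carry no description here.

```yaml
status:
  replicas:
    ready: 1    # illustrative value
    total: 2    # illustrative value; the scale subresource reads this path
```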

#Subresources

# scale subresource

ModelDeployment exposes a scale subresource, so kubectl scale and replica-based autoscalers can drive it.

# specReplicasPath .spec.replicas

JSONPath to the desired replica count in spec.

# statusReplicasPath .status.replicas.total

JSONPath to the observed replica count in status.
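
In a CustomResourceDefinition, these two paths are declared in the served version's subresources block. The stanza below is a sketch assuming the standard apiextensions.k8s.io/v1 layout; the rest of the CRD is omitted and is not taken from the Modelplane source.

```yaml
# Fragment of a CRD version entry (spec.versions[]); surrounding fields omitted.
subresources:
  scale:
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas.total
```

Because no label selector path is listed, autoscalers that need to select the underlying pods from the scale object may not work against this resource.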


#EngineMember object

One member of an engine’s gang — a role (Standalone, Leader, or Worker) with its hardware and pod template.

# nodeSelector optional NodeSelector →

The per-node device request for this member’s pods: the devices each pod needs from its node. The scheduler matches the request against a candidate pool’s InferenceClass devices (surfaced on InferenceCluster status.gpuPools) and places the member on a pool that satisfies it. It prefers one pool for the whole engine and splits members across pools only when no single pool satisfies them all.

Requests that resolve to claim: DRA devices also become DeviceRequests in the ResourceClaim through which the member’s pods bind GPUs. For a GPU request, count is the number of GPUs per node.

If nodeSelector is omitted, the member claims no devices and schedules onto its engine’s pool, as a coordinator-only leader does. At least one member per engine must carry a nodeSelector, and at least one member’s requests must resolve to a claimable (claim: DRA) device. An engine that matches only synthetic devices leaves its pods nothing to claim, so the scheduler treats such a pool as ineligible and the deployment reports InsufficientCapacity.

# role optional enum: Standalone | Leader | Worker default: Standalone

The member’s role in the engine. Standalone is a lone pod; a Leader coordinates and serves while its Workers join it. Defaults to Standalone.

# template required PodTemplate →

Pod template for this member’s engine pods. A curated subset of PodTemplateSpec.

# worker optional object

Settings for a Worker member. Valid only on a member whose role is Worker.

# worker.nodes required integer 1–63

The number of nodes this member spans, which sets the size of the engine. Each node runs one worker pod, so the engine’s gang is one Leader pod plus this many worker pods. Defaults to 1.
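
A multi-node engine could be sketched as a Leader plus a Worker member, as below. Several things here are assumptions: the resource and ModelCache names are hypothetical, modelCacheRef is placed at the top level of spec (which is what the field ordering in the Spec section suggests), the 8-GPU-per-node request is illustrative, and the engine flags that configure parallelism across nodes are omitted.

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b-multinode       # hypothetical name
  namespace: ml-team
spec:
  replicas: 1
  modelCacheRef:
    name: qwen3-8b-cache         # hypothetical ModelCache; required once an engine spans nodes
  engines:
    - name: qwen3-8b
      members:
        - role: Leader           # coordinates and serves
          nodeSelector:
            devices:
              - name: gpu
                count: 8         # GPUs per node (illustrative)
                selectors:
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - --model=Qwen/Qwen3-8B
        - role: Worker
          worker:
            nodes: 3             # three worker pods, so the gang is 1 Leader + 3 Workers = 4 pods
          nodeSelector:
            devices:
              - name: gpu
                count: 8
                selectors:
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - --model=Qwen/Qwen3-8B
```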


#NodeSelector object

The per-node device request for a member’s pods: the devices each pod needs from its node. How the scheduler matches the request against pools, which requests become DRA DeviceRequests, and what happens when the selector is omitted are described under the nodeSelector field of EngineMember.

# devices required object[] 1–16 items
# devices[].count optional integer 1–64 default: 1

How many matching devices a node must have. For a GPU request this is the per-node GPU count.

# devices[].name required string 1–63 chars

Name of this request. Mirrors a DRA DeviceRequest name; carried through to the ResourceClaim.

# devices[].selectors required object[] 1–8 items
# devices[].selectors[].cel optional string 1–10240 chars

A DRA CEL expression evaluated against one device. It can read device.driver, device.attributes["<domain>"].<name> (typed), and device.capacity["<domain>"].<name> (a Quantity), and can use the quantity() and semver() helpers, e.g. device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("141Gi")) >= 0.
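
The fragment below sketches a request that combines several selectors. It assumes that, as in Kubernetes DRA, a device must satisfy every selector in the list, and that the driver publishes a version-typed driverVersion attribute under the gpu.nvidia.com domain; both are assumptions to check against your cluster's published devices.

```yaml
nodeSelector:
  devices:
    - name: gpu
      count: 8
      selectors:
        # Match on the driver that published the device.
        - cel: |
            device.driver == "gpu.nvidia.com"
        # Require at least 141Gi of device memory (capacity values are Quantities).
        - cel: |
            device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("141Gi")) >= 0
        # Hypothetical attribute name: require a minimum driver version.
        - cel: |
            device.attributes["gpu.nvidia.com"].driverVersion.compareTo(semver("550.0.0")) >= 0
```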


#PodTemplate object

Pod template for this member’s engine pods. A curated subset of PodTemplateSpec.

# metadata optional object

Metadata applied to the member’s pods. Useful for labels and annotations that control cluster-level features like service mesh injection.

# metadata.annotations optional map[string]string
# metadata.labels optional map[string]string
# spec optional object

Pod spec for this member’s engine pods.

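# spec.containers required object[] exactly 1 item
# spec.containers[].args optional string[]
# spec.containers[].command optional string[]
# spec.containers[].env optional object[]
# spec.containers[].env[].name required string
# spec.containers[].env[].value optional string
# spec.containers[].env[].valueFrom optional object
# spec.containers[].env[].valueFrom.configMapKeyRef optional object
# spec.containers[].env[].valueFrom.configMapKeyRef.key required string
# spec.containers[].env[].valueFrom.configMapKeyRef.name required string
# spec.containers[].env[].valueFrom.configMapKeyRef.optional optional boolean
# spec.containers[].env[].valueFrom.fieldRef optional object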

Reference a pod field via the downward API, e.g. status.podIP, metadata.name, or metadata.namespace.

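# spec.containers[].env[].valueFrom.fieldRef.apiVersion optional string
# spec.containers[].env[].valueFrom.fieldRef.fieldPath required string
# spec.containers[].env[].valueFrom.secretKeyRef optional object
# spec.containers[].env[].valueFrom.secretKeyRef.key required string
# spec.containers[].env[].valueFrom.secretKeyRef.name required string
# spec.containers[].env[].valueFrom.secretKeyRef.optional optional boolean
# spec.containers[].image required string ≥ 1 chars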

Container image.

# spec.containers[].name required string ≥ 1 chars

Container name. The container named “engine” is the inference engine.

# spec.imagePullSecrets optional object[]
# spec.imagePullSecrets[].name required string
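
Putting the PodTemplate fields together, a member's template might look like the sketch below. The Secret, ConfigMap, and label names are hypothetical, and the environment variables are examples rather than anything Modelplane requires; the image and model argument come from the example manifest.

```yaml
template:
  metadata:
    labels:
      team: ml-platform                     # hypothetical label
    annotations:
      sidecar.istio.io/inject: "false"      # example: opt the pod out of Istio sidecar injection
  spec:
    imagePullSecrets:
      - name: registry-credentials          # hypothetical Secret in the same namespace
    containers:
      - name: engine                        # the container named "engine" is the inference engine
        image: vllm/vllm-openai:v0.23.0
        args:
          - --model=Qwen/Qwen3-8B
        env:
          - name: HF_TOKEN                  # read from a hypothetical Secret
            valueFrom:
              secretKeyRef:
                name: hf-credentials
                key: token
          - name: POD_IP                    # downward API: the pod's own IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: LOG_LEVEL                 # read from a hypothetical ConfigMap, tolerated if absent
            valueFrom:
              configMapKeyRef:
                name: engine-settings
                key: logLevel
                optional: true
```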