Deploy a Model
API: modelplane.ai/v1alpha1 · ModelDeployment
A ModelDeployment is the ML team’s primary interface. You describe the model
you want served, the hardware it needs, and how many copies to run; Modelplane
schedules it onto matching clusters and keeps it running. You never name a
cluster.
Modelplane is unopinionated about the engine itself. You bring the container and its flags, and Modelplane shapes a serving topology around it. Parallelism, quantization, and KV transfer are all set through the engine flags you write; Modelplane never injects them.
A deployment’s spec.engines describes its topology through two choices:
- One pod or a gang: whether an engine is a single Standalone pod or a Leader with one or more Worker pods coordinating across nodes.
- Unified or disaggregated: whether spec.serving.mode keeps prefill and decode together (Unified, the default) or splits them across two engines (PrefillDecode).
How many of each to run is a separate question, covered in Sizing a deployment.
Single-node
The default, and what the getting started tour
deploys. One Standalone member is one pod on one node, claiming that node’s
GPUs through its nodeSelector. It’s usually the right choice when a model fits
on a single node. Within a node, tensor parallelism is an engine flag
(--tensor-parallel-size), not a Modelplane concept.
engines:
  - name: qwen
    members:
      - role: Standalone # one pod, one node
Multi-node
When a model is too large for one node’s GPUs, make the engine a gang: a Leader
and a Worker member, where worker.nodes sets how many worker pods to run, one per
node. The pods serve the model together; how the model splits across them
(tensor, pipeline, data, or expert parallelism) is up to your engine flags.
A gang should use a ModelCache via
spec.modelCacheRef, so every pod mounts the same weights instead of each
pulling its own.
modelCacheRef:
  name: qwen3-coder # recommended for gangs
engines:
  - name: qwen3-coder
    members:
      - role: Leader
      - role: Worker
        worker:
          nodes: 1 # one worker pod per node
A member’s env can read pod fields through valueFrom.fieldRef, like setting
vLLM’s VLLM_HOST_IP from status.podIP, which multi-NIC RDMA nodes need so the
engine binds the right interface instead of guessing it.
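As a minimal sketch of that pattern, assuming the member's template takes a standard Kubernetes pod spec (as the examples on this page suggest), the env entry could look like the following. The fieldRef syntax is plain Kubernetes; placing it on the Worker member is only illustrative, and a Leader would presumably take the same entry.

```yaml
members:
  - role: Worker
    worker:
      nodes: 1
    template:
      spec:
        containers:
          - name: engine
            env:
              # Downward API: resolve the pod's own IP at start-up so the
              # engine binds that address instead of picking an interface.
              - name: VLLM_HOST_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
```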
Disaggregated serving
The prefill and decode phases have opposite hardware profiles, and on one engine
a prefill burst stalls the decodes already running. Set
spec.serving.mode: PrefillDecode to run them as two engines, one marking
phase: Prefill and the other phase: Decode. Modelplane fronts the pair with
inference-aware routing that sequences prefill then decode, moving the KV cache
between them. Each phase can sit on the GPU class that suits it.
serving:
  mode: PrefillDecode # the two engines below are one P/D pair
engines:
  - name: prefill
    phase: Prefill
  - name: decode
    phase: Decode
Disaggregation pays off for large models under load with strict latency targets and long context. For small models or low traffic, the KV-transfer overhead outweighs the benefit, so unified serving is the default.
Disaggregated serving requires an engine image that includes the NIXL KV-transfer
runtime. vLLM’s NixlConnector (and SGLang’s prefill/decode transfer) import the
nixl package, so on an image that lacks it, disaggregated engines crash at startup
with “NIXL is not available”. Recent vanilla vllm/vllm-openai images include NIXL,
so pin a current tag rather than an old one. The engine image is yours to choose,
so this is a prerequisite Modelplane does not bundle for you.
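A sketch of what the two engines might carry, assuming vLLM with its NixlConnector. The --kv-transfer-config flag and its kv_connector and kv_role keys belong to vLLM, not Modelplane; the model, image tag, and memory thresholds are illustrative assumptions, so check them against the engine version you pin.

```yaml
serving:
  mode: PrefillDecode
engines:
  - name: prefill
    phase: Prefill
    members:
      - role: Standalone
        nodeSelector:
          devices:
            - name: gpu
              count: 1
              selectors:
                - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
        template:
          spec:
            containers:
              - name: engine
                image: vllm/vllm-openai:v0.23.0 # assumed to include NIXL
                args:
                  - "--model=Qwen/Qwen3-8B"
                  # KV transfer is an engine flag you write yourself.
                  - "--kv-transfer-config={\"kv_connector\": \"NixlConnector\", \"kv_role\": \"kv_both\"}"
  - name: decode
    phase: Decode
    members:
      - role: Standalone
        nodeSelector:
          devices:
            - name: gpu
              count: 1
              selectors:
                # A different GPU class for decode, chosen only as an example.
                - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0
        template:
          spec:
            containers:
              - name: engine
                image: vllm/vllm-openai:v0.23.0 # assumed to include NIXL
                args:
                  - "--model=Qwen/Qwen3-8B"
                  - "--kv-transfer-config={\"kv_connector\": \"NixlConnector\", \"kv_role\": \"kv_both\"}"
```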
Requesting GPUs
You don’t name a cluster or a GPU model. Instead each member’s nodeSelector
lists the hardware its pods need, and Modelplane finds a node pool that has it.
The platform team publishes node pools as InferenceClass resources, each
describing the devices its nodes carry. Your request is matched against them.
A request names a device (gpu), how many of it each pod needs (count), and
one or more selectors the device must match:
nodeSelector:
  devices:
    - name: gpu
      count: 1 # one GPU per pod
      selectors:
        - cel: |
            device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
Each selector is a single CEL expression (CEL is a small expression language)
that returns true or false for one device. The part in brackets, "gpu.nvidia.com", is the
GPU vendor’s driver. The fields after it, like memory or architecture, are
what the platform team published for that device. This one says “match a GPU
whose memory is at least 40Gi.” A device has to match every selector in the
request. Give two selectors to mean “Hopper, with at least 80Gi.”
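For instance, the “Hopper, with at least 80Gi” request could be written as two selectors on one device entry. This is a sketch: it assumes the platform team publishes an architecture attribute and a memory capacity under the gpu.nvidia.com driver, as the other examples on this page do.

```yaml
nodeSelector:
  devices:
    - name: gpu
      count: 1
      selectors:
        # Both selectors must be true for the same device.
        - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
        - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0
```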
Requesting more than one device
devices is a list, so a member can ask for distinct kinds of hardware at once,
each its own entry with its own count and selectors. A node pool matches the
member only when it satisfies every entry. This is how you ask for both a GPU and
a fast NIC on the same node:
nodeSelector:
  devices:
    - name: gpu
      count: 8
      selectors:
        - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
    - name: nic
      count: 1
      selectors:
        - cel: device.attributes["nic.nvidia.com"].linkType == "infiniband"
What you can match on
Each selector is evaluated against one device and must return a boolean. The device exposes three things:
- device.driver: the device’s driver, a string.
- device.attributes["<driver>"].<name>: a typed attribute (string, bool, int, or version), such as architecture or cudaComputeCapability.
- device.capacity["<driver>"].<name>: a capacity quantity, such as memory.
Two helpers build comparable values: quantity() parses Kubernetes quantities
like "40Gi", and semver() parses versions like "9.0.0". Both support
compareTo (which orders two values), isGreaterThan, and isLessThan. Within a
selector, combine conditions with the usual CEL operators (==, !=, >=, &&, ||).
selectors:
  # Capacity: at least 40Gi of GPU memory. >= 0 reads as "left is at least right".
  - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
  # Attribute equality: a specific architecture.
  - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
  # Version attribute: a CUDA compute capability strictly above 8.9.0.
  - cel: device.attributes["gpu.nvidia.com"].cudaComputeCapability.isGreaterThan(semver("8.9.0"))
  # Driver: match any device from a given driver.
  - cel: device.driver == "gpu.nvidia.com"
  # Presence: only match a device that publishes a given domain.
  - cel: '"gpu.nvidia.com" in device.attributes'
  # Two conditions in one selector.
  - cel: |
      device.attributes["gpu.nvidia.com"].architecture == "Hopper" &&
      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0
This is the Kubernetes DRA device selector expression surface. The Kubernetes-specific CEL extension libraries (such as regular expressions and IP address helpers) aren’t available. Selectors in practice are attribute and capacity comparisons like those above.
Seeing what’s available
To see what you can match against, list the classes the platform team has published and look at the devices each one declares:
kubectl get inferenceclass
kubectl describe inferenceclass gke-l4-1x-g2
The describe output shows each device’s driver, attributes (like
architecture), and capacity (like memory), which are exactly the keys your
selectors read. If a selector asks for something no published class offers, the
deployment won’t schedule.
Sizing a deployment
Three independent numbers control how many pods a deployment runs:
- spec.replicas stamps out whole copies of the entire topology. Each replica is a complete serving instance, and replicas usually land on different clusters. This is the scaling axis (see Scaling).
- engines[].copies runs several identical copies of one engine within a replica, on the same cluster. It’s a fixed number, sized once, never autoscaled. Copies make a replica more resilient within its cluster: a node failure drops one copy instead of taking the whole replica out of service. In disaggregated serving they also set the prefill-to-decode ratio.
- worker.nodes sets how many worker nodes a gang adds to its Leader: the gang is a Leader plus that many Worker pods, one per node. It’s how big a single multi-node engine is.
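To see how the three numbers multiply, here is a hypothetical disaggregated deployment. The names and values are made up for illustration, nodeSelector and template are omitted for brevity, and the pod counts in the comments are arithmetic that follows from the descriptions above, not output from Modelplane.

```yaml
spec:
  replicas: 2                 # two complete serving instances
  serving:
    mode: PrefillDecode
  engines:
    - name: prefill
      phase: Prefill
      copies: 3               # three prefill engines per replica
      members:
        - role: Standalone    # 1 pod per copy
    - name: decode
      phase: Decode
      copies: 1               # gives a 3:1 prefill-to-decode ratio
      members:
        - role: Leader        # 1 pod
        - role: Worker
          worker:
            nodes: 1          # +1 pod, so 2 pods per copy
# Pods per replica: 3 x 1 (prefill) + 1 x 2 (decode) = 5
# Pods in total:    2 replicas x 5 = 10
```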
Scaling
spec.replicas is the only scaling axis. Each replica is a complete,
fixed-shape serving instance, so scaling adds or removes whole instances across
the fleet. Because the deployment exposes the Kubernetes scale subresource,
kubectl scale and KEDA work without anything extra. There’s no in-cluster pod
autoscaling.
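As a hedged sketch of the KEDA route, a ScaledObject could target the deployment through its scale subresource. The ScaledObject fields and the cron trigger are KEDA's own; the object name, schedule, and replica counts are hypothetical, and the sketch assumes KEDA runs in the cluster that holds the ModelDeployment.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: qwen3-8b-scaler       # hypothetical name
  namespace: ml-team
spec:
  scaleTargetRef:
    apiVersion: modelplane.ai/v1alpha1
    kind: ModelDeployment
    name: qwen3-8b
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    # Run three replicas during weekday working hours (UTC).
    - type: cron
      metadata:
        timezone: Etc/UTC
        start: 0 8 * * 1-5
        end: 0 18 * * 1-5
        desiredReplicas: "3"
```

The manual equivalent would be something like kubectl scale modeldeployment qwen3-8b --namespace ml-team --replicas=3, assuming the resource name resolves the same way inferenceclass does in the kubectl commands earlier on this page.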
Choosing a topology
| Topology | Use when | How you set it |
|---|---|---|
| Single-node | The model fits on one node’s GPUs | One Standalone member (the default) |
| Multi-node | The model is too large for one node | A Leader and one or more Worker members, ideally with a modelCacheRef |
| Disaggregated serving | Large model, heavy load, strict latency, long context | serving.mode: PrefillDecode with two phase engines |
Examples
# A ModelDeployment deploys a model to one or more inference clusters.
# The scheduler picks clusters by clusterSelector labels and nodeSelector
# device requests, gated on available nodes. Each matched cluster gets one
# ModelReplica.
#
# The control plane creates a unified OpenAI-compatible endpoint:
# http://<gateway-address>/<namespace>/<name>/v1/chat/completions
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  # Number of ModelReplicas to fan out to. Each replica is a complete
  # serving instance scheduled to one InferenceCluster.
  replicas: 1
  # Optional: restrict the scheduler to clusters with specific labels.
  # clusterSelector:
  #   matchLabels:
  #     modelplane.ai/region: us-central
  # Engines are an array of inference engines. This model is one engine, one
  # Standalone member, one pod - the simplest shape. The engine composes to a
  # Deployment fronted by a Service.
  engines:
    - name: qwen3-8b
      members:
        # A Standalone member is a single self-contained engine pod. Its template
        # carries the container named "engine" - the inference engine; its image,
        # command, and args pass through verbatim.
        - role: Standalone
          # The member's per-node device request: a list of DRA device requests
          # describing what each of the member's pods needs from its node. The
          # scheduler matches each against a candidate pool's InferenceClass
          # devices and pins the member to a pool that satisfies them. Each
          # request's CEL is real DRA CEL over a single device; quantity() and
          # semver() are helpers. claim: DRA devices also become requests in the
          # DRA ResourceClaim the serving pods claim GPUs through, so an engine
          # must declare the GPUs it needs.
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  # Qwen3-8B fits comfortably on an L4; 20Gi selects one without
                  # over-constraining. A larger model would ask for more memory or a
                  # specific architecture here. This CEL is real DRA CEL: the scheduler
                  # matches it against the pool's declared device, and DRA matches it
                  # again against the GPU's ResourceSlice when it binds the claim.
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - "--model=Qwen/Qwen3-8B"
                    - "--served-model-name=qwen"
                    - "--reasoning-parser=qwen3"
                    - "--default-chat-template-kwargs={\"enable_thinking\": false}"
                    - "--enable-auto-tool-choice"
                    - "--tool-call-parser=hermes"
# A ModelDeployment serving one model across two nodes.
#
# When a model is too large to fit on one node's GPUs, make an engine a gang:
# give it a Leader and a Worker member, whose worker.nodes expands to that many
# worker pods, one per node. The scheduler picks a cluster with a pool that has
# enough GPUs per node and enough nodes for the whole gang, and Modelplane
# composes a LeaderWorkerSet-backed serving instance on it. The worker joins the
# leader through $(MODELPLANE_LEADER_ADDRESS), which Modelplane injects.
#
# Multi-node engines require a ModelCache: every pod in the gang mounts it at
# /mnt/models. When a member brings its own command, Modelplane does not inject
# --model, so the leader points the engine at the mount explicitly.
#
# This shape (vLLM's native multiprocessing backend, TP within a node and PP
# across nodes) is the one validated serving Qwen3-Coder-480B; see
# examples/qwen3-coder/ for the full platform side.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  replicas: 1
  modelCacheRef:
    name: qwen3-coder
  engines:
    - name: qwen3-coder
      members:
        - role: Leader
          nodeSelector:
            devices:
              - name: gpu
                count: 8
                selectors:
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  command:
                    - /bin/sh
                    - -c
                    - >-
                      exec vllm serve /mnt/models
                      --served-model-name=qwen3-coder
                      --tensor-parallel-size=8
                      --pipeline-parallel-size=2
                      --distributed-executor-backend=mp
                      --nnodes=2 --node-rank=0
                      --master-addr=$(MODELPLANE_LEADER_ADDRESS)
                      --max-model-len=32768
                      --port=8000
        - role: Worker
          worker:
            nodes: 1
          nodeSelector:
            devices:
              - name: gpu
                count: 8
                selectors:
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  command:
                    - /bin/sh
                    - -c
                    - >-
                      exec vllm serve /mnt/models
                      --served-model-name=qwen3-coder
                      --tensor-parallel-size=8
                      --pipeline-parallel-size=2
                      --distributed-executor-backend=mp
                      --nnodes=2 --node-rank=1
                      --master-addr=$(MODELPLANE_LEADER_ADDRESS)
                      --headless
                      --max-model-len=32768