
Deploy a Model

API: modelplane.ai/v1alpha1 · ModelDeployment

A ModelDeployment is the ML team’s primary interface. You describe the model you want served, the hardware it needs, and how many copies to run; Modelplane schedules it onto matching clusters and keeps it running. You never name a cluster.

Modelplane is unopinionated about the engine itself. You bring the container and its flags, and Modelplane shapes a serving topology around it. Parallelism, quantization, and KV transfer are all set by the engine flags you write; Modelplane never injects them.

A deployment’s spec.engines describes its topology through two choices:

  • One pod or a gang: whether an engine is a single Standalone pod or a Leader with one or more Worker pods coordinating across nodes.
  • Unified or disaggregated: whether spec.serving.mode keeps prefill and decode together (Unified, the default) or splits them across two engines (PrefillDecode).

How many of each to run is a separate question, covered in Sizing a deployment.

Single-node

The default, and what the getting started tour deploys. One Standalone member is one pod on one node, claiming that node’s GPUs through its nodeSelector. It’s usually the right choice when a model fits on a single node. Within a node, tensor parallelism is an engine flag (--tensor-parallel-size), not a Modelplane concept.

engines:
- name: qwen
  members:
  - role: Standalone        # one pod, one node
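
As a sketch of how the two fit together, the member below claims two GPUs on one node and passes a matching --tensor-parallel-size to the engine. The model, GPU count, and memory floor are illustrative assumptions, not recommendations.

engines:
- name: qwen
  members:
  - role: Standalone
    nodeSelector:
      devices:
      - name: gpu
        count: 2            # GPUs claimed on the one node
        selectors:
        - cel: |
            device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
    template:
      spec:
        containers:
        - name: engine
          image: vllm/vllm-openai:v0.23.0
          args:
          - "--model=Qwen/Qwen3-8B"
          - "--tensor-parallel-size=2"   # engine flag; keep it equal to count

Modelplane does not derive the flag from count, so the two numbers have to be kept in sync by hand.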

Multi-node

When a model is too large for one node’s GPUs, make the engine a gang: a Leader and a Worker whose worker.nodes expands to that many worker pods, one per node. The pods serve the model together; how the model splits across them (tensor, pipeline, data, or expert parallelism) is up to your engine flags.

A gang should use a ModelCache via spec.modelCacheRef, so every pod mounts the same weights instead of each pulling its own.

modelCacheRef:
  name: qwen3-coder         # recommended for gangs
engines:
- name: qwen3-coder
  members:
  - role: Leader
  - role: Worker
    worker:
      nodes: 1              # one worker pod per node

A member’s env can read pod fields through valueFrom.fieldRef, like setting vLLM’s VLLM_HOST_IP from status.podIP, which multi-NIC RDMA nodes need so the engine binds the right interface instead of guessing it.
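
A minimal sketch of that VLLM_HOST_IP setting follows. It assumes the env sits on the engine container inside the member's template, as in a standard Kubernetes pod spec; if your Modelplane version exposes env elsewhere on the member, the valueFrom.fieldRef part stays the same.

members:
- role: Leader
  template:
    spec:
      containers:
      - name: engine
        env:
        - name: VLLM_HOST_IP          # read by vLLM to choose its bind address
          valueFrom:
            fieldRef:
              fieldPath: status.podIP # resolved per pod when it starts

Worker members would carry the same entry, since each pod resolves its own IP.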

Disaggregated serving

The prefill and decode phases have opposite hardware profiles (prefill is compute-bound, decode is bound by memory bandwidth), and on one engine a prefill burst stalls the decodes already running. Set spec.serving.mode: PrefillDecode to run them as two engines, one marking phase: Prefill and the other phase: Decode. Modelplane fronts the pair with inference-aware routing that sequences prefill then decode, moving the KV cache between them. Each phase can sit on the GPU class that suits it.

serving:
  mode: PrefillDecode       # the two engines below are one P/D pair
engines:
- name: prefill
  phase: Prefill
- name: decode
  phase: Decode

Disaggregation pays off for large models under load with strict latency targets and long context. For small models or low traffic, the KV-transfer overhead outweighs the benefit, so unified serving is the default.

Disaggregated serving requires an engine image that includes the NIXL KV-transfer runtime. vLLM’s NixlConnector (and SGLang’s prefill/decode transfer) import the nixl package, so on an image that lacks it, disaggregated engines crash at startup with “NIXL is not available”. Recent vanilla vllm/vllm-openai images include NIXL, so pin a current tag rather than an old one. The engine image is yours to choose, so this is a prerequisite Modelplane does not bundle for you.
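
Because KV transfer is carried by your engine flags, each phase engine has to enable the connector itself. The sketch below assumes vLLM and its --kv-transfer-config flag with the NixlConnector; the model and the JSON values are illustrative and may differ between vLLM versions, and each member's nodeSelector is omitted for brevity.

serving:
  mode: PrefillDecode
engines:
- name: prefill
  phase: Prefill
  members:
  - role: Standalone
    template:
      spec:
        containers:
        - name: engine
          image: vllm/vllm-openai:v0.23.0   # must ship the nixl package
          args:
          - "--model=Qwen/Qwen3-8B"
          - '--kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_both"}'
- name: decode
  phase: Decode
  members:
  - role: Standalone
    template:
      spec:
        containers:
        - name: engine
          image: vllm/vllm-openai:v0.23.0
          args:
          - "--model=Qwen/Qwen3-8B"
          - '--kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_both"}'

Both engines serve the same model and use the same connector settings, so the KV cache written by prefill is readable by decode.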

Requesting GPUs

You don’t name a cluster or a GPU model. Instead each member’s nodeSelector lists the hardware its pods need, and Modelplane finds a node pool that has it. The platform team publishes node pools as InferenceClass resources, each describing the devices its nodes carry. Your request is matched against them.

A request names a device (gpu), how many of it each pod needs (count), and one or more selectors the device must match:

nodeSelector:
  devices:
  - name: gpu
    count: 1                # one GPU per pod
    selectors:
    - cel: |
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0

Each selector is a single CEL expression that returns true or false for one device. CEL is a small expression language. The part in brackets, "gpu.nvidia.com", is the GPU vendor’s driver. The fields after it, like memory or architecture, are what the platform team published for that device. This one says “match a GPU whose memory is at least 40Gi.” A device has to match every selector in the request, so giving two selectors, one for architecture and one for memory, means “Hopper, with at least 80Gi.”
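
For instance, the “Hopper, with at least 80Gi” request could be written as two selectors on one device entry. This is a sketch that reuses the attribute and capacity names shown elsewhere on this page; the names your platform team publishes may differ.

nodeSelector:
  devices:
  - name: gpu
    count: 1
    selectors:
    # Both selectors must be true for the same device.
    - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
    - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0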

Requesting more than one device

devices is a list, so a member can ask for distinct kinds of hardware at once, each its own entry with its own count and selectors. A node pool matches the member only when it satisfies every entry. This is how you ask for both a GPU and a fast NIC on the same node:

nodeSelector:
  devices:
  - name: gpu
    count: 8
    selectors:
    - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
  - name: nic
    count: 1
    selectors:
    - cel: device.attributes["nic.nvidia.com"].linkType == "infiniband"

What you can match on

Each selector is evaluated against one device and must return a boolean. The device exposes three things:

  • device.driver: the device’s driver, a string.
  • device.attributes["<driver>"].<name>: a typed attribute (string, bool, int, or version), such as architecture or cudaComputeCapability.
  • device.capacity["<driver>"].<name>: a capacity quantity, such as memory.

Two helpers build comparable values: quantity() parses Kubernetes quantities like "40Gi", and semver() parses versions like "9.0.0". Both support compareTo (which orders two values), isGreaterThan, and isLessThan. Combine conditions inside a selector with the usual CEL operators (==, !=, >=, &&, ||).

selectors:
# Capacity: at least 40Gi of GPU memory. >= 0 reads as "left is at least right".
- cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
# Attribute equality: a specific architecture.
- cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
# Version attribute: a CUDA compute capability strictly newer than 8.9.0.
- cel: device.attributes["gpu.nvidia.com"].cudaComputeCapability.isGreaterThan(semver("8.9.0"))
# Driver: match any device from a given driver.
- cel: device.driver == "gpu.nvidia.com"
# Presence: only match a device that publishes a given domain.
- cel: '"gpu.nvidia.com" in device.attributes'
# Two conditions in one selector.
- cel: |
    device.attributes["gpu.nvidia.com"].architecture == "Hopper" &&
    device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0

This is the Kubernetes DRA device selector expression surface. The Kubernetes-specific CEL extension libraries (such as regular expressions and IP address helpers) aren’t available. Selectors in practice are attribute and capacity comparisons like those above.

Seeing what’s available

To see what you can match against, list the classes the platform team has published and look at the devices each one declares:

kubectl get inferenceclass
kubectl describe inferenceclass gke-l4-1x-g2

The describe output shows each device’s driver, attributes (like architecture), and capacity (like memory), which are exactly the keys your selectors read. If a selector asks for something no published class offers, the deployment won’t schedule.

Sizing a deployment

Three independent numbers control how many pods a deployment runs:

  • spec.replicas stamps out whole copies of the entire topology. Each replica is a complete serving instance, and replicas usually land on different clusters. This is the scaling axis (see Scaling).
  • engines[].copies runs several identical copies of one engine within a replica, on the same cluster. It’s a fixed number, sized once, never autoscaled. Copies make a replica more resilient within its cluster: a node failure drops one copy instead of taking the whole replica out of service. In disaggregated serving they also set the prefill-to-decode ratio.
  • worker.nodes sets how many Worker pods one gang adds to its Leader, one per node, so a gang spans worker.nodes + 1 nodes. It’s how big a single multi-node engine is.
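
As a rough worked example of how the three numbers multiply, the fragment below counts pods for one gang engine. It assumes each gang contributes one Leader pod plus worker.nodes Worker pods, as described in the multi-node section; the values are arbitrary and other fields are omitted.

replicas: 2                 # 2 whole serving instances
engines:
- name: qwen3-coder
  copies: 2                 # 2 identical gangs inside each replica
  members:
  - role: Leader            # 1 pod per gang
  - role: Worker
    worker:
      nodes: 3              # 3 worker pods per gang, one per node
# Pods per gang:    1 Leader + 3 Workers = 4
# Pods per replica: 2 copies x 4         = 8
# Pods in total:    2 replicas x 8       = 16

Only the first number, replicas, changes when the deployment scales; the other two fix the shape of each replica.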

Scaling

spec.replicas is the only scaling axis. Each replica is a complete, fixed-shape serving instance, so scaling adds or removes whole instances across the fleet. Because the deployment exposes the Kubernetes scale subresource, kubectl scale and KEDA work without anything extra. There’s no in-cluster pod autoscaling.
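
As a sketch of the KEDA case, a ScaledObject can target the deployment directly through its scale subresource. The object name, Prometheus address, query, and thresholds below are hypothetical placeholders, and the sketch assumes KEDA runs in the cluster that holds the ModelDeployment.

# Manual equivalent: kubectl scale modeldeployment qwen3-8b -n ml-team --replicas=3
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: qwen3-8b-scaler               # hypothetical name
  namespace: ml-team                  # same namespace as the target
spec:
  scaleTargetRef:
    apiVersion: modelplane.ai/v1alpha1
    kind: ModelDeployment
    name: qwen3-8b
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # hypothetical address
      query: "sum(vllm:num_requests_waiting)"                # hypothetical query
      threshold: "10"

Each step KEDA takes adds or removes a whole replica, so thresholds should reflect the capacity of one complete serving instance rather than one pod.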

Choosing a topology

Topology               Use when                                               How you set it
Single-node            The model fits on one node’s GPUs                      One Standalone member (the default)
Multi-node             The model is too large for one node                    A Leader and one or more Worker members, ideally with a modelCacheRef
Disaggregated serving  Large model, heavy load, strict latency, long context  serving.mode: PrefillDecode with two phase engines

Examples

model-deployment.yaml
# A ModelDeployment deploys a model to one or more inference clusters.
# The scheduler picks clusters by clusterSelector labels and nodeSelector
# device requests, gated on available nodes. Each matched cluster gets one
# ModelReplica.
#
# The control plane creates a unified OpenAI-compatible endpoint:
#   http://<gateway-address>/<namespace>/<name>/v1/chat/completions
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  # Number of ModelReplicas to fan out to. Each replica is a complete
  # serving instance scheduled to one InferenceCluster.
  replicas: 1

  # Optional: restrict the scheduler to clusters with specific labels.
  # clusterSelector:
  #   matchLabels:
  #     modelplane.ai/region: us-central

  # Engines are an array of inference engines. This model is one engine, one
  # Standalone member, one pod - the simplest shape. The engine composes to a
  # Deployment fronted by a Service.
  engines:
  - name: qwen3-8b
    members:
    # A Standalone member is a single self-contained engine pod. Its template
    # carries the container named "engine" - the inference engine; its image,
    # command, and args pass through verbatim.
    - role: Standalone
      # The member's per-node device request: a list of DRA device requests
      # describing what each of the member's pods needs from its node. The
      # scheduler matches each against a candidate pool's InferenceClass
      # devices and pins the member to a pool that satisfies them. Each
      # request's CEL is real DRA CEL over a single device; quantity() and
      # semver() are helpers. claim: DRA devices also become requests in the
      # DRA ResourceClaim the serving pods claim GPUs through, so an engine
      # must declare the GPUs it needs.
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # Qwen3-8B fits comfortably on an L4; 20Gi selects one without
          # over-constraining. A larger model would ask for more memory or a
          # specific architecture here. This CEL is real DRA CEL: the scheduler
          # matches it against the pool's declared device, and DRA matches it
          # again against the GPU's ResourceSlice when it binds the claim.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - "--model=Qwen/Qwen3-8B"
            - "--served-model-name=qwen"
            - "--reasoning-parser=qwen3"
            - "--default-chat-template-kwargs={\"enable_thinking\": false}"
            - "--enable-auto-tool-choice"
            - "--tool-call-parser=hermes"

model-deployment-multinode.yaml
# A ModelDeployment serving one model across two nodes.
#
# When a model is too large to fit on one node's GPUs, make an engine a gang:
# give it a Leader and a Worker member, whose worker.nodes expands to that many
# worker pods, one per node. The scheduler picks a cluster with a pool that has
# enough GPUs per node and enough nodes for the whole gang, and Modelplane
# composes a LeaderWorkerSet-backed serving instance on it. The worker joins the
# leader through $(MODELPLANE_LEADER_ADDRESS), which Modelplane injects.
#
# Multi-node engines require a ModelCache: every pod in the gang mounts it at
# /mnt/models. When a member brings its own command, Modelplane does not inject
# --model, so the leader points the engine at the mount explicitly.
#
# This shape (vLLM's native multiprocessing backend, TP within a node and PP
# across nodes) is the one validated serving Qwen3-Coder-480B; see
# examples/qwen3-coder/ for the full platform side.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  replicas: 1
  modelCacheRef:
    name: qwen3-coder
  engines:
  - name: qwen3-coder
    members:
    - role: Leader
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2 --node-rank=0
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --max-model-len=32768
              --port=8000
    - role: Worker
      worker:
        nodes: 1
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2 --node-rank=1
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --headless
              --max-model-len=32768