# Modelplane Documentation > Modelplane is the open source control plane for AI model serving. It extends Crossplane to manage AI inference across a fleet of GPU clusters. --- # Overview Source: https://docs.modelplane.ai/overview/ Modelplane is the open source control plane for AI inference. It's software you install and run in your own environment, and it orchestrates the models, serving stack, and infrastructure across cloud, neocloud, and on-premise. Modelplane supports running any model and any engine on any infrastructure, with the frontier-level serving topologies and performance the largest models demand, from a single GPU to disaggregated, multi-node deployments. Modelplane operates across the whole fleet: provisioning inference clusters, scheduling model deployments on compatible clusters, autoscaling model replicas across clusters, caching model weights across clusters, and routing across clusters. It's an active system that is always reconciling the fleet toward the state you declare. You install Modelplane on a Kubernetes cluster, which becomes the control cluster for your inference fleet. It's built on [Crossplane](https://crossplane.io) and fully integrates with your existing platform systems. {{< hint warning >}} Modelplane is under active development. We have opted to build the project in the open, collaborating with the broad AI inference community on integrations and capabilities. {{< /hint >}} ## Deploy a model Modelplane's API is declarative, designed for platform teams responsible for the inference infrastructure and developers deploying models on that infrastructure. 
Once a platform team has provisioned inference clusters and declared the available GPUs and networking fabric, an ML development team deploys a model with a declarative manifest: ```yaml {nocopy=true} apiVersion: modelplane.ai/v1alpha1 kind: ModelDeployment metadata: name: qwen-demo namespace: ml-team spec: replicas: 1 engines: - name: qwen members: - role: Standalone nodeSelector: devices: - name: gpu count: 1 selectors: - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0 template: spec: containers: - name: engine image: vllm/vllm-openai:v0.23.0 args: ["--model=Qwen/Qwen2.5-0.5B-Instruct"] ``` Modelplane schedules a model replica onto an inference cluster with free, compatible GPUs and memory, and deploys the serving engine. Exposing an OpenAI-compatible endpoint can be done by declaring a model service: ```yaml {nocopy=true} apiVersion: modelplane.ai/v1alpha1 kind: ModelService metadata: name: qwen namespace: ml-team spec: endpoints: - selector: matchLabels: modelplane.ai/deployment: qwen-demo ``` ## A universal control plane for AI inference Modelplane is designed to be a universal control plane for inference. It runs inference clusters on any cloud, neocloud, or on-premise environment, or any combination of them. Modelplane can provision the clusters for you, or you can bring your own. It supports any serving engine that runs as a container, and can serve frontier-quality models using advanced topologies including tensor parallel, pipeline parallel, data and expert parallel, and prefill/decode disaggregation. Modelplane works across different accelerators and networking fabrics, and schedules each model's replicas by matching the model's hardware requirements to the hardware available across your clusters. ## What Modelplane is not Modelplane is not a serving engine like vLLM, SGLang, or TensorRT-LLM. Modelplane composes serving engines and orchestrates them fleet-wide across cloud, neocloud, and on-premise. 
Modelplane is not a managed inference service like Baseten, Together, or Fireworks. These offer cloud services, while Modelplane is self-hosted software. ## Next steps {{< cardgroup cols="2" >}} {{< card title="Get started" href="/getting-started/" cta="Deploy on a real fleet" >}} Go from nothing to a live OpenAI-compatible endpoint in about 45 minutes. {{< /card >}} {{< card title="Why Modelplane" href="/overview/why/" cta="Learn more" >}} Learn more about Modelplane's capabilities and how it works. {{< /card >}} {{< /cardgroup >}} --- # Deploy a Model Source: https://docs.modelplane.ai/models/model-deployment/ **API:** [`modelplane.ai/v1alpha1` · ModelDeployment]({{< ref "/reference/modeldeployments" >}}) A `ModelDeployment` is the ML team's primary interface. You describe the model you want served, the hardware it needs, and how many copies to run; Modelplane schedules it onto matching clusters and keeps it running. You never name a cluster. Modelplane is unopinionated about the engine itself. You bring the container and its flags, and Modelplane shapes a serving topology around it. Parallelism, quantization, and KV transfer are all set through the engine flags you write; Modelplane never injects them. A deployment's `spec.engines` describes its topology through two choices: - **One pod or a gang**: whether an engine is a single `Standalone` pod or a `Leader` with one or more `Worker` pods coordinating across nodes. - **Unified or disaggregated**: whether `spec.serving.mode` keeps prefill and decode together (`Unified`, the default) or splits them across two engines (`PrefillDecode`). How many of each to run is a separate question, covered in [Sizing a deployment](#sizing-a-deployment). ## Single-node The default, and what the [getting started tour]({{< ref "/getting-started" >}}) deploys. One `Standalone` member is one pod on one node, claiming that node's GPUs through its `nodeSelector`. It's usually the right choice when a model fits on a single node.
Within a node, tensor parallelism is an engine flag (`--tensor-parallel-size`), not a Modelplane concept. ```yaml {nocopy=true} engines: - name: qwen members: - role: Standalone # one pod, one node ``` ## Multi-node When a model is too large for one node's GPUs, make the engine a gang: a `Leader` and a `Worker` whose `worker.nodes` expands to that many worker pods, one per node. The pods serve the model together; how the model splits across them (tensor, pipeline, data, or expert parallelism) is up to your engine flags. A gang should use a [`ModelCache`]({{< ref "model-cache.md" >}}) via `spec.modelCacheRef`, so every pod mounts the same weights instead of each pulling its own. ```yaml {nocopy=true} modelCacheRef: name: qwen3-coder # recommended for gangs engines: - name: qwen3-coder members: - role: Leader - role: Worker worker: nodes: 1 # one worker pod per node ``` A member's `env` can read pod fields through `valueFrom.fieldRef`, like setting vLLM's `VLLM_HOST_IP` from `status.podIP`, which multi-NIC RDMA nodes need so the engine binds the right interface instead of guessing it. ## Disaggregated serving The prefill and decode phases have opposite hardware profiles, and on one engine a prefill burst stalls the decodes already running. Set `spec.serving.mode: PrefillDecode` to run them as two engines, one marking `phase: Prefill` and the other `phase: Decode`. Modelplane fronts the pair with inference-aware routing that sequences prefill then decode, moving the KV cache between them. Each phase can sit on the GPU class that suits it. ```yaml {nocopy=true} serving: mode: PrefillDecode # the two engines below are one P/D pair engines: - name: prefill phase: Prefill - name: decode phase: Decode ``` Disaggregation pays off for large models under load with strict latency targets and long context. For small models or low traffic, the KV-transfer overhead outweighs the benefit, so unified serving is the default. 
It requires an engine image that includes the **NIXL** KV-transfer runtime. vLLM's `NixlConnector` (and SGLang's prefill/decode transfer) import the `nixl` package, so disaggregated engines crash at startup with `NIXL is not available` on an image that lacks it. Recent vanilla `vllm/vllm-openai` images include NIXL, so pin a current tag rather than an old one. The engine image is yours to choose, so this is a prerequisite Modelplane does not bundle for you. ## Requesting GPUs You don't name a cluster or a GPU model. Instead each member's `nodeSelector` lists the hardware its pods need, and Modelplane finds a node pool that has it. The platform team publishes node pools as `InferenceClass` resources, each describing the devices its nodes carry. Your request is matched against them. A request names a device (`gpu`), how many of it each pod needs (`count`), and one or more `selectors` the device must match: ```yaml {nocopy=true} nodeSelector: devices: - name: gpu count: 1 # one GPU per pod selectors: - cel: | device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0 ``` Each selector is a single line of [CEL](https://cel.dev/), a small expression language, that returns true or false for one device. The part in brackets, `"gpu.nvidia.com"`, is the GPU vendor's driver. The fields after it, like `memory` or `architecture`, are what the platform team published for that device. This one says "match a GPU whose memory is at least 40Gi." A device has to match every selector in the request. Give two selectors to mean "Hopper, with at least 80Gi." ### Requesting more than one device `devices` is a list, so a member can ask for distinct kinds of hardware at once, each its own entry with its own `count` and `selectors`. A node pool matches the member only when it satisfies every entry. 
This is how you ask for both a GPU and a fast NIC on the same node: ```yaml {nocopy=true} nodeSelector: devices: - name: gpu count: 8 selectors: - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper" - name: nic count: 1 selectors: - cel: device.attributes["nic.nvidia.com"].linkType == "infiniband" ``` ### What you can match on Each selector is evaluated against one device and must return a boolean. The device exposes three things: - `device.driver`: the device's driver, a string. - `device.attributes[""].`: a typed attribute (string, bool, int, or version), such as `architecture` or `cudaComputeCapability`. - `device.capacity[""].`: a capacity quantity, such as `memory`. Two helpers build comparable values: `quantity()` parses Kubernetes quantities like `"40Gi"`, and `semver()` parses versions like `"9.0.0"`. Both support `compareTo` (which orders two values), `isGreaterThan`, and `isLessThan`. Combine selectors with the usual CEL operators (`==`, `!=`, `>=`, `&&`, `||`). ```yaml {nocopy=true} selectors: # Capacity: at least 40Gi of GPU memory. >= 0 reads as "left is at least right". - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0 # Attribute equality: a specific architecture. - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper" # Version attribute: a minimum CUDA compute capability. - cel: device.attributes["gpu.nvidia.com"].cudaComputeCapability.isGreaterThan(semver("8.9.0")) # Driver: match any device from a given driver. - cel: device.driver == "gpu.nvidia.com" # Presence: only match a device that publishes a given domain. - cel: '"gpu.nvidia.com" in device.attributes' # Two conditions in one selector. - cel: | device.attributes["gpu.nvidia.com"].architecture == "Hopper" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0 ``` This is the Kubernetes DRA device selector expression surface. 
The Kubernetes-specific CEL extension libraries (such as regular expressions and IP address helpers) aren't available. Selectors in practice are attribute and capacity comparisons like those above. ### Seeing what's available To see what you can match against, list the classes the platform team has published and look at the devices each one declares: ```bash kubectl get inferenceclass kubectl describe inferenceclass gke-l4-1x-g2 ``` The `describe` output shows each device's driver, attributes (like `architecture`), and capacity (like `memory`), which are exactly the keys your selectors read. If a selector asks for something no published class offers, the deployment won't schedule. ## Sizing a deployment Three independent numbers control how many pods a deployment runs: - **`spec.replicas`** stamps out whole copies of the entire topology. Each replica is a complete serving instance, and replicas usually land on different clusters. This is the scaling axis (see [Scaling](#scaling)). - **`engines[].copies`** runs several identical copies of one engine within a replica, on the same cluster. It's a fixed number, sized once, never autoscaled. Copies make a replica more resilient within its cluster: a node failure drops one copy instead of taking the whole replica out of service. In disaggregated serving they also set the prefill-to-decode ratio. - **`worker.nodes`** sets how many nodes one gang spans: a `Leader` plus that many `Worker` pods. It's how big a single multi-node engine is. ## Scaling `spec.replicas` is the only scaling axis. Each replica is a complete, fixed-shape serving instance, so scaling adds or removes whole instances across the fleet. Because the deployment exposes the Kubernetes scale subresource, `kubectl scale` and KEDA work without anything extra. There's no in-cluster pod autoscaling. 
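The three sizing numbers multiply together, which a quick calculation makes concrete. The sketch below is plain shell arithmetic, assuming (as described under Sizing a deployment) that a `Standalone` engine copy is one pod and a gang copy is one `Leader` plus `worker.nodes` `Worker` pods. The `engine_pods` helper and the example values are hypothetical, and the count covers engine pods only, not any routing or supporting components.

```bash
# Illustrative helpers only; not part of Modelplane or kubectl.

# Engine pods in one engine: copies x (1 Standalone-or-Leader pod + worker.nodes).
# Usage: engine_pods COPIES WORKER_NODES   (use 0 worker nodes for Standalone)
engine_pods() {
  echo $(( $1 * (1 + $2) ))
}

# Hypothetical disaggregated replica: two Standalone prefill copies and
# one decode gang with worker.nodes: 1.
prefill=$(engine_pods 2 0)
decode=$(engine_pods 1 1)
per_replica=$(( prefill + decode ))

replicas=3   # spec.replicas
echo "engine pods per replica: $per_replica"
echo "engine pods in the fleet: $(( replicas * per_replica ))"
```

With these made-up values the sketch reports 4 engine pods per replica and 12 across three replicas. Only the `replicas` factor changes when the deployment scales; the per-replica shape stays fixed.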
## Choosing a topology | Topology | Use when | How you set it | |----------|----------|----------------| | Single-node | The model fits on one node's GPUs | One `Standalone` member (the default) | | Multi-node | The model is too large for one node | A `Leader` and one or more `Worker` members, ideally with a `modelCacheRef` | | Disaggregated serving | Large model, heavy load, strict latency, long context | `serving.mode: PrefillDecode` with two phase engines | ## Examples {{< tabs >}} {{< tab "Single-node" >}} {{< manifests "concepts/model-deployment.yaml" >}} {{< /tab >}} {{< tab "Multi-node" >}} {{< manifests "concepts/model-deployment-multinode.yaml" >}} {{< /tab >}} {{< /tabs >}} --- # Get started Source: https://docs.modelplane.ai/getting-started/ Modelplane is an open source control plane for AI inference. It separates two concerns: a platform team managing GPU capacity, and ML teams deploying models against it. Without it, every change on one side creates work for the other. When the platform team updates infrastructure, ML teams have to react. When model requirements change, the platform team gets a request. With Modelplane, the platform team publishes hardware without knowing what models will run on it. The ML team declares what a model needs without knowing what clusters exist. The control plane resolves it and keeps it current as both sides change. In this tour, you'll switch between provisioning infrastructure and declaring a model to see how they interact. By the end you'll have a GPU fleet across three regions and one OpenAI-compatible endpoint routing to a model served across two of them. This is not a production setup and takes around 45 minutes to run. ## What you'll build The platform team provisions a starter cluster and grows it to two A100 regions; the ML team serves a model on the L4, then scales it onto an A100, all behind one endpoint. 
{{< asciinema src="what-youll-build.cast" poster="npt:2:13" >}} ## Before you begin You'll need [kind](https://kind.sigs.k8s.io/), [kubectl](https://kubernetes.io/docs/tasks/tools/), and [Helm](https://helm.sh/docs/intro/install/) installed, plus an AWS or GCP account with permission to create clusters. Each step covers what it needs as you reach it. ## The tour 1. [Installation]({{< ref "getting-started/installation.md" >}}): stand up the Modelplane control plane. 2. [Build the platform]({{< ref "getting-started/build-the-platform.md" >}}): provision your first GPU cluster. 3. [Deploying a model]({{< ref "getting-started/deploying-a-model.md" >}}): serve a model and send it a request. 4. [Scale the platform]({{< ref "getting-started/scale-the-platform.md" >}}): grow to a multi-region fleet. 5. [Scale the model]({{< ref "getting-started/scale-the-model.md" >}}): serve the model from two regions behind one endpoint. First, follow the [Installation]({{< ref "getting-started/installation.md" >}}) guide. --- # Installation Source: https://docs.modelplane.ai/getting-started/installation/ The control plane is where everything in Modelplane runs. In this step you'll install it on a local kind cluster, using Crossplane for reconciliation and the Modelplane APIs. No cloud yet, that comes next. This step takes about five minutes. ## Prerequisites Install [kind](https://kind.sigs.k8s.io/), [kubectl](https://kubernetes.io/docs/tasks/tools/), and [Helm](https://helm.sh/docs/intro/install/) on your machine. {{< hint "note" >}} You can run your Modelplane control plane anywhere. This tour uses kind for illustration. {{< /hint >}} ## Install the control plane Crossplane provides the reconciliation engine and package management. 
Create the kind cluster and install it with Helm: ```bash kind create cluster --name modelplane ``` ```bash helm repo add crossplane-stable https://charts.crossplane.io/stable helm repo update crossplane-stable helm install crossplane crossplane-stable/crossplane \ --namespace crossplane-system --create-namespace \ --set "args={--enable-dependency-version-upgrades}" \ --wait ``` Apply the bootstrap resources. They grant Crossplane the permissions it needs to manage your cluster: ```shell kubectl apply -f {{< manifest-url "getting-started/prerequisites.yaml" >}} ``` {{< expand "Review the prerequisites manifest" >}} {{< manifests "getting-started/prerequisites.yaml" >}} {{< /expand >}} ## Install Modelplane The Modelplane Configuration adds the Modelplane APIs and the composition functions that reconcile them: {{< manifests "getting-started/configuration.yaml" >}} Wait until the configuration is healthy: ```bash kubectl wait configuration/modelplane --for=condition=Healthy --timeout=5m ``` ## Next step The control plane is running but has nothing to schedule against yet. In the next step, you'll [build the platform]({{< ref "getting-started/build-the-platform.md" >}}) to provision a GPU cluster and publish what hardware it offers. --- # Qwen3-8B Source: https://docs.modelplane.ai/examples/qwen3-8b/ An 8.2B dense chat model on a single NVIDIA L4. The smallest recipe: one `Standalone` engine, no cache, weights pulled straight from Hugging Face. This recipe was run end to end; the `InferenceClass` and `ModelDeployment` are the exact manifests from that run. Apply the platform side first, then the ML side. 
## Platform {{< manifests "examples/qwen3-8b/inference-class.yaml" >}} {{< manifests "examples/qwen3-8b/inference-cluster.yaml" >}} ## Deployment {{< manifests "examples/qwen3-8b/model-deployment.yaml" >}} {{< manifests "examples/qwen3-8b/model-service.yaml" >}} --- # Set Up the Gateway Source: https://docs.modelplane.ai/platform/inference-gateway/ **API:** [`modelplane.ai/v1alpha1` · InferenceGateway]({{< ref "/reference/inferencegateways" >}}) The `InferenceGateway` sets up the control plane's front door: one unified, OpenAI-compatible address that every `ModelService` is exposed through, routing each request on to the inference cluster serving it. The `InferenceGateway` is a singleton: create exactly one, named `default`, on your Modelplane control plane. It fronts every inference cluster in the fleet, so you don't create one per cluster. The `backend` field selects which gateway runs it. `Traefik` is the only value today. On a cloud cluster with a native LoadBalancer controller, the gateway's `Service` gets an external address on its own. On kind or bare-metal, where there's no such controller, set `spec.traefik.loadBalancer: MetalLB` and give it an address pool in `spec.traefik.metallb.addressPool` so the gateway gets an IP. See the example below. Once the gateway is ready, read its external address from `status.address`: ```bash kubectl get ig default -o jsonpath='{.status.address}' ``` That address is the host of every `ModelService` URL (`http://
//`), so it's what you hand to ML teams. ## Example {{< manifests "concepts/inference-gateway.yaml" >}} --- # Why Modelplane Source: https://docs.modelplane.ai/overview/why/ Open-weight models are becoming the choice for organizations: they can be post-trained, including with reinforcement learning, to compete with frontier models, and they put cost, governance, and data sovereignty back under the organization's control. As they do, platform teams are increasingly asked to provide GPU inference to their ML and development teams the same way they already provide cloud infrastructure. ## Kubernetes is becoming the default orchestrator Kubernetes is rapidly becoming the default orchestrator for inference. The broader cloud-native community is investing heavily to make it a first-class platform for AI workloads, adding device-aware scheduling, multi-node inference, distributed serving, and accelerator management. The major open source inference projects are converging on it; among them are vLLM, SGLang, NVIDIA Dynamo, llm-d, Ray, Slurm, KubeAI, and Kueue. Neoclouds like Baseten and CoreWeave have standardized on Kubernetes for their own operations. Inside a single cluster, the open source stack is now strong. ## Inference is a fleet problem Inference, however, almost always runs across more than one cluster. Accelerator availability scatters capacity across hardware types, providers, and regions. Sovereignty and compliance pin workloads to specific locations. Operators run across multiple clouds and on-premise environments. Large clusters concentrate failure and risk, so fleets of smaller clusters are often preferable, and inference workloads don't bin-pack the way other workloads do. Inference grows into a fleet, and a new set of problems appears above any single cluster: - Deciding where each model runs across available capacity. - Optimizing placement across heterogeneous accelerators. - Failing over across clouds and regions. 
- Routing by cost, latency, and sovereignty requirements. - Provisioning new capacity as demand grows. - Caching and distributing model weights across the fleet. - Managing the lifecycle of models, clusters, and infrastructure as one system. Open source addresses pieces of this but none brings all the pieces together in a fleet-wide system of record that manages placement, caching, capacity, policy, and routing across an entire fleet. The labs, hyperscalers, and managed providers have all solved these problems in a proprietary way, but the open equivalent does not yet exist. ## Modelplane extends Kubernetes to manage the fleet Modelplane does for the fleet what Kubernetes does for the cluster. It's the open source control plane above your inference clusters across cloud, neocloud, and on-premise: it places model deployments, autoscales replicas, provisions and manages the infrastructure underneath, caches and distributes model weights, and routes inference through one unified gateway with fallback to managed providers. It turns "I need this model served" into a stable endpoint for any ML team. Modelplane composes these projects rather than replacing them, and stays neutral across models, accelerators, clouds, and serving stacks. It's built on [Crossplane](https://crossplane.io) and extends Kubernetes to manage inference at the fleet level. Modelplane is open source, Apache 2 licensed, and we plan to donate it to a neutral open source foundation later this year. {{< cardgroup cols="2" >}} {{< card title="How Modelplane works" href="/overview/how-it-works/" >}} The architecture, the resources, and what happens when you deploy a model. {{< /card >}} {{< card title="FAQ" href="/overview/faq/" >}} How Modelplane compares to cluster orchestrators and managed providers, and what it requires. {{< /card >}} {{< /cardgroup >}} --- # Build the platform Source: https://docs.modelplane.ai/getting-started/build-the-platform/ This is the platform team's side of Modelplane. 
You set up the gateway that fronts your models, give the control plane cloud credentials, and register your first GPU cluster: a hardware profile published as an `InferenceClass` and an `InferenceCluster` that offers it. In the next step, the ML team will create a model deployment that schedules against this capacity without knowing which cluster it runs on. ## Prerequisites {{< tabs >}} {{< tab "EKS" >}} - An AWS account with permissions to create EKS clusters, VPCs, and IAM roles - AWS access key ID and secret access key {{< /tab >}} {{< tab "GKE" >}} - A GCP account with permissions to create GKE clusters, VPCs, and IAM roles - A GCP service account JSON key {{< /tab >}} {{< /tabs >}} ## Set up the InferenceGateway The `InferenceGateway` installs Traefik Proxy and MetalLB on the control plane. Traefik routes inference traffic to model replicas. MetalLB assigns Traefik's `LoadBalancer` service an external IP on kind, which doesn't have a cloud load balancer. You need one named `default` per control plane. If you run the control plane on a cloud cluster with native `LoadBalancer` support, omit the `loadBalancer` field. {{< manifests "getting-started/inference-gateway.yaml" >}} Wait until the gateway is ready: ```bash kubectl wait --for=condition=Ready ig/default --timeout=5m ``` ## Configure cloud credentials Give the control plane credentials so it can provision clusters in your cloud account. 
{{< tabs >}} {{< tab "EKS" >}} Create an AWS credentials file: {{< editCode >}} ```ini [default] aws_access_key_id = $@$@ aws_secret_access_key = $@$@ ``` {{< /editCode >}} Create a Kubernetes secret: {{< editCode >}} ```bash kubectl create secret generic aws-creds \ --from-file=credentials=$@$@ \ -n crossplane-system ``` {{< /editCode >}} Apply the `ClusterProviderConfig` referencing your secret: {{< manifests "getting-started/clusterproviderconfig-aws.yaml" >}} {{< /tab >}} {{< tab "GKE" >}} Create a Kubernetes secret: {{< editCode >}} ```bash kubectl create secret generic gcp-creds \ --from-file=credentials=$@$@.json \ -n crossplane-system ``` {{< /editCode >}} Apply the `ClusterProviderConfig`, setting `projectID` to your GCP project: {{< manifests path="getting-started/clusterproviderconfig-gke.yaml" apply="false" >}} {{< editCode >}} ```bash curl -fsSL {{< manifest-url "getting-started/clusterproviderconfig-gke.yaml" >}} \ | sed 's/my-gcp-project/$@$@/' \ | kubectl apply -f - ``` {{< /editCode >}} {{< /tab >}} {{< /tabs >}} ## Publish hardware and register the cluster The `InferenceClass` describes a hardware profile and how to provision it. The `InferenceCluster` registers a cluster that offers it. Apply both: {{< tabs >}} {{< tab "EKS">}} {{< manifests "getting-started/eks/platform.yaml" >}} Modelplane provisions the cluster. This takes about 15 minutes: ```bash kubectl wait --for=condition=Ready ic/eks-us-east --timeout=20m ``` {{< /tab >}} {{< tab "GKE" >}} Apply the manifest, setting the cluster's `project` to your GCP project: {{< manifests path="getting-started/gke/platform.yaml" apply="false" >}} {{< editCode >}} ```bash curl -fsSL {{< manifest-url "getting-started/gke/platform.yaml" >}} \ | sed 's/my-gcp-project/$@$@/' \ | kubectl apply -f - ``` {{< /editCode >}} Modelplane provisions the cluster.
This takes about 15 minutes: ```bash kubectl wait --for=condition=Ready ic/starter --timeout=20m ``` {{< /tab >}} {{< /tabs >}} {{< hint "note" >}} Modelplane is reconciling the infrastructure against the source of truth, the manifest you just applied. While you wait, Modelplane is creating the EKS or GKE cluster and its GPU node pool, then installing the inference stack with LeaderWorkerSet for multi-node serving, llm-d for inference-aware routing, Envoy Gateway for traffic management, and the storage class for model weights. This is the same reconciliation loop Crossplane uses to configure other infrastructure, extended to the inference layer. {{< /hint >}} Once the cluster is `Ready` the ML team can deploy a model on it. {{< hint "note" >}} A cloud GPU cluster costs money while it runs. To stop the tour and resume later, follow [Clean up]({{< ref "getting-started/clean-up.md" >}}). {{< /hint >}} ## Next step Now that the platform is provisioned, the ML team can [deploy a model]({{< ref "getting-started/deploying-a-model.md" >}}) by describing what the model needs, not the infrastructure. --- # Define Hardware Classes Source: https://docs.modelplane.ai/platform/inference-class/ **API:** [`modelplane.ai/v1alpha1` · InferenceClass]({{< ref "/reference/inferenceclasses" >}}) An `InferenceClass` is a tested recipe for a GPU node pool. It bundles: - **Devices**: the node's hardware as a list of Dynamic Resource Allocation (DRA) style devices, each with a driver, count, typed attributes, and capacity. The scheduler matches a member's `nodeSelector` against these devices, and GPUs bind to pods through DRA. - **Provisioning** (optional): how to create a node pool of this class on a specific cloud. Classes without provisioning are for existing clusters where the pool already exists. Different clouds and GPU types imply different classes. A GKE L4 pool is `gke-l4-1x-g2`. A bare-metal H100 pool is `h100-8x-byo` (no provisioning). 
## Describing devices A class's `devices` follow Kubernetes [Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) (DRA), the mechanism modern Kubernetes uses to match GPUs to pods. Each device has a `driver` (the vendor that owns it, such as `gpu.nvidia.com`), a `count` (how many a node has), typed `attributes` (such as `architecture`), and `capacity` (quantities, such as `memory`). This mirrors the shape the GPU's DRA driver publishes on a real node, so what you declare here is what an ML team's `nodeSelector` matches against and what DRA binds at runtime. You author the attribute and capacity keys, and there's no fixed list. Pick the ones an ML team would reasonably select on (GPU memory, architecture, compute capability), using the same names the driver reports. ## DRA and synthetic devices Each device sets a `claim` discriminator: - **`DRA`** (the default) is hardware a real DRA driver exposes, which today means GPUs. Modelplane both schedules against it and binds it to pods. - **`Synthetic`** is described for scheduling only, never claimed. Use it for hardware that matters for placement but has no DRA driver yet, like an InfiniBand fabric. ## The device contract The `driver`, attribute keys, and capacity keys a class declares are a contract with the ML team: a `ModelDeployment`'s `nodeSelector` matches a pool only if the class publishes the attributes and capacity it asks for. ML teams write those matches as [CEL](https://cel.dev/) selectors over the keys you publish here. For GPUs, these keys should mirror what the DRA driver reports, so the same selector that places a deployment on the pool also binds the right device. Publish a device's real usable capacity, not its nominal spec. An `80GB` H100 reports about `81559Mi` of usable memory, so a class that declares `80Gi` would let a `nodeSelector` asking for `>= 80Gi` match the pool but then fail to bind the GPU.
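The mismatch is easier to see once both figures are in the same unit. A minimal sketch in shell arithmetic, using only the two numbers quoted above (`80Gi` declared, about `81559Mi` usable); the variable names are illustrative:

```bash
# Compare a nominal 80Gi declaration with the usable memory quoted above.
declared_mi=$(( 80 * 1024 ))   # 80Gi expressed in Mi
usable_mi=81559                # approximate usable memory of an "80GB" H100, per the text

echo "declared: ${declared_mi}Mi, usable: ${usable_mi}Mi"
if [ "$usable_mi" -lt "$declared_mi" ]; then
  echo "shortfall: $(( declared_mi - usable_mi ))Mi"
fi
```

The declared value works out to 81920Mi, which is 361Mi more than the device can actually offer. Publishing the usable figure instead should mean a `>= 80Gi` request simply doesn't match the pool, rather than matching and then failing to bind.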
## Examples {{< tabs >}} {{< tab "GKE L4" >}} {{< manifests "concepts/inference-class-gke-l4.yaml" >}} {{< /tab >}} {{< tab "EKS L4" >}} {{< manifests "concepts/inference-class-eks-l4.yaml" >}} {{< /tab >}} {{< tab "H100 bare-metal" >}} {{< manifests "concepts/inference-class-h100-byo.yaml" >}} {{< /tab >}} {{< /tabs >}} --- # Expose a Model Source: https://docs.modelplane.ai/models/model-service/ **API:** [`modelplane.ai/v1alpha1` · ModelService]({{< ref "/reference/modelservices" >}}) A [`ModelDeployment`]({{< ref "model-deployment.md" >}}) serves a model, but its replicas are scattered across the fleet with no single address. A `ModelService` gives them one: a stable, unified, OpenAI-compatible URL that load-balances across every replica, wherever it runs. A service selects what to route to by label. Behind the scenes, Modelplane creates one `ModelEndpoint`, a single reachable backend, for each replica of a deployment and labels it. Two of those labels carry routing intent: - `modelplane.ai/deployment`: the deployment the replica belongs to. - `modelplane.ai/cluster`: the cluster the replica runs on. Modelplane creates an endpoint only once its replica is Ready, serving and reachable, and withdraws it if the replica later goes unhealthy. A service only ever routes to replicas that can actually answer, so a deployment that's still starting or scaling up has fewer endpoints behind its URL until those replicas come up. You don't create endpoints yourself. You point a service at them. `spec.endpoints` is a list, and the entries combine: the service routes to every endpoint that any entry matches. The patterns below build on that. ## Route to a whole deployment The common case: one selector matching a deployment's name reaches every replica, wherever in the fleet they run. 
```yaml {nocopy=true} spec: endpoints: - selector: matchLabels: modelplane.ai/deployment: qwen3-8b # every replica of this deployment ``` ## Route to part of a deployment Add a second label to narrow within a deployment. A selector matches an endpoint only when all its labels match, so pairing the deployment with a cluster routes to just that cluster's replicas. This is how you take a cluster out of service without redeploying: point the service at the clusters you want and leave one out, and traffic drains to the rest. ```yaml {nocopy=true} spec: endpoints: # Only the replicas on prod-us-east, e.g. while draining another cluster. - selector: matchLabels: modelplane.ai/deployment: qwen3-8b modelplane.ai/cluster: prod-us-east ``` ## Route across several deployments Give more than one entry to front several deployments behind the same URL. Each entry contributes its matched endpoints, and traffic spreads evenly across every one. ```yaml {nocopy=true} spec: endpoints: - selector: matchLabels: modelplane.ai/deployment: qwen3-8b - selector: matchLabels: modelplane.ai/deployment: qwen3-8b-v2 ``` This is the shape an A/B test or a canary rollout would take, but note traffic is split **evenly** across the matched endpoints today. Weighting one entry over another, to send, say, 5% of traffic to a canary, is tracked in [#90](https://github.com/modelplaneai/modelplane/issues/90). Until then the split follows endpoint counts, not a ratio you set. The entries don't have to be deployments. One can select a manually created [ModelEndpoint]({{< ref "model-endpoint.md" >}}) that points at an external provider, so a service can send overflow or break-glass traffic to a SaaS endpoint alongside your own replicas: ```yaml {nocopy=true} spec: endpoints: - selector: matchLabels: modelplane.ai/deployment: kimi-k2 - selector: matchLabels: modelplane.ai/external-provider: together ``` Endpoints with different path layouts coexist behind the one URL. 
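
The matching rules in the patterns above (all labels within one selector must match, and the entries combine) can be modeled in a few lines. This is a hedged Python sketch of those semantics only, not Modelplane's implementation: the helper functions, the endpoint names, and the `prod-us-west` cluster are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List

Labels = Dict[str, str]

DEPLOYMENT = "modelplane.ai/deployment"
CLUSTER = "modelplane.ai/cluster"


def selector_matches(match_labels: Labels, endpoint_labels: Labels) -> bool:
    """One selector: every label it names must be present with the same value."""
    return all(endpoint_labels.get(k) == v for k, v in match_labels.items())


def routed_endpoints(entries: List[Labels], endpoints: Dict[str, Labels]) -> List[str]:
    """The entries combine: keep an endpoint if any entry's selector matches it."""
    return sorted(
        name
        for name, labels in endpoints.items()
        if any(selector_matches(entry, labels) for entry in entries)
    )


def even_split(names: List[str], endpoints: Dict[str, Labels]) -> Dict[str, Fraction]:
    """Per-deployment traffic share when every matched endpoint gets an equal slice."""
    counts = Counter(endpoints[n].get(DEPLOYMENT, "external") for n in names)
    return {dep: Fraction(c, len(names)) for dep, c in counts.items()}


# Hypothetical endpoints, made up for illustration.
endpoints = {
    "qwen3-8b-east": {DEPLOYMENT: "qwen3-8b", CLUSTER: "prod-us-east"},
    "qwen3-8b-west": {DEPLOYMENT: "qwen3-8b", CLUSTER: "prod-us-west"},
    "qwen3-8b-v2-east": {DEPLOYMENT: "qwen3-8b-v2", CLUSTER: "prod-us-east"},
}

whole = routed_endpoints([{DEPLOYMENT: "qwen3-8b"}], endpoints)
part = routed_endpoints([{DEPLOYMENT: "qwen3-8b", CLUSTER: "prod-us-east"}], endpoints)
several = routed_endpoints([{DEPLOYMENT: "qwen3-8b"}, {DEPLOYMENT: "qwen3-8b-v2"}], endpoints)

print(whole)                           # both qwen3-8b replicas
print(part)                            # only the prod-us-east replica
print(even_split(several, endpoints))  # shares follow endpoint counts
```

In the last case `qwen3-8b` receives two thirds of the traffic simply because it has two of the three matched endpoints, which is why an even split is not yet a usable canary ratio.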
## Sending a request

The service's public address is on `status.address`, in the form `http://<gateway-address>/<namespace>/<name>`:

```bash
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
```

Append the OpenAI path and send a request. The `model` field is the name the engine serves (its `--served-model-name`, or the model's Hugging Face id if you didn't set one):

```bash
curl "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

## Alternate APIs

We call the endpoint OpenAI-compatible because the engines are, not because Modelplane imposes it. The route matches the `/<namespace>/<name>/` prefix and preserves the path below it on the way to the engine, so any API the engine serves is reachable on the same URL.

Take a vLLM replica that also serves the Anthropic Messages API. It answers on `.../v1/messages`, so a client that speaks it (including Claude Code, via `ANTHROPIC_BASE_URL`) talks to it directly. The engine's operational paths come through the same way: `.../health` and the Prometheus `.../metrics` are reachable on the service URL.

There's one exception, and it's set by the deployment rather than the service. [Disaggregated serving]({{< ref "model-deployment.md#disaggregated-serving" >}}) reads OpenAI-format request bodies to pick a prefill and decode worker, so a request in another API shape still reaches the engine but skips that cache-aware routing. Unified serving forwards every API shape the same way.

## Example

{{< manifests "concepts/model-service.yaml" >}}

---

# How it schedules

Source: https://docs.modelplane.ai/architecture/scheduling/

**API:** [`modelplane.ai/v1alpha1` · ModelDeployment]({{< ref "/reference/modeldeployments" >}})

When an ML team creates a [ModelDeployment]({{< ref "/models/model-deployment.md" >}}), the fleet scheduler decides which cluster each replica runs on and which node pool each engine uses.
Platform teams don't drive it directly, but what they publish, the clusters, their labels, and each pool's [InferenceClass]({{< ref "/platform/inference-class.md" >}}), is exactly what the scheduler matches against. This page explains how it places work and where it deliberately stops short, so you can reason about why a deployment landed where it did. ## A pure function of observed state The scheduler recomputes the whole placement from scratch on every reconcile. It reads the deployment, every `InferenceCluster` with its published capacity, and every existing `ModelReplica`, and returns a placement. Given the same inputs it returns the same placement, so it's safe to run continuously. The key consequence is stability. Existing replicas are *inputs*, not decisions. A healthy replica is never moved to improve the global picture, even if a better cluster appears later. This keeps placement from churning underneath a running deployment. ## Two-level matching The scheduler picks a `(cluster, pool)` for each replica in two stages, matching against what the platform team published. 1. **Clusters** are filtered by `clusterSelector.matchLabels` against the standard Kubernetes labels on each `InferenceCluster`: tier, region, provider, compliance posture. This is organizational metadata, so string equality is enough. An unset selector matches every cluster. 2. **Pools** are filtered by matching each device request in a member's `nodeSelector.devices` against the devices a pool's `InferenceClass` publishes. A request is a real DRA request: a `count` and CEL selectors over a device's attributes and capacity, such as "a GPU with at least 141Gi of memory." A pool fits a member when it has devices satisfying every request, with `count` to cover them. The CEL is the same expression an ML engineer would write in a DRA `ResourceClaim`, evaluated against the devices the `InferenceClass` declares. 
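
The two stages above can be sketched as a pair of filters. This is an illustrative Python model under stated assumptions, not the scheduler's code: plain Python predicates stand in for CEL selectors, devices are reduced to a `count` and a hypothetical `memory_gi` field, free-node accounting is left out, and the cluster, pool, and label names are invented.

```python
from typing import Callable, Dict, List, Tuple

Device = Dict[str, int]                          # simplified published device
Request = Tuple[int, Callable[[Device], bool]]   # (count, predicate standing in for CEL)


def cluster_matches(match_labels: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Stage 1: string equality on every requested label; an empty selector matches all."""
    return all(labels.get(k) == v for k, v in match_labels.items())


def pool_fits(requests: List[Request], devices: List[Device]) -> bool:
    """Stage 2: every request needs a published device that passes its predicate
    and offers at least the requested count."""
    return all(
        any(pred(d) and d["count"] >= count for d in devices)
        for count, pred in requests
    )


def candidates(match_labels, requests, fleet) -> List[Tuple[str, str]]:
    """Return every (cluster, pool) pair that survives both stages."""
    return [
        (cname, pname)
        for cname, cluster in fleet.items()
        if cluster_matches(match_labels, cluster["labels"])
        for pname, devices in cluster["pools"].items()
        if pool_fits(requests, devices)
    ]


# Hypothetical fleet, for illustration only.
fleet = {
    "eks-us-east": {
        "labels": {"region": "us-east"},
        "pools": {
            "h200": [{"count": 8, "memory_gi": 141}],
            "l4": [{"count": 1, "memory_gi": 22}],
        },
    },
    "gke-eu-west": {
        "labels": {"region": "eu-west"},
        "pools": {"h200": [{"count": 8, "memory_gi": 141}]},
    },
}

big_gpu = [(8, lambda d: d["memory_gi"] >= 141)]  # "a GPU with at least 141Gi of memory"

print(candidates({"region": "us-east"}, big_gpu, fleet))
print(candidates({}, big_gpu, fleet))
```

With the `region` selector only the `h200` pool on `eks-us-east` survives; with no selector both `h200` pools do, and the `l4` pool is rejected at the device stage in either case.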
The keys a platform team puts on a class are the contract: a `nodeSelector` matches a pool only if the class publishes the attributes and capacity it asks for. ## Co-scheduling and pools A replica is a set of engines placed together on one cluster. Within a replica, every member of a single engine is placed on **one** pool: each member carries its own `nodeSelector`, but the scheduler requires a single pool that satisfies them all. It works this way because a gang's members coordinate over their pool's interconnect fabric, and the scheduler can't reason about fabric. Pool identity is the finest grain it has. An engine split across pools risks landing its members on different fabrics. The collective then never forms, and the gang hangs with no clear error. To avoid that, the scheduler never splits an engine: an engine that no single pool satisfies isn't scheduled on that cluster. Different engines of the same replica can use different pools, but all on the same cluster. ```mermaid graph TD subgraph cluster ["One InferenceCluster"] subgraph pool1 ["Pool A"] L["prefill engine\nLeader + Worker\n(whole gang, one pool)"] end subgraph pool2 ["Pool B"] D["decode engine\nStandalone"] end end R["ModelReplica"] --> L R --> D ``` A member with no `nodeSelector` claims no devices. It matches the engine's pool at no node cost and rides along on the gang's nodes, packed there by the cluster's own scheduler. ## Counting capacity in nodes Capacity is gated on **nodes**, not on individual GPUs. The only number the scheduler reads from a member is its node cost: ```text nodes = pods × copies pods = 1 for a Standalone or Leader, or worker.nodes for a Worker ``` A member that resolves no `claim: DRA` device, because it carried no `nodeSelector` or matched only synthetic devices, costs zero nodes. The scheduler sums the cost of a replica's members and places the replica only where every engine's pool has enough free nodes, tracking a running ledger so it never overcommits a cluster. 
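
The node-cost formula and the running ledger can be worked through in a short sketch. This is a hedged Python illustration, not Modelplane's code: the `Member` fields are hypothetical stand-ins (`worker_nodes` for `worker.nodes`, `copies` for the formula's multiplier, whose source field is not spelled out here), and each engine is assumed to sit on its own pool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Member:
    role: str                 # "Standalone", "Leader", or "Worker"
    claims_dra_device: bool   # False with no nodeSelector, or only synthetic matches
    worker_nodes: int = 1     # stands in for worker.nodes; read only for a Worker
    copies: int = 1           # the "copies" factor from the formula (assumed field)


def node_cost(m: Member) -> int:
    """nodes = pods x copies, or zero when no DRA device is claimed."""
    if not m.claims_dra_device:
        return 0
    pods = m.worker_nodes if m.role == "Worker" else 1
    return pods * m.copies


def place_replica(free_nodes: Dict[str, int], engines: Dict[str, List[Member]]) -> bool:
    """Place a replica only if every engine's pool has enough free nodes,
    then charge the ledger so later placements see the reduced capacity."""
    costs = {pool: sum(node_cost(m) for m in members) for pool, members in engines.items()}
    if any(free_nodes.get(pool, 0) < cost for pool, cost in costs.items()):
        return False
    for pool, cost in costs.items():
        free_nodes[pool] = free_nodes.get(pool, 0) - cost
    return True


# Hypothetical replica: a gang on pool-a, a standalone engine on pool-b.
replica = {
    "pool-a": [
        Member("Leader", True),
        Member("Worker", True, worker_nodes=3),
        Member("Standalone", False),  # no nodeSelector: rides along at zero cost
    ],
    "pool-b": [Member("Standalone", True)],
}

ledger = {"pool-a": 5, "pool-b": 1}
print(place_replica(ledger, replica), ledger)  # first replica
print(place_replica(ledger, replica), ledger)  # second replica
```

The gang costs 1 + 3 = 4 nodes and the standalone engine costs 1, so the first replica fits and leaves one free node on `pool-a` and none on `pool-b`. The second attempt is refused without touching the ledger.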
This accounting is deliberately coarse. The control-plane scheduler answers "could this cluster plausibly host this replica," not "exactly which GPU does each pod get." Device-level contention between deployments is left to DRA admission on the workload cluster, which is authoritative: it rejects a pod whose `ResourceClaim` can't be satisfied, and the next reconcile sees the updated state.

## Pinning placement to a pool

The scheduler's pool choice is enforced, not advisory. Each scheduled pod carries a Kubernetes `nodeSelector` on the `modelplane.ai/pool` node label, so it can only land on the pool the scheduler chose. Without it, the cluster's scheduler could place a pod on any pool whose devices match its DRA claim, and the fleet's per-pool accounting would drift from where pods actually run.

Modelplane labels the nodes of every pool it provisions. On a BYO (`source: Existing`) cluster it doesn't provision the nodes, so the operator must label each pool's nodes `modelplane.ai/pool=<pool-name>` themselves, or worker pods for that pool stay `Pending`.

## Scaling, retention, and re-placement

Scheduling runs in two phases each reconcile:

- **Retain.** Each existing replica keeps its cluster if the cluster still exists and every member's pinned pool still matches its (possibly edited) `nodeSelector`. A degraded cluster, one that's not Ready or has no gateway address, is still retained; transient outages surface through the deployment's conditions, not re-placement.
- **Fill.** If the deployment wants more replicas than were retained, the shortfall is placed one at a time, each onto the eligible cluster hosting the fewest of this deployment's replicas, spreading before packing. If it wants fewer, the highest-index replicas are dropped first.

A replica never changes cluster. If its cluster is deleted, the replica stops being emitted, Crossplane garbage-collects it, and the fill phase mints a fresh replica elsewhere.
Moving is always delete-plus-create, mirroring how Kubernetes treats a pod whose node is gone. ## Known limitations The scheduler is built to be conservative and predictable rather than optimal. Two limits follow from that, both tracked for future work: - **A whole node is charged per pod** ([#172](https://github.com/modelplaneai/modelplane/issues/172)). A pod that claims one GPU of an eight-GPU node still charges the whole node in the scheduler's accounting. This is safe, it can only under-count a pool's capacity, never overcommit it, but it can strand GPUs on deployments of sub-node engines. - **An engine can't span pools, even on one fabric** ([#149](https://github.com/modelplaneai/modelplane/issues/149)). Because the scheduler has no concept of fabric, it refuses to split a gang across pools at all. That forecloses a legitimate case, GPU workers on one pool and a no-GPU coordinator on another within the same fabric, until fabric-aware placement lands. --- # How Modelplane works Source: https://docs.modelplane.ai/overview/how-it-works/ Modelplane runs as a control plane on its own cluster, the **control cluster**, above the **inference clusters** that actually serve models. It's built on [Crossplane](https://crossplane.io): platform teams and developers describe what they want as Kubernetes resources, and Modelplane continuously reconciles the fleet to match, composing the clusters, scheduling replicas, and exposing endpoints. This page is the full tour. It covers the architecture and resources, then walks through what happens when you deploy a model. ## Modelplane API Modelplane's API is two sets of resources, one per team, with everything in between filled in for you. Platform teams describe the fleet, ML teams describe a model, and Modelplane composes the rest. 
The hierarchy mirrors Kubernetes core one scope up: `ModelDeployment` → `ModelReplica` → `ModelService` → `ModelEndpoint` parallels `Deployment` → `Pod` → `Service` → `Endpoint`, across a fleet instead of within a single cluster. ## What the control plane reconciles Once the resources exist, Modelplane keeps the fleet matching them. Five concerns run continuously: 1. **Provisioning.** From an `InferenceCluster`, Modelplane creates a full cluster and its GPU node pools, or brings in a cluster you already run on any Kubernetes, and installs the serving stack on each. 2. **Scheduling.** A two-level scheduler places work: it pins each `ModelReplica` to a cluster and pool whose hardware meets the model's requirements, then the cluster's own scheduler binds the GPUs to the serving pods through DRA. 3. **Autoscaling.** Replicas are the scaling axis. Scaling a `ModelDeployment`'s `spec.replicas` adds or removes whole serving instances through the standard Kubernetes scale subresource, so `kubectl scale` or a KEDA `ScaledObject` work out of the box. 4. **Routing.** A `ModelService` exposes one OpenAI-compatible endpoint through the gateway and load-balances across the deployment's `ModelEndpoints`, wherever their replicas run. `ModelEndpoints` can also point at external inference services. 5. **Caching.** A `ModelCache` stages model weights on cluster storage once, so serving pods read them locally instead of re-downloading on every start. ## Universal compatibility Modelplane is deliberately unopinionated about the engine. A `ModelDeployment` describes the *shape* of a deployment, how many pods, on how many nodes, with which devices, and nothing about how the engine runs internally. The engine flags you write carry parallelism (tensor, pipeline, data, expert), quantization, and KV transfer; Modelplane never injects them. This is what lets one API serve any container-based engine and any topology without special cases. 
Modelplane composes the engine onto the right cluster resource and injects almost nothing, just the address a multi-node leader is reachable at, so a worker can join it. New engines and new parallelism strategies work without a change to Modelplane. The community publishes recipes (worked, copyable manifests) to bridge the gap that flexibility leaves, rather than hard-coding choices into the API. ## Fleet scheduler For each replica, the scheduler picks a `(cluster, pool)` in two steps: 1. **Filter clusters** by `clusterSelector.matchLabels` against the standard Kubernetes labels on each `InferenceCluster`, the organizational metadata: tier, region, provider, compliance posture. 2. **Filter pools** by matching each device request in the deployment's `nodeSelector.devices` against the pool's `InferenceClass`. A request is based on DRA: a `count` and CEL selectors over a device's attributes and capacity, like "a GPU with at least 141Gi of memory." A pool fits when it has the devices the model asks for and enough free nodes to hold a replica. Capacity is accounted at the node level across the fleet, so Modelplane never overcommits a pool. Replicas are pinned to their cluster once placed and stay there across reconciles; if a cluster is deleted, the scheduler re-places its replicas elsewhere. [How it schedules]({{< ref "/architecture/scheduling.md" >}}) covers the placement rules and their limits in full. ## Deploying a model Creating a `ModelDeployment` kicks off the loop end to end. The scheduler discovers the ready clusters (filtered by your label selector if you set one), matches each engine's device requests against their pools, and pins each replica to a cluster that fits. Modelplane composes a `ModelReplica` on each chosen cluster, turns it into the right serving workload there, creates a `ModelEndpoint` per replica, and your `ModelService` routes traffic across them through one stable endpoint on the gateway. 
Scale the deployment up or down and the same loop re-converges. ## Serving topologies A single-node deployment composes to a Kubernetes Deployment fronted by a service. When a model is too large for one node, an engine becomes a gang: a `Leader` member and one or more `Worker` members that Modelplane composes into a LeaderWorkerSet, serving the model together across nodes. Gang deployments should stage their weights through a `ModelCache`, so the pods share one copy instead of each pulling the same model. Disaggregated serving splits prefill and decode into separate engines (`serving.mode: PrefillDecode`) that run on the same cluster and hand off the KV cache between them. Modelplane wires up the cluster-edge routing that pairs each request's prefill and decode; the engines carry the KV-transfer flags. Both are described in full in the [model deployment docs]({{< ref "/models/model-deployment" >}}). ## Next steps {{< cardgroup cols="2" >}} {{< card title="FAQ" href="/overview/faq/" >}} Quick answers on how Modelplane compares and what it requires. {{< /card >}} {{< card title="Get started" href="/getting-started/" >}} Put it together: deploy Modelplane and serve a model. {{< /card >}} {{< /cardgroup >}} --- # Qwen3-Coder-480B Source: https://docs.modelplane.ai/examples/qwen3-coder/ A 480B code MoE (35B active). Two validated shapes: the BF16 weights span two H200 nodes as a gang over EFA, served from a `ModelCache`; the FP8 checkpoint fits one node, so it runs as a single `Standalone` engine on SGLang with no cache. Both shapes were run end to end; the `InferenceClass` and `ModelDeployment` are the exact manifests from those runs. Apply the platform side first, then the ML side. The `InferenceCluster` carries an EC2 capacity reservation placeholder to edit before applying. 
## Platform {{< tabs >}} {{< tab "Multi-node (BF16)" >}} {{< manifests "examples/qwen3-coder/inference-class.yaml" >}} {{< manifests path="examples/qwen3-coder/inference-cluster.yaml" apply="false" >}} {{< editCode >}} ```bash curl -fsSL {{< manifest-url "examples/qwen3-coder/inference-cluster.yaml" >}} \ | sed 's/cr-0123456789abcdef0/$@$@/' \ | kubectl apply -f - ``` {{< /editCode >}} {{< /tab >}} {{< tab "Single-node (FP8)" >}} {{< manifests "examples/qwen3-coder/inference-class-fp8.yaml" >}} {{< /tab >}} {{< /tabs >}} ## Deployment {{< tabs >}} {{< tab "Multi-node (BF16)" >}} {{< manifests "examples/qwen3-coder/model-cache.yaml" >}} {{< manifests "examples/qwen3-coder/model-deployment.yaml" >}} {{< manifests "examples/qwen3-coder/model-service.yaml" >}} {{< /tab >}} {{< tab "Single-node (FP8)" >}} {{< manifests "examples/qwen3-coder/model-deployment-fp8.yaml" >}} {{< manifests "examples/qwen3-coder/model-service-fp8.yaml" >}} {{< /tab >}} {{< /tabs >}} --- # Cache Model Weights Source: https://docs.modelplane.ai/models/model-cache/ **API:** [`modelplane.ai/v1alpha1` · ModelCache]({{< ref "/reference/modelcaches" >}}) A `ModelCache` stages a model's weights on shared workload-cluster storage, fetched once from the configured source rather than downloaded again on every pod start. `ModelDeployments` reference a cache via `spec.modelCacheRef.name`, and Modelplane mounts it at `/mnt/models` in every serving pod, shared across the pods of a multi-node engine. The engine reads weights locally from the mount. `ModelCache` is recommended for multi-node deployments and optional for single-node cold-start optimization. ## What to cache The required `source` enum names the kind, with the matching source object set alongside it. Setting `source: HuggingFace` selects `spec.huggingFace`, which carries the `repo` to fetch, an optional `revision` (branch, tag, or commit), and `sizeGiB`, how much storage the weights get on each cluster. 
Size it to the model, since a value below the model's size leaves no room to stage the weights. `HuggingFace` is the only source today. The cache mounts at `/mnt/models` on every consuming pod, so the engine's args reference that path (`--model=/mnt/models` for vLLM) rather than the source. ## Authenticating A gated or private model needs a credential to fetch. When a cache stages the weights, the credential lives on the cache: set `authSecret` to name a Secret in the cache's namespace, and Modelplane propagates it to every cluster the cache stages to, for the hydration to read. Create the Secret once on the control plane, then reference it: ```bash kubectl create secret generic hf-token \ --namespace ml-team \ --from-literal=HF_TOKEN=hf_xxxxxxxx ``` ```yaml {nocopy=true} spec: source: HuggingFace huggingFace: repo: Qwen/Qwen3-Coder-480B-A35B-Instruct authSecret: name: hf-token # a Secret in this ModelCache's namespace key: HF_TOKEN # defaults to HF_TOKEN sizeGiB: 1100 ``` Without a cache, the engine fetches the model itself at startup, so the credential goes on the `ModelDeployment` instead, as `HF_TOKEN` in the engine container's `env`. ## Where to cache An optional `clusterSelector` scopes where the cache is staged. Omitting it stages the cache on every cluster in the fleet; setting `matchLabels` restricts it to clusters carrying those labels. A `ModelDeployment` that references the cache places *new* replicas only onto clusters within this footprint, so narrowing the selector also narrows where replicas can land: a replica never schedules to a cluster the cache didn't stage to. Replicas already running are left where they are. ## Loading from cache A cache only pays off if the engine reads from it quickly. With its default loader an engine can read a large model from shared storage slowly enough that the cache makes cold starts *worse* than fetching the model directly, since you pay to hydrate the cache and then wait on a slow read. 
Choose a fast loader with your engine flags. For vLLM on EKS, `--load-format=runai_streamer` reads from the EFS-backed cache dramatically faster than the default loader (minutes rather than tens of minutes for a large model), tuned further with `--model-loader-extra-config`: ```yaml {nocopy=true} args: - --model=/mnt/models - --load-format=runai_streamer - --model-loader-extra-config={"concurrency":16,"distributed":true} ``` The right loader and settings depend on the engine and the storage backend, so treat these as a starting point and measure your own cold-start time. The [Kimi-K2 example]({{< ref "/examples/kimi-k2" >}}) uses this configuration end to end. ## Storage prerequisites The cache PVC needs a `ReadWriteMany` (RWX) StorageClass on the workload cluster. What the platform admin must set up depends on the cloud: - **GKE** and **EKS:** auto-provisioned. Nothing for the admin to do. - **Existing:** the admin sets up a `ReadWriteMany` StorageClass on the cluster. Either way, your `ModelCache` and `ModelDeployment` specs are the same. How storage is provided on each cluster source, and how to bring your own backend, is covered in [Register a Cluster]({{< ref "/platform/inference-cluster.md#cache-storage" >}}). ## Example {{< manifests "concepts/model-cache.yaml" >}} --- # Deploying a model Source: https://docs.modelplane.ai/getting-started/deploying-a-model/ Now that the platform is provisioned, the ML team can declare what a model needs with a `ModelDeployment`. Describe the hardware requirements and the scheduler schedules against the capacity the platform team published. ## Create a deployment Create a namespace for the model: ```bash kubectl create namespace ml-team ``` The device selector matches against the capacity declared in the `InferenceClass`, not the pod's resource requests. 
Any L4 node satisfies `>= 20Gi`, so this deployment runs on the cluster you just added:

{{< tabs >}}
{{< tab "EKS" >}}
{{< manifests "getting-started/eks/model-deployment.yaml" >}}
{{< /tab >}}
{{< tab "GKE" >}}
{{< manifests "getting-started/gke/model-deployment.yaml" >}}
{{< /tab >}}
{{< /tabs >}}

Wait until `REPLICAS` shows `1`:

```bash
kubectl get md -n ml-team --watch
```

To see which cluster the scheduler chose:

```bash
kubectl get modelreplica -n ml-team
```

```shell{nocopy=true}
NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   12m
```

The ML team never named a cluster. The scheduler matched the GPU requirement (`>= 20Gi`) against the `InferenceClass` the platform team published and made the placement.

## Expose the model

A `ModelService` selects `ModelEndpoints` by label and creates a Gateway API `HTTPRoute` that routes to them. Modelplane creates one `ModelEndpoint` per replica, labeled with the deployment name:

{{< manifests "getting-started/model-service.yaml" >}}

The request path is `/<namespace>/<name>/...` (`/ml-team/qwen/` in this example), from the `ModelService` named `qwen`. The `model` field in the request body is the Hugging Face id `Qwen/Qwen2.5-0.5B-Instruct`, since this deployment doesn't set `--served-model-name`.

## Send a request

Read the endpoint's public address from the `ModelService` status:

```bash
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
```

Send a request to it:

```bash
kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'
```

The request routes to the replica on the cluster Modelplane placed it on.
You should get a response in a few seconds: ```json {nocopy=true} { "id": "chatcmpl-c88b1429-067d-40a5-971c-ab9c54153c26", "model": "Qwen/Qwen2.5-0.5B-Instruct", "choices": [ { "message": { "role": "assistant", "content": "Kubernetes (K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides scalable orchestration capabilities that enable developers to deploy complex applications quickly and efficiently across various environments." }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 37, "completion_tokens": 48, "total_tokens": 85 } } ``` ## Next step The platform team declared capacity and in this guide the ML team deployed a model behind a stable endpoint. Neither team needed to know what the other was doing. Modelplane matched them. In the next step, the platform team grows the fleet. [Scale the platform]({{< ref "getting-started/scale-the-platform.md" >}}) to add more clusters across regions. --- # Kimi-K2 Source: https://docs.modelplane.ai/examples/kimi-k2/ A 1T MoE (1 trillion parameters) served prefill/decode disaggregated across two H200 nodes: two engines, one per phase, with Modelplane composing the llm-d routing layer between them. This recipe serves an INT4 quantization of the model; the native FP8 weights need four such nodes. This recipe was run end to end; the `InferenceClass` and `ModelDeployment` are the exact manifests from that run. Apply the platform side first, then the ML side. The `InferenceCluster` carries an EC2 capacity reservation placeholder to edit before applying. 
## Platform {{< manifests "examples/kimi-k2/inference-class.yaml" >}} {{< manifests path="examples/kimi-k2/inference-cluster.yaml" apply="false" >}} {{< editCode >}} ```bash curl -fsSL {{< manifest-url "examples/kimi-k2/inference-cluster.yaml" >}} \ | sed 's/cr-0123456789abcdef0/$@$@/' \ | kubectl apply -f - ``` {{< /editCode >}} ## Deployment {{< manifests "examples/kimi-k2/model-cache.yaml" >}} {{< manifests "examples/kimi-k2/model-deployment.yaml" >}} {{< manifests "examples/kimi-k2/model-service.yaml" >}} --- # Register a Cluster Source: https://docs.modelplane.ai/platform/inference-cluster/ **API:** [`modelplane.ai/v1alpha1` · InferenceCluster]({{< ref "/reference/inferenceclusters" >}}) An `InferenceCluster` represents a Kubernetes cluster configured for model serving. Platform teams create these to provide GPU capacity. Each cluster has: - A **cluster source**: `GKE` or `EKS` (Modelplane provisions the full cluster) or `Existing` (bring a cluster you manage yourself). See [Supported Providers]({{< ref "platform/providers.md" >}}) for the clouds and neoclouds Modelplane runs on. - One or more **node pools**, each referencing an `InferenceClass` for its hardware capabilities and provisioning recipe. - **Labels** for organizational metadata: tier, region, provider. These are the matching surface for `ModelDeployment.clusterSelector`. Modelplane installs the serving stack it needs on every cluster it manages, including existing clusters, which it assumes are solely for its use. ## Ownership and requirements Modelplane assumes exclusive ownership of every `InferenceCluster`. The fleet scheduler's capacity accounting relies on Modelplane being the only thing placing GPU workloads on the cluster, so dedicate each cluster to Modelplane rather than sharing it with other workloads. Modelplane also has opinions about how a cluster is set up: its Kubernetes version, the components it installs, and required features like DRA for binding GPUs to pods. 
On provisioned clusters Modelplane handles this for you. On an existing cluster the platform team must meet the requirements.

## Provisioned and existing clusters

The `cluster.source` discriminator picks one of two models:

- **Provisioned (`GKE`, `EKS`).** Modelplane creates the cluster and its GPU node pools from each pool's `InferenceClass`, labels the pool's nodes so the scheduler's placement is enforced, and provisions the storage class for model weights. It also injects a non-GPU **system pool** with opinionated defaults to run the inference stack, so you only declare the GPU pools you want.
- **Existing (`Existing`).** A kubeconfig `Secret` provides access to a cluster you run yourself. Modelplane installs the serving stack it needs but doesn't provision infrastructure, and each pool's `InferenceClass` provides hardware capabilities for scheduling only. You're responsible for the cluster meeting Modelplane's requirements, including labeling each pool's nodes `modelplane.ai/pool=<pool-name>` (see [how scheduling pins placement]({{< ref "/architecture/scheduling.md#pinning-placement-to-a-pool" >}})).

## Examples

{{< tabs >}}
{{< tab "GKE" >}}
{{< manifests path="concepts/inference-cluster-gke.yaml" apply="false" >}}
{{< /tab >}}
{{< tab "EKS" >}}
{{< manifests path="concepts/inference-cluster-eks.yaml" apply="false" >}}
{{< /tab >}}
{{< tab "Existing" >}}
{{< manifests path="concepts/inference-cluster-existing.yaml" apply="false" >}}
{{< /tab >}}
{{< /tabs >}}

## Cache storage

A [ModelCache]({{< ref "/models/model-cache.md" >}}) stages model weights on a `ReadWriteMany` (RWX) StorageClass on the workload cluster. Where that comes from depends on the source:

- **`GKE`** (Filestore Enterprise) and **`EKS`** (EFS): auto-provisioned. Those classes are fixed; nothing for the admin to do.
- **`Existing`**: bring your own.
Create an RWX StorageClass on the cluster, with any backend that supports automatic PVC provisioning (WekaIO, NetApp Trident, `FSx` for NetApp, and similar), and name it in `cluster.existing.cache.storageClassName`.

The ML team's `ModelCache` and `ModelDeployment` specs are the same regardless of which backing storage a cluster uses.

---

# FAQ

Source: https://docs.modelplane.ai/overview/faq/

Short answers to the questions that come up first, with links to the full treatment. If you're new here, read the [Introduction]({{< ref "/overview" >}}) and [How Modelplane works]({{< ref "/overview/how-it-works" >}}) first.

## What Modelplane is

{{< qa "Is Modelplane a serving engine like vLLM?" >}}
No, Modelplane is the control plane *above* the engine. It composes serving engines like vLLM, SGLang, and NVIDIA TensorRT-LLM, and operates them across a fleet of clusters. It doesn't serve tokens itself. You bring the engine; Modelplane schedules it, routes to it, scales it, and caches its weights across your inference fleet.
{{< /qa >}}

{{< qa "Does Modelplane replace vLLM or SGLang?" >}}
No, they run the model; Modelplane runs the fleet. A `ModelDeployment` carries your engine container and its flags, and Modelplane composes it onto the right cluster. Switching or upgrading engines is a change to your deployment, not to Modelplane.
{{< /qa >}}

{{< qa "How is Modelplane different from KServe or NVIDIA Dynamo?" >}}
Scope. KServe and Dynamo are cluster orchestrators: they schedule, scale, route, and cache within a single Kubernetes cluster. Modelplane runs its operations across a fleet of clusters, clouds, and regions. Modelplane uses llm-d for multi-node serving and KV-cache management, as do KServe and Dynamo. Modelplane is planning deeper integrations with NVIDIA Dynamo in future releases.
{{< /qa >}}

{{< qa "How is Modelplane different from a managed provider like Baseten or Fireworks?" >}}
Managed providers run fleet-scale serving inside their own closed platform.
Modelplane is the open equivalent that runs in infrastructure you own. The difference isn't scope: Modelplane is open, runs in your own infrastructure, is community-driven, and is neutral across the stack. You can still route to a managed provider from Modelplane.
{{< /qa >}}

## What it supports

{{< qa "What models does Modelplane support?" >}}
Modelplane supports any model, including open weights, custom models, and just about anything that can be downloaded from Hugging Face, NVIDIA NGC, and other registries.
{{< /qa >}}

{{< qa "Does Modelplane support NVIDIA?" >}}
Yes, across the stack. NVIDIA GPUs are the most widely available accelerators on the clouds Modelplane runs on and the primary target today. Modelplane binds NVIDIA GPUs to pods through Dynamic Resource Allocation (DRA), matching devices by attributes such as GPU memory and architecture with CEL selectors.

The software stack rides on the engine-agnostic API. NVIDIA NIM microservices and the TensorRT-LLM engine run as engine containers like any other. Modelplane stages weights and NIM-style artifacts from NVIDIA NGC alongside Hugging Face and other registries. The inference stack it installs includes NVIDIA Dynamo and llm-d, with deeper Dynamo integration on the roadmap.
{{< /qa >}}

{{< qa "Which engines and accelerators are supported?" >}}
The API is engine-agnostic: any engine that runs as a container works, and its flags are yours to write. Multiple accelerators are supported as long as they can be bound through DRA, and the device model (DRA plus CEL selectors) is built to extend to other accelerators and fabrics.
{{< /qa >}}

{{< qa "Which clouds or neoclouds does Modelplane support?" >}}
Today Modelplane provisions clusters on GKE and EKS, and supports bringing your own Kubernetes cluster anywhere. More provisioners are on the roadmap; the bring-your-own path means you can run on any Kubernetes cluster now.
See [Supported Providers]({{< ref "platform/providers.md" >}}) for the full matrix of clouds, neoclouds, and their Crossplane providers. {{< /qa >}} {{< qa "Can I bring my own cluster, or run on a neocloud or on-premise?" >}} Yes, an `InferenceCluster` with `source: Existing` registers a cluster you already run, through its kubeconfig. Modelplane installs the serving stack it needs but doesn't provision the infrastructure. This is how you run on neoclouds and on-premise today. {{< /qa >}} ## What it requires {{< qa "Where does Modelplane run?" >}} Modelplane runs as a control plane on a control cluster: an ordinary Kubernetes cluster with Crossplane installed, with no GPUs of its own. The inference clusters it manages do the serving, and each needs Dynamic Resource Allocation (DRA, Kubernetes v1.35+) to bind GPUs to pods. Modelplane assumes exclusive ownership of every inference cluster, so dedicate each one to Modelplane rather than sharing it with other workloads. {{< /qa >}} {{< qa "Do I need Crossplane?" >}} Yes, Modelplane is built on [Crossplane](https://crossplane.io) and requires it. If your platform team already runs Crossplane to manage cloud infrastructure, Modelplane is the same pattern applied to inference. Modelplane uses Crossplane's function framework and shares its infrastructure providers. {{< /qa >}} ## What it can do {{< qa "How does Modelplane decide where a model runs?" >}} Two-level matching. First it filters clusters by their labels (tier, region, provider) against your `clusterSelector`. Then it filters node pools by matching your device requests, real DRA requests with CEL selectors over GPU memory, architecture, and other attributes, against each pool's `InferenceClass`. It places each replica on a cluster and pool that fits and has free capacity. {{< /qa >}} {{< qa "Can I serve across regions and clusters behind one endpoint?" >}} Yes, that's the point. 
A `ModelService` exposes one OpenAI-compatible endpoint and load-balances across every replica of a deployment, wherever they run. {{< /qa >}} {{< qa "Can I route to a managed provider?" >}} Yes, a `ModelService` can include a manually created `ModelEndpoint` that points at an external SaaS endpoint like Together or Baseten alongside your self-hosted replicas, and load-balances across all of them. {{< /qa >}} {{< qa "How do large or multi-node models work?" >}} An engine can be a gang: a leader and one or more workers that Modelplane composes into a LeaderWorkerSet across nodes. You write the coordination (like Ray or vLLM's data-parallel coordinator) in the engine flags, and Modelplane injects the leader's address so the workers can join it. Multi-node deployments stage weights through a `ModelCache`. {{< /qa >}} {{< qa "What about disaggregated prefill/decode?" >}} Set `serving.mode: PrefillDecode` and define separate prefill and decode engines. Both run on the same cluster, hand off the KV cache over a fast fabric, and Modelplane configures the cluster-edge routing that pairs each request. The KV-transfer flags live in your engine config. {{< /qa >}} {{< qa "How does scaling work?" >}} Replicas are the only scaling axis. Each replica is a complete serving instance; scaling `spec.replicas` adds or removes whole instances across the fleet. Because a `ModelDeployment` exposes the Kubernetes scale subresource, `kubectl scale` and KEDA work without anything extra. There's no per-pod autoscaling inside a cluster. {{< /qa >}} {{< qa "How are model weights handled?" >}} A `ModelCache` stages weights once per cluster on shared (ReadWriteMany) storage, and every pod reads them locally. Pods don't re-download on each start, and concurrent starts don't race. It hydrates from Hugging Face today, is optional for single-node deployments, and is recommended for multi-node ones. {{< /qa >}} ## The project {{< qa "Why did you pick Modelplane as a name for the project?" 
>}} It's a fusion of AI Model and Control Plane. We also like that it implies that AI models are their own layer (or plane) in the stack. {{< /qa >}} {{< qa "What does the logo signify?" >}} Three popsicle sticks assembled to make a model plane. Balsa wood planes were the inspiration. {{< /qa >}} {{< qa "Is Modelplane production-ready?" >}} Modelplane is in early development and moving fast. Treat it as early software. The [platform docs]({{< ref "/platform" >}}) are specific about what's available today versus what's planned. We are building it in the open. {{< /qa >}} {{< qa "What's the license and governance?" >}} Modelplane is [Apache 2.0](https://github.com/modelplaneai/modelplane/blob/main/LICENSE), with no usage caps or token metering, and is developed in the open. It's neutral across models, engines, accelerators, and clouds, and is intended for donation to a neutral open source foundation. It's a project from Upbound, the team behind Rook and Crossplane, both CNCF Graduated and widely adopted projects. {{< /qa >}} {{< qa "How do I get involved?" >}} Issues, discussions, and contributions are welcome on [GitHub](https://github.com/modelplaneai/modelplane). See `CONTRIBUTING.md` for development setup and the project's conventions. {{< /qa >}} ## Next steps {{< cardgroup cols="2" >}} {{< card title="Get started" href="/getting-started/" >}} Deploy Modelplane and serve your first model. {{< /card >}} {{< card title="How Modelplane works" href="/overview/how-it-works/" >}} The architecture and the control loop, in one page. {{< /card >}} {{< /cardgroup >}} --- # Glossary Source: https://docs.modelplane.ai/overview/glossary/ ## Modelplane The open source control plane software. You install Modelplane on a Kubernetes cluster (the **control cluster**). Modelplane never serves tokens itself; it orchestrates the clusters and engines that do. ## Control cluster The Kubernetes cluster where Modelplane runs. It needs no GPUs. 
It holds Modelplane's Crossplane-based components and the API resources you apply to declare your fleet. ## Inference cluster A GPU cluster in the fleet where serving engines run and tokens are produced. Modelplane can provision inference clusters on EKS, GKE, and other providers, or you can bring your own through an `InferenceCluster` with `source: Existing`. ## Fleet All inference clusters managed by a single Modelplane control cluster. ## Platform The inference infrastructure the platform team provisions using `InferenceGateway`, `InferenceClass`, and `InferenceCluster` resources. This is distinct from Modelplane itself, which runs on the control cluster above the fleet. ## Platform team The infrastructure team responsible for GPU capacity. They create `InferenceCluster`, `InferenceClass`, and `InferenceGateway` resources, provisioning the fleet that ML teams deploy against. ## ML team The development team deploying models. They create `ModelDeployment`, `ModelService`, and `ModelCache` resources, declaring what a model needs without knowing which cluster it runs on. --- # AI tools Source: https://docs.modelplane.ai/overview/ai-tools/ The Modelplane docs are built to be read by AI assistants as well as people. You can connect a coding agent directly to this site, pull any page as Markdown, or point a model at a single index file that lists the whole documentation set. Every page also carries a **Copy page** menu next to its title with the same shortcuts. ## Connect to the MCP server The documentation MCP server lets an assistant search these docs and read any page in real time, so its answers track the current content instead of its training data. It exposes two tools: - `search_modelplane_docs`: search the docs and get back the most relevant sections with their titles, URLs, and snippets. - `get_modelplane_doc`: fetch the full Markdown of a single page. 
The server URL is: ```plaintext https://docs.modelplane.ai/mcp ``` {{< tabs >}} {{< tab "Claude Code" >}} ```bash claude mcp add --transport http modelplane-docs https://docs.modelplane.ai/mcp ``` {{< /tab >}} {{< tab "Claude Desktop" >}} Open Settings, go to Connectors, and choose **Add custom connector**. Name it `modelplane-docs`, enter the server URL above, and enable the connector when you start a conversation. {{< /tab >}} {{< tab "Cursor" >}} Open the command palette, run **Cursor Settings: MCP**, and add a server to `mcp.json`: ```json { "mcpServers": { "modelplane-docs": { "url": "https://docs.modelplane.ai/mcp" } } } ``` {{< /tab >}} {{< tab "VS Code" >}} Create `.vscode/mcp.json` in your workspace: ```json { "servers": { "modelplane-docs": { "type": "http", "url": "https://docs.modelplane.ai/mcp" } } } ``` {{< /tab >}} {{< tab "Other" >}} Any MCP client that speaks the streamable HTTP transport can connect to the server URL directly. No authentication is required. {{< /tab >}} {{< /tabs >}} The **Copy page** menu on every page also has **Connect to Cursor** and **Connect to VS Code** shortcuts that install the server in one click. ## Read pages as Markdown Every page is also published as raw Markdown. Add `index.md` to any page URL: ```plaintext https://docs.modelplane.ai/models/model-deployment/index.md ``` The **Copy page** control next to each title copies that Markdown to your clipboard, and **View as Markdown** opens it in the browser. Paste it into any assistant when you want to ground a question in a specific page. ## llms.txt For tools that index a whole site, the docs publish the [`llms.txt`](https://llmstxt.org) format: - [`llms.txt`](/llms.txt): a short index of every page with links and descriptions. - [`llms-full.txt`](/llms-full.txt): every page concatenated into one Markdown file. 
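As a rough sketch of the `index.md` convention described under "Read pages as Markdown", the helper below builds the Markdown URL for a page and can download it. The function names are illustrative and not part of Modelplane; only the URL pattern comes from these docs, and the download step assumes the site is reachable from where the script runs.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def markdown_url(page_url: str) -> str:
    """Return the raw-Markdown URL for a docs page by appending index.md.

    Any query string or fragment on the page URL is dropped.
    """
    parts = urlsplit(page_url)
    # Normalize the path so it ends in exactly one trailing slash.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path + "index.md", "", ""))


def fetch_markdown(page_url: str, timeout: float = 10.0) -> str:
    """Download a page's Markdown. Requires network access."""
    with urlopen(markdown_url(page_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_markdown("https://docs.modelplane.ai/models/model-deployment/")` would return that page's Markdown as a string, ready to paste into a prompt or write to disk.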
## Page menu reference

The **Copy page** menu next to each title has these actions:

{{< table >}}
| Action | What it does |
|---|---|
| Copy page | Copies the page as Markdown to your clipboard. |
| View as Markdown | Opens the page as raw Markdown. |
| Copy MCP Server | Copies the MCP server URL to your clipboard. |
| Connect to Cursor | Installs the MCP server in Cursor. |
| Connect to VS Code | Installs the MCP server in VS Code. |
{{< /table >}}

---

# Architecture

Source: https://docs.modelplane.ai/architecture/

Modelplane's central design choice is to build the control plane on [Crossplane](https://crossplane.io) rather than as a bespoke set of Kubernetes controllers. Everything else here follows from that. This section assumes you're comfortable with Kubernetes; the Crossplane vocabulary you need is below.

## Crossplane in brief

[Crossplane](https://crossplane.io) extends Kubernetes to manage things beyond the cluster (cloud infrastructure, SaaS, and in Modelplane's case inference fleets) through the same declarative, reconciled API model. Three of its concepts matter here:

- **Composite Resources (XRs)** are custom resources whose controller, instead of talking to an external API directly, declares a set of other resources that should exist. Every Modelplane API (`InferenceCluster`, `ModelDeployment`, `ModelService`, and the rest) is an XR.
- **Composition functions** are that controller logic. A function is a small gRPC service that is handed the observed XR and the resources it depends on, and returns the desired child resources. An XR runs a pipeline of one or more functions every reconcile; in Modelplane each pipeline is typically a single function, so the rest of this section says "the function" for short.
- **Providers** are controllers that manage external systems through their own managed resources: `provider-gcp` and `provider-aws` for cloud APIs, `provider-helm` for Helm releases, `provider-kubernetes` for arbitrary objects on any cluster.
A composition function composes these like any other resource. Put together: a Modelplane API is an XR, its logic is a composition function, and the function composes a mix of plain Kubernetes objects, other Modelplane XRs, and provider resources. The resource model mirrors Kubernetes core, one scope up: `ModelDeployment` → `ModelReplica` → `ModelService` → `ModelEndpoint` parallels `Deployment` → `Pod` → `Service` → `Endpoint`, but across a fleet of clusters rather than within one. A `ModelDeployment` composes a `ModelReplica` per replica, a `ModelReplica` composes the serving workload on its target cluster, and a `ModelService` routes across the `ModelEndpoint`s. If you know how those core objects relate, you already know the shape of Modelplane's. ## Why Crossplane? Modelplane is, at its core, a system that turns declarative resources into composed infrastructure spanning cloud accounts, many Kubernetes clusters, and the workloads on them. That's the problem Crossplane solves, and it helps in two ways: providers and functions. **Providers** give us reach. Modelplane has to provision Kubernetes clusters and all the infrastructure they need across different clouds, then install software onto them. That's an enormous surface, and providers cover it without us rolling our own controllers for each cloud API and Helm release. **Functions** are where Modelplane's own logic lives, and writing it as composition functions buys several things: - **Business logic, not controller plumbing.** A function computes desired state from observed state. Crossplane handles the fiddly Kubernetes controller details, the watches, requeues, finalizers, and drift correction, that a hand-written controller gets wrong in a dozen subtle ways. Less plumbing to write and maintain means we move faster. - **Testability.** A function is a pure function of its inputs, so you can test it as a black box: feed it an XR and its dependencies, assert on the resources it returns. 
The whole test runs in process, with no API server to stand up. - **The right language for each job.** Functions can be written in any language. Modelplane's are Python, for fast iteration on the serving and scheduling logic and because Python is the common language of the ML world, which lowers the bar for contributors. The performance-sensitive distributed-systems core stays in Go, where Crossplane and its providers already are. The bet underneath both is that inference infrastructure is the same shape of problem as cloud infrastructure, which Crossplane manages well. Building on it lets Modelplane spend its effort on the part that's actually inference-specific. ## The control cluster and the fleet Modelplane runs on a **control cluster** and manages a fleet of **workload clusters**, the `InferenceCluster`s. The split is deliberate: the control plane holds no GPUs and serves no tokens. It schedules, composes, and routes; the workload clusters do the serving. The control cluster runs Crossplane, the Modelplane composition functions (one per resource, each a pod Crossplane calls per reconcile), the providers, and the control-plane gateway. It also holds every Modelplane resource and the `ProviderConfig`s that let the providers reach each workload cluster, built from that cluster's kubeconfig. Crossplane core drives everything. Each reconcile it asks a function what a resource should compose and gets back the desired resources. Core then reconciles them, applying the provider resources that the providers act on. A function only computes desired state. It never reaches a provider or a cluster itself. 
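To make the "desired state only" contract concrete, here is a minimal sketch in plain Python. It is not the Crossplane function SDK and not Modelplane's actual code: the `compose` name, the dict shapes, the `ready` flag, and the child naming scheme are all assumptions made for illustration. It mirrors the behavior described for a `ModelDeployment` (one `ModelReplica` per replica, plus a `ModelEndpoint` for each replica that is ready).

```python
def compose(xr: dict, observed: dict) -> dict:
    """Compute desired children for a toy ModelDeployment-like XR.

    xr:       the composite resource as a plain dict.
    observed: child name -> observed status; only a hypothetical
              boolean "ready" key is read here.
    """
    name = xr["metadata"]["name"]
    desired = {}
    for i in range(xr["spec"].get("replicas", 1)):
        replica = f"{name}-{i}"  # hypothetical naming scheme
        desired[replica] = {"kind": "ModelReplica", "metadata": {"name": replica}}
        # Compose an endpoint only once the replica is observed ready.
        if observed.get(replica, {}).get("ready", False):
            endpoint = f"{replica}-endpoint"
            desired[endpoint] = {"kind": "ModelEndpoint", "metadata": {"name": endpoint}}
    return desired


# Black-box check: feed in an XR and observed state, assert on the result.
xr = {"metadata": {"name": "qwen-demo"}, "spec": {"replicas": 2}}
result = compose(xr, {"qwen-demo-0": {"ready": True}})
assert sorted(result) == ["qwen-demo-0", "qwen-demo-0-endpoint", "qwen-demo-1"]
```

Nothing in the sketch touches a cluster or a provider. It only returns data, which is what lets a test like the one at the bottom run in process.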
```mermaid
flowchart TB
    subgraph control["Control cluster"]
        cp["Crossplane core"]
        fns["Modelplane functions\n(one pod per resource)"]
        prov["Providers\ngcp · aws · helm · kubernetes"]
        gw["Control-plane gateway"]
    end
    subgraph fleet["Fleet"]
        wc1["Workload cluster A"]
        wc2["Workload cluster B"]
    end
    cp <-->|"desired state (gRPC)"| fns
    cp -->|composes| prov
    cp -->|composes| gw
    prov -->|provision + install via kubeconfig| wc1
    prov -->|provision + install via kubeconfig| wc2
```

Modelplane installs a serving stack on each workload cluster: the components a cluster needs to serve models, providing inference-aware routing through Gateway API, multi-node serving, GPU binding through DRA, and observability, among others. The exact components evolve, but Modelplane composes and owns all of them. For provisioned clusters the providers also create the cluster and its node pools first.

## How a deployment is composed

A resource composes others, which compose others, until the tree bottoms out in provider resources and plain Kubernetes objects. A `ModelDeployment` is the clearest example. Its function schedules the replicas, then composes a `ModelReplica` for each, and a `ModelEndpoint` for each replica that's ready to serve. Each `ModelReplica` function composes the serving workload, a Deployment or a LeaderWorkerSet, onto its target workload cluster through provider-kubernetes.

```mermaid
flowchart TD
    md["ModelDeployment"]
    mr1["ModelReplica\n(cluster A)"]
    mr2["ModelReplica\n(cluster B)"]
    me1["ModelEndpoint\n(cluster A)"]
    me2["ModelEndpoint\n(cluster B)"]
    wl1["Deployment / LeaderWorkerSet\non workload cluster A"]
    wl2["Deployment / LeaderWorkerSet\non workload cluster B"]
    md --> mr1
    md --> mr2
    md --> me1
    md --> me2
    mr1 --> wl1
    mr2 --> wl2
```

The platform resources compose the same way.
An `InferenceCluster` composes a `GKECluster` or `EKSCluster` (the cloud infrastructure, via the cloud providers) and a `ServingStack` (the per-cluster software install, via provider-helm and provider-kubernetes). Engines bind GPUs through DRA: each `claim: DRA` device in a member's `nodeSelector` becomes a request in the `ResourceClaim` the serving pods claim through. ## The request path A served request crosses two gateways, both built on Gateway API. The **control-plane gateway** is the front door: a `ModelService` composes an `HTTPRoute` on it that matches the service's path prefix and forwards to the matched `ModelEndpoint`s, each of which is a `Service` pointing at a workload cluster's gateway address. The **workload-cluster gateway** then routes from the cluster edge to the engine pods. ```mermaid flowchart LR client["Client"] cpgw["Control-plane gateway"] wcgw["Workload-cluster gateway"] engine["Engine pods\n(vLLM, SGLang, ...)"] client -->|service path| cpgw cpgw -->|per-replica path| wcgw wcgw -->|engine path| engine ``` Each hop rewrites the path: the control plane rewrites the public prefix to the replica's path, and the workload gateway strips that down to what the engine serves. This per-backend path rewriting is the main thing the control-plane gateway has to support, and it narrows which Gateway API implementations can fill the role. Which gateway sits at each layer is internal, not part of the API. The [`InferenceGateway`]({{< ref "/platform/inference-gateway.md" >}}) `backend` field is an enum precisely so the control-plane gateway can grow other options over time. Target the `ModelService` URL rather than either gateway directly. --- # Llama-3.1-8B Source: https://docs.modelplane.ai/examples/llama-3.1-8b/ An 8B dense chat model on a single NVIDIA L4. The entry recipe: one `Standalone` engine, no cache, public weights from a Hugging Face mirror. It carries no `clusterSelector`, so device capacity alone matches it to any compatible L4 in the fleet. 
This recipe was run end to end on GKE; the `InferenceClass`, `InferenceCluster`, and `ModelDeployment` are the exact manifests from that run. The EKS platform shape is the standard single-L4 recipe. It passes server validation but was not served in this run. Apply the platform side first, then the ML side. The GKE `InferenceCluster` carries a GCP project placeholder to edit before applying. ## Platform {{< tabs >}} {{< tab "EKS" >}} {{< manifests "examples/llama-3.1-8b/inference-class-eks.yaml" >}} {{< manifests "examples/llama-3.1-8b/inference-cluster-eks.yaml" >}} {{< /tab >}} {{< tab "GKE" >}} {{< manifests "examples/llama-3.1-8b/inference-class-gke.yaml" >}} {{< manifests path="examples/llama-3.1-8b/inference-cluster-gke.yaml" apply="false" >}} {{< editCode >}} ```bash curl -fsSL {{< manifest-url "examples/llama-3.1-8b/inference-cluster-gke.yaml" >}} \ | sed 's/my-gcp-project/$@$@/' \ | kubectl apply -f - ``` {{< /editCode >}} {{< /tab >}} {{< /tabs >}} ## Deployment {{< manifests "examples/llama-3.1-8b/model-deployment.yaml" >}} {{< manifests "examples/llama-3.1-8b/model-service.yaml" >}} --- # Route to External Providers Source: https://docs.modelplane.ai/models/model-endpoint/ **API:** [`modelplane.ai/v1alpha1` · ModelEndpoint]({{< ref "/reference/modelendpoints" >}}) A `ModelEndpoint` is a single reachable inference endpoint that a [`ModelService`]({{< ref "model-service.md" >}}) can route to. Modelplane creates one for each of your replicas automatically, but you can also create one by hand to point at an inference endpoint Modelplane doesn't run, most often a SaaS provider like Together or Baseten. A service treats both the same, so you can front your own replicas and an external provider behind one URL: send overflow to the provider when your fleet is busy, or fail over to it as a break-glass option. 
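Fronting an external provider comes down to a prefix rewrite on the request path: the path a `ModelService` receives is mapped onto the path the provider serves. The sketch below illustrates that idea only. The `rewrite_path` helper is hypothetical, and the exact prefix the gateway matches is an assumption here, taken from the `/ml-team/qwen/v1/...` style of service path shown elsewhere in these docs.

```python
def rewrite_path(request_path: str, service_prefix: str, rewrite_prefix: str) -> str:
    """Replace the service's public prefix with the endpoint's rewrite prefix.

    service_prefix: the prefix the service is assumed to match,
                    for example "/ml-team/kimi-k2/v1/".
    rewrite_prefix: the value an endpoint would carry in rewritePath,
                    for example "/v1/".
    """
    if not request_path.startswith(service_prefix):
        raise ValueError(f"{request_path!r} is not under {service_prefix!r}")
    # Keep everything after the matched prefix and graft it onto the new one.
    return rewrite_prefix + request_path[len(service_prefix):]


# A chat-completions request to the service becomes the provider's own path.
upstream = rewrite_path(
    "/ml-team/kimi-k2/v1/chat/completions",
    service_prefix="/ml-team/kimi-k2/v1/",
    rewrite_prefix="/v1/",
)
assert upstream == "/v1/chat/completions"
```

In practice the gateway performs this rewrite, not your code; the sketch is only meant to show why an OpenAI-compatible provider that serves `/v1/...` would take `/v1/` as its rewrite path.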
## Routing to an external provider

Create a `ModelEndpoint` that sets three things: a label to select on, the provider's URL, and a rewrite path:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelEndpoint
metadata:
  name: kimi-k2-together
  namespace: ml-team
  labels:
    # 1. A label of your own for a ModelService to select on. Any label
    #    works; modelplane.ai/external-provider is a readable convention.
    modelplane.ai/external-provider: together
spec:
  # 2. The provider's base URL.
  url: https://api.together.xyz/
  # 3. The path to rewrite requests to. A ModelService receives requests at
  #    /<namespace>/<name>/v1/... and rewrites them to this prefix, so an
  #    OpenAI-compatible provider that serves /v1/... takes /v1/.
  rewritePath: /v1/
```

Then point a [`ModelService`]({{< ref "model-service.md" >}}) at it. Selecting `modelplane.ai/external-provider: together` routes to the provider; adding a second entry for a deployment fronts both behind one URL, so traffic can spill over to the provider alongside your own replicas:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: kimi-k2
  namespace: ml-team
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: kimi-k2 # your own replicas
    - selector:
        matchLabels:
          modelplane.ai/external-provider: together # the endpoint above
```

The provider must speak the OpenAI API, since that's the contract a `ModelService` exposes. Anything OpenAI-compatible works; `url` and `rewritePath` are all that change between providers.

## Example

{{< manifests "concepts/model-endpoint.yaml" >}}

---

# Scale the platform

Source: https://docs.modelplane.ai/getting-started/scale-the-platform/

You have one L4 cluster with a running model. In this guide, you'll add two larger-GPU clusters in different regions to grow the fleet available to the ML team.

Provisioning two more clusters takes about 10 to 15 minutes.
## Register more clusters {{< tabs >}} {{< tab "EKS" >}} Register two more clusters with a bigger hardware class: `L40S` (`48 Gi`) in `us-west` and `eu-central`: {{< manifests "getting-started/eks/platform-scale.yaml" >}} {{< hint "note" >}} `g6e.xlarge` runs ~$2/hr on demand. Two of them plus the `L4` from earlier is a few dollars for this tour. Clean up when you're done (see [Clean up]({{< ref "getting-started/clean-up.md" >}})). {{< /hint >}} {{< /tab >}} {{< tab "GKE" >}} Register two more clusters with a bigger hardware class: `A100` (`40 Gi`) in `us-west` and `us-east`. Apply the manifest, setting each cluster's `project` to your GCP project: {{< manifests path="getting-started/gke/platform-scale.yaml" apply="false" >}} {{< editCode >}} ```bash curl -fsSL {{< manifest-url "getting-started/gke/platform-scale.yaml" >}} \ | sed 's/my-gcp-project/$@$@/g' \ | kubectl apply -f - ``` {{< /editCode >}} {{< hint "note" >}} `a2-highgpu-1g` runs ~$3.50/hr on demand. Two of them plus the `L4` from earlier is a few dollars for this tour. Clean up when you're done (see [Clean up]({{< ref "getting-started/clean-up.md" >}})). {{< /hint >}} {{< /tab >}} {{< /tabs >}} Modelplane provisions both clusters in parallel: ```bash kubectl wait --for=condition=Ready ic --all --timeout=20m ``` ## Your model keeps running Growing the fleet doesn't disturb anything already deployed. `qwen-demo` stays on its original cluster and the two new clusters add capacity the moment they're `Ready` with no interruption for the ML team. A replica only moves if its deployment changes in a way that no longer fits where it runs. ## Next step The fleet now spans three clusters across three regions. The ML team is next. [Scale the model]({{< ref "getting-started/scale-the-model.md" >}}) to serve it from two regions behind a single endpoint. 
--- # Supported Providers Source: https://docs.modelplane.ai/platform/providers/ Modelplane is built on [Crossplane](https://crossplane.io) and shares its infrastructure providers, so the set of clouds and neoclouds it reaches grows alongside Crossplane itself. This page shows where Modelplane runs today and where it's headed. A provider can show up here in three ways: {{< hint "note" >}} - **Provisioning supported.** Modelplane creates and manages the whole cluster from an `InferenceCluster`, selected through `provisioning.provider`. GKE and EKS work this way today. - **Bring your own supported.** Register a cluster you already run with `source: Existing`. This works on any provider whose Kubernetes meets Modelplane's requirements (Dynamic Resource Allocation and a recent Kubernetes version), so you can run on the providers below now, ahead of native provisioning. - **Crossplane provider exists.** A Crossplane provider is published for the cloud. That provider is the path by which native provisioning lands, so it marks where Modelplane can grow next. {{< /hint >}} ## Clouds and neoclouds Listed alphabetically, spanning hyperscalers and GPU-specialist neoclouds. Each runs a managed Kubernetes service with GPU node pools, so the bring-your-own path covers them all today. Where a Crossplane provider exists, it's the path to native provisioning. 
{{< table >}} | Provider / service | Accelerators | Provisioning | BYO | Crossplane | |---|---|---|---|---| | Alibaba Cloud (ACK) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-alibabacloud" "provider-upjet-alibabacloud" "community" >}} | | AWS (EKS) | {{< accel nvidia >}} {{< accel trainium >}} | ✓ | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-aws" "provider-upjet-aws" "community" >}} | | Civo (K3s) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-civo" "provider-civo" "community" >}} | | CoreWeave (CKS) | {{< accel nvidia >}} | Planned | ✓ | none yet | | Crusoe (CMK) | {{< accel nvidia >}} {{< accel amd >}} | Planned | ✓ | none yet | | DigitalOcean (DOKS) | {{< accel nvidia >}} {{< accel amd >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-digitalocean" "provider-upjet-digitalocean" "community" >}} | | Fluidstack | {{< accel nvidia >}} | Planned | ✓ | none yet | | Google Cloud (GKE) | {{< accel nvidia >}} {{< accel tpu >}} | ✓ | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-gcp" "provider-upjet-gcp" "community" >}} | | Huawei Cloud (CCE) | {{< accel nvidia >}} {{< accel ascend >}} | Planned | ✓ | {{< repolink "https://github.com/huaweicloud/provider-huaweicloud" "provider-huaweicloud" "alpha" >}} | | IBM Cloud (IKS) | {{< accel nvidia >}} | Planned | ✓ | none active | | Lambda | {{< accel nvidia >}} | Planned | ✓ | none yet | | Linode / Akamai (LKE) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/linode/provider-linode" "provider-linode" "official" >}} | | Microsoft Azure (AKS) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-azure" "provider-upjet-azure" "community" >}} | | Nebius | {{< accel nvidia >}} | Planned | ✓ | none yet | | Oracle Cloud (OKE) | {{< accel nvidia >}} {{< accel amd 
>}} | Planned | ✓ | {{< repolink "https://github.com/oracle/crossplane-provider-oci" "crossplane-provider-oci" "official" >}} | | OVHcloud | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/edixos/provider-ovh" "edixos/provider-ovh" "community" >}} | | Scaleway (Kapsule) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/scaleway/crossplane-provider-scaleway" "crossplane-provider-scaleway" "official" >}} | | Tencent Cloud (TKE) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-tencentcloud" "provider-tencentcloud" "community" >}} | | Voltage Park | {{< accel nvidia >}} | Planned | ✓ | none yet | | Vultr (VKE) | {{< accel nvidia >}} {{< accel amd >}} | Planned | ✓ | {{< repolink "https://github.com/vultr/crossplane-provider-vultr" "crossplane-provider-vultr" "official" >}} | {{< /table >}} {{< hint "note" >}} **On-premises and bare metal.** Bring an on-prem cluster the same way as any other: stand up Kubernetes on your own hardware (like NVIDIA DGX BasePOD or SuperPOD) with NVIDIA Base Command Manager, Run:ai, or your own tooling, then register it with `source: Existing`. Provisioning it for you is on the roadmap too. Modelplane can drive NVIDIA Base Command Manager or other bare-metal Kubernetes provisioners through Crossplane, the same pattern it uses in the cloud. {{< /hint >}} Native provisioning expands as more Crossplane providers ship; until then, the bring-your-own path runs Modelplane on any conformant Kubernetes cluster today. {{< hint "tip" >}} Don't see your cloud or neocloud, or want to be added? [Open an issue](https://github.com/modelplaneai/modelplane/issues/new) and we'll track it. {{< /hint >}} {{< cardgroup cols="2" >}} {{< card title="Register a Cluster" href="/platform/inference-cluster/" >}} Add a cluster to Modelplane, provisioned or bring-your-own. 
{{< /card >}} {{< card title="Define Hardware Classes" href="/platform/inference-class/" >}} Describe the GPUs and provisioning recipe each node pool uses. {{< /card >}} {{< /cardgroup >}} --- # API Reference Source: https://docs.modelplane.ai/reference/ Modelplane's API is a set of Kubernetes custom resources. Each type below has its own page with the full spec and status schema, a runnable example, and fields you can link to directly. For release history, see the [GitHub releases page](https://github.com/modelplaneai/modelplane/releases). --- # Scale the model Source: https://docs.modelplane.ai/getting-started/scale-the-model/ A `ModelService` can front more than one `ModelDeployment`. Here you add a second deployment, pinned to a different region, and point the same service at both. The endpoint you already curled stays the same. Behind it, traffic now load-balances across two regions. ```mermaid graph LR subgraph fleet ["Fleet"] IC1["us-east\nL4"] IC2["us-west\nlarger GPU"] end subgraph ml ["ML team"] MD1["ModelDeployment\nqwen-demo"] MD2["ModelDeployment\nqwen-west\nclusterSelector: us-west"] MS["ModelService qwen\n/ml-team/qwen/v1/..."] end IC1 --> MD1 IC2 --> MD2 MD1 --> MS MD2 --> MS ``` ## Deploy to a second region The new deployment uses a `clusterSelector` to pin its replica to the `us-west` cluster you added in the last step, and selects the larger GPU there: {{< tabs >}} {{< tab "EKS" >}} {{< manifests "getting-started/eks/model-deployment-west.yaml" >}} {{< /tab >}} {{< tab "GKE" >}} {{< manifests "getting-started/gke/model-deployment-west.yaml" >}} {{< /tab >}} {{< /tabs >}} Wait until its replica is `Ready`, then check placement. 
You now have one replica per region:

```bash
kubectl get modelreplica -n ml-team
```

```shell {nocopy=true}
NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   42m
qwen-west-92535   eks-us-west   True     True    modelreplicas.modelplane.ai   8m
```

## Front both with one service

Update the `ModelService` to select both deployments. Each entry in `spec.endpoints` adds its matching replicas to the same endpoint:

{{< manifests "getting-started/model-service-multi.yaml" >}}

The endpoint URL doesn't change. Clients that had this URL before still have it; they don't know the fleet changed. The gateway load-balances across both regions, and if one region goes down, the other keeps serving.

Send the same request as before:

```bash
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
```

```bash
kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'
```

## That's the tour

You stood up a control plane, built a multi-region GPU fleet, deployed a model across it, and ended with one stable endpoint serving requests. The platform team published hardware. The ML team described what the model needs. Modelplane placed the model and served it behind a single endpoint.

[Clean up]({{< ref "getting-started/clean-up.md" >}}) tears everything down when you're done.

For more on the resources you used:

* [InferenceClass]({{< ref "platform/inference-class.md" >}})
* [InferenceCluster]({{< ref "platform/inference-cluster.md" >}})
* [ModelDeployment]({{< ref "models/model-deployment.md" >}})
* [ModelService]({{< ref "models/model-service.md" >}})

Modelplane is in active development and we're building in the open.
If you're running your own inference fleet and want to shape where this goes, we'd love to hear from you. Star the [repository](https://github.com/modelplaneai/modelplane), join us in [Slack](https://slack.crossplane.io), or read the [manifesto](https://modelplane.ai).

---

# Clean up

Source: https://docs.modelplane.ai/getting-started/clean-up/

Delete the model resources, then the clusters, and finally the control plane.

## Delete model resources

Delete model resources before clusters. Deleting a cluster first leaves the deployments reconciling against infrastructure that no longer exists.

```bash
kubectl delete md --all -n ml-team
kubectl delete ms --all -n ml-team
```

Wait for all model replicas to finish deleting:

```bash
kubectl get modelreplica -n ml-team --watch
```

## Delete the clusters

Delete all clusters with foreground cascading deletion. The serving stack on each workload cluster must uninstall while that cluster's API server is still reachable. Foreground deletion holds each cluster object until its stack finishes uninstalling. Background deletion can orphan cloud resources.

```bash
kubectl delete ic --all --cascade=foreground
```

Wait until all clusters are deleted:

```bash
kubectl get ic --watch
```

## Delete the control plane

Delete the kind cluster:

```bash
kind delete cluster --name modelplane
```

---

# EKSCluster

Source: https://docs.modelplane.ai/reference/eksclusters/

An EKSCluster provisions an EKS cluster with dedicated node groups for GPU inference and system workloads. It outputs a Secret containing the cluster kubeconfig that consumers use to target the cluster. The kubeconfig embeds a static bearer token that the AWS provider refreshes.

---

# GKECluster

Source: https://docs.modelplane.ai/reference/gkeclusters/

A GKECluster provisions a GKE cluster with dedicated node pools for GPU inference and system workloads. It outputs secrets containing the cluster kubeconfig and a GCP service account key that consumers can use to target the cluster.
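
Both cluster types publish the workload cluster's kubeconfig as a Secret. As a rough illustration of how a consumer might read it, the sketch below copies the kubeconfig out of a Secret into a local file. The helper name, the Secret name and namespace, and the `kubeconfig` data key are all assumptions made for illustration; check the reference pages for the names your installation actually uses.

```bash
# Hypothetical helper: write the kubeconfig stored in a Secret to a local file.
# The Secret name, namespace, and the "kubeconfig" data key are assumptions,
# not names confirmed by the Modelplane reference.
fetch_kubeconfig() {
  secret_name="$1"
  namespace="$2"
  out_file="$3"
  (
    # The file holds cluster credentials, so create it readable by the owner only.
    umask 077
    kubectl get secret "$secret_name" -n "$namespace" \
      -o jsonpath='{.data.kubeconfig}' | base64 --decode > "$out_file"
  )
}

# Example call (placeholder names):
#   fetch_kubeconfig my-cluster-kubeconfig default ./workload.kubeconfig
#   kubectl --kubeconfig ./workload.kubeconfig get nodes
```

Run the helper against the control cluster, where the Secret lives, and then point `kubectl --kubeconfig` at the resulting file to reach the workload cluster.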
---

# ServingStack

Source: https://docs.modelplane.ai/reference/servingstacks/

A ServingStack installs the serving substrate (LeaderWorkerSet, Gateway API, cert-manager, Prometheus) on a Kubernetes cluster.
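
One way to sanity-check that the substrate landed on a workload cluster is to look for the CRDs these components register. The sketch below is an illustration under stated assumptions: the CRD names are the upstream defaults for LeaderWorkerSet, Gateway API, and cert-manager, and whether a ServingStack installs exactly these is not confirmed here. Prometheus is left out because this page doesn't say how it is installed. The helper name and the kubeconfig path are hypothetical.

```bash
# Hypothetical check: report which serving-substrate CRDs exist on a cluster.
# CRD names are upstream defaults (an assumption); Prometheus is skipped on purpose.
check_substrate_crds() {
  kubeconfig="$1"
  missing=0
  for crd in \
    leaderworkersets.leaderworkerset.x-k8s.io \
    gateways.gateway.networking.k8s.io \
    certificates.cert-manager.io
  do
    if kubectl --kubeconfig "$kubeconfig" get crd "$crd" >/dev/null 2>&1; then
      echo "found:   $crd"
    else
      echo "missing: $crd"
      missing=1
    fi
  done
  # Exit status is 0 only when every CRD was found.
  return "$missing"
}

# Example call (placeholder path):
#   check_substrate_crds ./workload.kubeconfig
```

The function returns non-zero if any CRD is missing, so it can gate a script that deploys models only after the stack is in place.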