# Modelplane Documentation

> Modelplane is the open source control plane for AI model serving. It extends Crossplane to manage AI inference across a fleet of GPU clusters.

---

# Overview

Source: https://docs.modelplane.ai/overview/

<!-- vale write-good.TooWordy = NO -->
<!-- vale write-good.Passive = NO -->
Modelplane is the open source control plane for AI inference. It's software you
install and run in your own environment, and it orchestrates the models, serving
stack, and infrastructure across cloud, neocloud, and on-premise. Modelplane
supports running any model and any engine on any infrastructure, with the
frontier-level serving topologies and performance the largest models demand,
from a single GPU to disaggregated, multi-node deployments.

Modelplane operates across the whole fleet: provisioning inference clusters,
scheduling model deployments on compatible clusters, autoscaling model replicas
across clusters, caching model weights across clusters, and routing across
clusters.

It's an active system that is always reconciling the fleet toward the state you
declare. You install Modelplane on a Kubernetes cluster, which becomes the
control cluster for your inference fleet. It's built on
[Crossplane](https://crossplane.io) and fully integrates with your existing
platform systems.

{{< hint warning >}}
Modelplane is under active development. We have opted to build the project in the
open, collaborating with the broad AI inference community on integrations and
capabilities.
{{< /hint >}}

## Deploy a model

Modelplane's API is declarative, designed for platform teams responsible for the
inference infrastructure and developers deploying models on that infrastructure.

Once a platform team has provisioned inference clusters and declared the available
GPUs and networking fabric, an ML development team deploys a model with a
declarative manifest:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-demo
  namespace: ml-team
spec:
  replicas: 1
  engines:
  - name: qwen
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args: ["--model=Qwen/Qwen2.5-0.5B-Instruct"]
```

Modelplane schedules a model replica onto an inference cluster with free,
compatible GPUs and memory, and deploys the serving engine. To expose an
OpenAI-compatible endpoint, declare a model service:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen-demo
```
<!-- vale Microsoft.HeadingAcronyms = NO -->
## A universal control plane for AI inference
<!-- vale Microsoft.HeadingAcronyms = YES -->

Modelplane is designed to be a universal control plane for inference. It runs
inference clusters on any cloud, neocloud, or on-premise environment, or any
combination of them. Modelplane can provision the clusters for you, or you can
bring your own.

It supports any serving engine that runs as a container, and can serve
frontier-quality models using advanced topologies including tensor parallel,
pipeline parallel, data and expert parallel, and prefill/decode disaggregation.
Modelplane works across different accelerators and networking fabrics, and
schedules each model's replicas by matching the model's hardware requirements to
the hardware available across your clusters.

## What Modelplane is not

Modelplane is not a serving engine like vLLM, SGLang, or TensorRT-LLM. Modelplane
composes serving engines and orchestrates them fleet-wide across cloud, neocloud,
and on-premise. Modelplane is not a managed inference service like Baseten,
Together, or Fireworks. These offer cloud services, while Modelplane is
self-hosted software.

## Next steps

{{< cardgroup cols="2" >}}
{{< card title="Get started" href="/getting-started/" cta="Deploy on a real fleet" >}}
Go from nothing to a live OpenAI-compatible endpoint in about 45 minutes.
{{< /card >}}
{{< card title="Why Modelplane" href="/overview/why/" cta="Learn more" >}}
Learn more about Modelplane's capabilities and how it works.
{{< /card >}}
{{< /cardgroup >}}

<!-- vale write-good.Passive = YES -->
<!-- vale write-good.TooWordy = YES -->


---

# Deploy a Model

Source: https://docs.modelplane.ai/models/model-deployment/

**API:** [`modelplane.ai/v1alpha1` · ModelDeployment]({{< ref "/reference/modeldeployments" >}})
<!-- vale write-good.Passive = NO -->
A `ModelDeployment` is the ML team's primary interface. You describe the model
you want served, the hardware it needs, and how many copies to run; Modelplane
schedules it onto matching clusters and keeps it running. You never name a
cluster.

Modelplane is unopinionated about the engine itself. You bring the container and
its flags, and Modelplane shapes a serving topology around it. Parallelism,
quantization, and KV transfer are all set by the engine flags you write;
Modelplane never injects them.

A deployment's `spec.engines` describes its topology through two choices:

- **One pod or a gang**: whether an engine is a single `Standalone` pod or a
  `Leader` with one or more `Worker` pods coordinating across nodes.
- **Unified or disaggregated**: whether `spec.serving.mode` keeps prefill and
  decode together (`Unified`, the default) or splits them across two engines
  (`PrefillDecode`).

How many of each to run is a separate question, covered in
[Sizing a deployment](#sizing-a-deployment).

## Single-node

The default, and what the [getting started tour]({{< ref "/getting-started" >}})
deploys. One `Standalone` member is one pod on one node, claiming that node's
GPUs through its `nodeSelector`. It's usually the right choice when a model fits
on a single node. Within a node, tensor parallelism is an engine flag
(`--tensor-parallel-size`), not a Modelplane concept.

```yaml {nocopy=true}
engines:
- name: qwen
  members:
  - role: Standalone        # one pod, one node
```

## Multi-node

When a model is too large for one node's GPUs, make the engine a gang: a `Leader`
member plus a `Worker` member whose `worker.nodes` sets how many worker pods to
run, one per node. The pods serve the model together; how the model splits across
them (tensor, pipeline, data, or expert parallelism) is up to your engine flags.

A gang should use a [`ModelCache`]({{< ref "model-cache.md" >}}) via
`spec.modelCacheRef`, so every pod mounts the same weights instead of each
pulling its own.

```yaml {nocopy=true}
modelCacheRef:
  name: qwen3-coder         # recommended for gangs
engines:
- name: qwen3-coder
  members:
  - role: Leader
  - role: Worker
    worker:
      nodes: 1              # one worker pod per node
```

A member's `env` can read pod fields through `valueFrom.fieldRef`, for example
setting vLLM's `VLLM_HOST_IP` from `status.podIP`. Multi-NIC RDMA nodes need this
so the engine binds the right interface instead of guessing it.

## Disaggregated serving

The prefill and decode phases have opposite hardware profiles, and on one engine
a prefill burst stalls the decodes already running. Set
`spec.serving.mode: PrefillDecode` to run them as two engines, one marking
`phase: Prefill` and the other `phase: Decode`. Modelplane fronts the pair with
inference-aware routing that sequences prefill then decode, moving the KV cache
between them. Each phase can sit on the GPU class that suits it.

```yaml {nocopy=true}
serving:
  mode: PrefillDecode       # the two engines below are one P/D pair
engines:
- name: prefill
  phase: Prefill
- name: decode
  phase: Decode
```

Disaggregation pays off for large models under load with strict latency targets
and long context. For small models or low traffic, the KV-transfer overhead
outweighs the benefit, so unified serving is the default.

Disaggregated serving requires an engine image that includes the **NIXL** KV-transfer runtime.
vLLM's `NixlConnector` (and SGLang's prefill/decode transfer) import the `nixl`
package, so disaggregated engines crash at startup with `NIXL is not available`
on an image that lacks it. Recent vanilla `vllm/vllm-openai` images include NIXL,
so pin a current tag rather than an old one. The engine image is yours to choose,
so this is a prerequisite Modelplane does not bundle for you.
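
One way to check an image before deploying it is to try the import the engines perform. This is a minimal sketch, assuming Docker is installed locally and the image ships a `python3` binary; `image_has_nixl` is a hypothetical helper name, not a Modelplane command.

```bash
# image_has_nixl IMAGE
# Succeeds if Python inside the image can import the nixl package.
# Assumptions: Docker is available and the image contains python3.
image_has_nixl() {
  local image=$1
  docker run --rm --entrypoint python3 "$image" -c 'import nixl' >/dev/null 2>&1
}

# Usage, with the image tag you plan to deploy:
#   image_has_nixl "<your-engine-image>" && echo "NIXL present" || echo "NIXL missing"
```

A failed import here suggests the same image would hit the `NIXL is not available` startup error in `PrefillDecode` mode.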

## Requesting GPUs

You don't name a cluster or a GPU model. Instead each member's `nodeSelector`
lists the hardware its pods need, and Modelplane finds a node pool that has it.
The platform team publishes node pools as `InferenceClass` resources, each
describing the devices its nodes carry. Your request is matched against them.

A request names a device (`gpu`), how many of it each pod needs (`count`), and
one or more `selectors` the device must match:

```yaml {nocopy=true}
nodeSelector:
  devices:
  - name: gpu
    count: 1                # one GPU per pod
    selectors:
    - cel: |
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```

Each selector is a single [CEL](https://cel.dev/) expression that returns true or
false for one device. CEL is a small expression language. The part in brackets,
`"gpu.nvidia.com"`, is the GPU vendor's driver. The fields after it, like
`memory` or `architecture`, are what the platform team published for that
device. This one says "match a GPU whose memory is at least 40Gi." A device has
to match every selector in the request, so give two selectors to mean "Hopper,
with at least 80Gi."

### Requesting more than one device

`devices` is a list, so a member can ask for distinct kinds of hardware at once,
each its own entry with its own `count` and `selectors`. A node pool matches the
member only when it satisfies every entry. This is how you ask for both a GPU and
a fast NIC on the same node:

```yaml {nocopy=true}
nodeSelector:
  devices:
  - name: gpu
    count: 8
    selectors:
    - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
  - name: nic
    count: 1
    selectors:
    - cel: device.attributes["nic.nvidia.com"].linkType == "infiniband"
```

### What you can match on

Each selector is evaluated against one device and must return a boolean. The
device exposes three things:

- `device.driver`: the device's driver, a string.
- `device.attributes["<driver>"].<name>`: a typed attribute (string, bool, int,
  or version), such as `architecture` or `cudaComputeCapability`.
- `device.capacity["<driver>"].<name>`: a capacity quantity, such as `memory`.

Two helpers build comparable values: `quantity()` parses Kubernetes quantities
like `"40Gi"`, and `semver()` parses versions like `"9.0.0"`. Both support
`compareTo` (which orders two values), `isGreaterThan`, and `isLessThan`. Build
and combine conditions with the usual CEL operators (`==`, `!=`, `>=`, `&&`, `||`).

```yaml {nocopy=true}
selectors:
# Capacity: at least 40Gi of GPU memory. >= 0 reads as "left is at least right".
- cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
# Attribute equality: a specific architecture.
- cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
# Version attribute: a CUDA compute capability strictly newer than 8.9.0.
- cel: device.attributes["gpu.nvidia.com"].cudaComputeCapability.isGreaterThan(semver("8.9.0"))
# Driver: match any device from a given driver.
- cel: device.driver == "gpu.nvidia.com"
# Presence: only match a device that publishes a given domain.
- cel: '"gpu.nvidia.com" in device.attributes'
# Two conditions in one selector.
- cel: |
    device.attributes["gpu.nvidia.com"].architecture == "Hopper" &&
    device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0
```

This is the Kubernetes DRA device selector expression surface. The
Kubernetes-specific CEL extension libraries (such as regular expressions and IP
address helpers) aren't available. Selectors in practice are attribute and
capacity comparisons like those above.
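
The `compareTo(...) >= 0` idiom in the selectors above can be read through a small shell sketch. It assumes `compareTo` returns -1, 0, or 1, which is how the Kubernetes CEL quantity and semver libraries document it. The `compare_to` function and the device sizes are illustrative stand-ins, not part of Modelplane.

```bash
# Illustrative stand-in for CEL's compareTo on two quantities, both given in Mi.
# Prints -1 if left < right, 0 if they are equal, 1 if left > right.
compare_to() {
  local left=$1 right=$2
  if (( left < right )); then echo -1
  elif (( left > right )); then echo 1
  else echo 0
  fi
}

requested=$(( 40 * 1024 ))   # quantity("40Gi") expressed in Mi

# Hypothetical devices with 24Gi, 40Gi, and 80Gi of memory.
for device_mi in 24576 40960 81920; do
  if (( $(compare_to "$device_mi" "$requested") >= 0 )); then
    echo "${device_mi}Mi: match"
  else
    echo "${device_mi}Mi: no match"
  fi
done
```

A strict check such as `isGreaterThan` corresponds to `compareTo(...) > 0`, so a device sitting exactly at the threshold would not match it.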

### Seeing what's available

To see what you can match against, list the classes the platform team has
published and look at the devices each one declares:

```bash
kubectl get inferenceclass
kubectl describe inferenceclass gke-l4-1x-g2
```

The `describe` output shows each device's driver, attributes (like
`architecture`), and capacity (like `memory`), which are exactly the keys your
selectors read. If a selector asks for something no published class offers, the
deployment won't schedule.

## Sizing a deployment

Three independent numbers control how many pods a deployment runs:

- **`spec.replicas`** stamps out whole copies of the entire topology. Each
  replica is a complete serving instance, and replicas usually land on different
  clusters. This is the scaling axis (see [Scaling](#scaling)).
- **`engines[].copies`** runs several identical copies of one engine within a
  replica, on the same cluster. It's a fixed number, sized once, never
  autoscaled. Copies make a replica more resilient within its cluster: a node
  failure drops one copy instead of taking the whole replica out of service. In
  disaggregated serving they also set the prefill-to-decode ratio.
- **`worker.nodes`** sets how many nodes one gang spans: a `Leader` plus that
  many `Worker` pods. It's how big a single multi-node engine is.
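
One way to read the three numbers together is as a product. The sketch below assumes each copy of a gang engine is one `Leader` plus `worker.nodes` `Worker` pods, and that a `Standalone` engine is one pod per copy. `pods_per_engine` and the sample values are illustrative, not Modelplane commands or defaults.

```bash
# pods_per_engine COPIES WORKER_NODES
# WORKER_NODES=0 models a Standalone engine: one pod per copy.
# Otherwise each copy is a gang of one Leader plus WORKER_NODES Worker pods.
pods_per_engine() {
  local copies=$1 worker_nodes=$2
  echo $(( copies * (1 + worker_nodes) ))
}

replicas=3       # spec.replicas (hypothetical value)
copies=2         # engines[].copies (hypothetical value)
worker_nodes=1   # worker.nodes (hypothetical value)

per_replica=$(pods_per_engine "$copies" "$worker_nodes")
echo "pods per replica: ${per_replica}"
echo "pods across the fleet: $(( replicas * per_replica ))"
```

With these sample values, each replica runs two gangs of two pods, so scaling `spec.replicas` by one adds or removes four pods at a time.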

## Scaling

`spec.replicas` is the only scaling axis. Each replica is a complete,
fixed-shape serving instance, so scaling adds or removes whole instances across
the fleet. Because the deployment exposes the Kubernetes scale subresource,
`kubectl scale` and KEDA work without anything extra. There's no in-cluster pod
autoscaling.
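
Because scaling goes through the scale subresource, a manual scale can be sketched as a thin wrapper around `kubectl scale`. This assumes the resource is addressable as `modeldeployment`; the helper name `scale_model` and the deployment and namespace in the usage comment are hypothetical.

```bash
# scale_model NAME NAMESPACE REPLICAS
# Sets spec.replicas on a ModelDeployment through the scale subresource.
# Assumption: kubectl resolves the resource type "modeldeployment".
scale_model() {
  local name=$1 namespace=$2 replicas=$3
  kubectl scale modeldeployment "$name" --namespace "$namespace" --replicas="$replicas"
}

# Usage, against a control plane that has the deployment:
#   scale_model qwen-demo ml-team 2
```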

## Choosing a topology

| Topology | Use when | How you set it |
|----------|----------|----------------|
| Single-node | The model fits on one node's GPUs | One `Standalone` member (the default) |
| Multi-node | The model is too large for one node | A `Leader` and one or more `Worker` members, ideally with a `modelCacheRef` |
| Disaggregated serving | Large model, heavy load, strict latency, long context | `serving.mode: PrefillDecode` with two phase engines |

## Examples

{{< tabs >}}
{{< tab "Single-node" >}}
{{< manifests "concepts/model-deployment.yaml" >}}
{{< /tab >}}
{{< tab "Multi-node" >}}
{{< manifests "concepts/model-deployment-multinode.yaml" >}}
{{< /tab >}}
{{< /tabs >}}
<!-- vale write-good.Passive = YES -->


---

# Get started

Source: https://docs.modelplane.ai/getting-started/


Modelplane is an open source control plane for AI inference. It separates two
concerns: a platform team managing GPU capacity, and ML teams deploying models
against it. Without it, every change on one side creates work for the other.
When the platform team updates infrastructure, ML teams have to react. When
model requirements change, the platform team gets a request.

With Modelplane, the platform team publishes hardware without knowing what
models will run on it. The ML team declares what a model needs without knowing
what clusters exist. The control plane resolves it and keeps it current as
both sides change.

In this tour, you'll switch between provisioning infrastructure and declaring a
model to see how they interact. By the end you'll have a GPU fleet across three regions and one OpenAI-compatible endpoint routing to a model served across two of them.

This is not a production setup and takes around 45 minutes to run.

## What you'll build

The platform team provisions a starter L4 cluster, then adds two A100 regions;
the ML team serves a model on the L4, then scales it onto an A100, all behind one
endpoint.

{{< asciinema src="what-youll-build.cast" poster="npt:2:13" >}}

## Before you begin

You'll need [kind](https://kind.sigs.k8s.io/),
[kubectl](https://kubernetes.io/docs/tasks/tools/), and
[Helm](https://helm.sh/docs/intro/install/) installed, plus an AWS or GCP account
with permission to create clusters. Each step covers what it needs as you reach
it.

## The tour

1. [Installation]({{< ref "getting-started/installation.md" >}}): stand up the Modelplane control plane.
2. [Build the platform]({{< ref "getting-started/build-the-platform.md" >}}): provision your first GPU cluster.
3. [Deploying a model]({{< ref "getting-started/deploying-a-model.md" >}}): serve a model and send it a request.
4. [Scale the platform]({{< ref "getting-started/scale-the-platform.md" >}}): grow to a multi-region fleet.
5. [Scale the model]({{< ref "getting-started/scale-the-model.md" >}}): serve the model from two regions behind one endpoint.

First, follow the [Installation]({{< ref "getting-started/installation.md"
>}}) guide.


---

# Installation

Source: https://docs.modelplane.ai/getting-started/installation/

The control plane is where everything in Modelplane runs. In this step you'll install it on a local kind cluster, using Crossplane for reconciliation and the Modelplane APIs. No cloud yet; that comes next.

This step takes about five minutes.

## Prerequisites

Install [kind](https://kind.sigs.k8s.io/),
[kubectl](https://kubernetes.io/docs/tasks/tools/), and
[Helm](https://helm.sh/docs/intro/install/) on your machine.

{{< hint "note" >}}
You can run your Modelplane control plane anywhere. This tour uses kind for
illustration.
{{< /hint >}}

## Install the control plane

Crossplane provides the reconciliation engine and package management. Create the
kind cluster and install it with Helm:

```bash
kind create cluster --name modelplane
```

```bash
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update crossplane-stable
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace \
  --set "args={--enable-dependency-version-upgrades}" \
  --wait
```

Apply the bootstrap resources. They grant Crossplane the permissions it needs to
manage your cluster:

```shell
kubectl apply -f {{< manifest-url "getting-started/prerequisites.yaml" >}}
```

{{< expand "Review the prerequisites manifest" >}}
{{< manifests "getting-started/prerequisites.yaml" >}}
{{< /expand >}}

## Install Modelplane

The Modelplane Configuration adds the Modelplane APIs and the composition
functions that reconcile them:

{{< manifests "getting-started/configuration.yaml" >}}

Wait until the configuration is healthy:

```bash
kubectl wait configuration/modelplane --for=condition=Healthy --timeout=5m
```

## Next step

The control plane is running but has nothing to schedule against yet. In the
next step, you'll [build the platform]({{< ref
"getting-started/build-the-platform.md" >}}) to provision a GPU cluster and
publish what hardware it offers.


---

# Qwen3-8B

Source: https://docs.modelplane.ai/examples/qwen3-8b/

<!-- vale write-good.Passive = NO -->
An 8.2B dense chat model on a single NVIDIA L4. The smallest recipe: one
`Standalone` engine, no cache, weights pulled straight from Hugging Face.

This recipe was run end to end; the `InferenceClass` and `ModelDeployment` are
the exact manifests from that run. Apply the platform side first, then the ML
side.

## Platform

{{< manifests "examples/qwen3-8b/inference-class.yaml" >}}

{{< manifests "examples/qwen3-8b/inference-cluster.yaml" >}}

## Deployment

{{< manifests "examples/qwen3-8b/model-deployment.yaml" >}}

{{< manifests "examples/qwen3-8b/model-service.yaml" >}}
<!-- vale write-good.Passive = YES -->


---

# Set Up the Gateway

Source: https://docs.modelplane.ai/platform/inference-gateway/

**API:** [`modelplane.ai/v1alpha1` · InferenceGateway]({{< ref "/reference/inferencegateways" >}})
<!-- vale write-good.Passive = NO -->
The `InferenceGateway` sets up the control plane's front door: one unified,
OpenAI-compatible address that every `ModelService` is exposed through, routing
each request on to the inference cluster serving it.

The `InferenceGateway` is a singleton: create exactly one, named `default`, on
your Modelplane control plane. It fronts every inference cluster in the fleet, so
you don't create one per cluster.

The `backend` field selects the gateway implementation. `Traefik` is the only
value today.

On a cloud cluster with a native LoadBalancer controller, the gateway's `Service`
gets an external address on its own. On kind or bare-metal, where there's no such
controller, set `spec.traefik.loadBalancer: MetalLB` and give it an address pool
in `spec.traefik.metallb.addressPool` so the gateway gets an IP. See the example
below.

Once the gateway is ready, read its external address from `status.address`:

```bash
kubectl get ig default -o jsonpath='{.status.address}'
```

That address is the host of every `ModelService` URL
(`http://<address>/<namespace>/<service>`), so it's what you hand to ML teams.
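
The URL pattern above can be sketched as a small helper. `model_service_url` is a hypothetical name, and the address, namespace, and service values are placeholders chosen for illustration.

```bash
# model_service_url ADDRESS NAMESPACE SERVICE
# Joins the gateway address with the service's namespace and name.
model_service_url() {
  printf 'http://%s/%s/%s\n' "$1" "$2" "$3"
}

# 203.0.113.10 is a placeholder from the documentation address range.
# On a live control plane, read it from status.address instead:
#   address=$(kubectl get ig default -o jsonpath='{.status.address}')
address="203.0.113.10"
url=$(model_service_url "$address" ml-team qwen)
echo "$url"
```

An OpenAI-style client would typically append a path such as `/v1/chat/completions` to that base URL. That suffix is the OpenAI convention and an assumption here, not something this page specifies.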

## Example

{{< manifests "concepts/inference-gateway.yaml" >}}
<!-- vale write-good.Passive = YES -->


---

# Why Modelplane

Source: https://docs.modelplane.ai/overview/why/

<!-- vale write-good.TooWordy = NO -->
<!-- vale write-good.Passive = NO -->
Organizations are increasingly choosing open-weight models: they can be
post-trained, including with reinforcement learning, to compete with frontier
models, and they put cost, governance, and data sovereignty back under the
organization's control. As they do, platform teams are increasingly asked to
provide GPU inference to their ML and development teams the same way they
already provide cloud infrastructure.

## Kubernetes is becoming the default orchestrator

Kubernetes is rapidly becoming the default orchestrator for inference. The broader 
cloud-native community is investing heavily to make it a first-class platform for
AI workloads, adding device-aware scheduling, multi-node inference, distributed
serving, and accelerator management. The major open source inference projects are
converging on it; among them are vLLM, SGLang, NVIDIA Dynamo, llm-d, Ray, Slurm,
KubeAI, and Kueue. Neoclouds like Baseten and CoreWeave have standardized on
Kubernetes for their own operations. Inside a single cluster, the open source
stack is now strong.

## Inference is a fleet problem

Inference, however, almost always runs across more than one cluster. Accelerator
availability scatters capacity across hardware types, providers, and regions.
Sovereignty and compliance pin workloads to specific locations. Operators run
across multiple clouds and on-premise environments. Large clusters
concentrate failure and risk, so fleets of smaller clusters are often preferable,
and inference workloads don't bin-pack the way other workloads do.

As inference grows into a fleet, a new set of problems appears above any single
cluster:

- Deciding where each model runs across available capacity.
- Optimizing placement across heterogeneous accelerators.
- Failing over across clouds and regions.
- Routing by cost, latency, and sovereignty requirements.
- Provisioning new capacity as demand grows.
- Caching and distributing model weights across the fleet.
- Managing the lifecycle of models, clusters, and infrastructure as one system.

Open source projects address pieces of this, but none brings them together in a
fleet-wide system of record that manages placement, caching, capacity, policy,
and routing. The labs, hyperscalers, and managed providers have all solved these
problems in proprietary ways, but an open equivalent does not yet exist.

## Modelplane extends Kubernetes to manage the fleet

Modelplane does for the fleet what Kubernetes does for the cluster. It's the open
source control plane above your inference clusters across cloud, neocloud, and
on-premise: it places model deployments, autoscales replicas, provisions and
manages the infrastructure underneath, caches and distributes model weights, and
routes inference through one unified gateway with fallback to managed providers.
It turns "I need this model served" into a stable endpoint for any ML team.

Modelplane composes the open source inference projects rather than replacing
them, and stays neutral across models, accelerators, clouds, and serving stacks.
It's built on [Crossplane](https://crossplane.io) and extends Kubernetes to
manage inference at the fleet level. Modelplane is open source under the Apache
2.0 license, and we plan to donate it to a neutral open source foundation later
this year.

{{< cardgroup cols="2" >}}
{{< card title="How Modelplane works" href="/overview/how-it-works/" >}}
The architecture, the resources, and what happens when you deploy a model.
{{< /card >}}
{{< card title="FAQ" href="/overview/faq/" >}}
How Modelplane compares to cluster orchestrators and managed providers, and what it requires.
{{< /card >}}
{{< /cardgroup >}}
<!-- vale write-good.TooWordy = YES -->
<!-- vale write-good.Passive = YES -->


---

# Build the platform

Source: https://docs.modelplane.ai/getting-started/build-the-platform/

This is the platform team's side of Modelplane. You set up the gateway that
fronts your models, give the control plane cloud credentials, and register your
first GPU cluster: a hardware profile published as an `InferenceClass` and an
`InferenceCluster` that offers it.

In the next step, the ML team will create a model deployment that schedules
against this capacity without knowing which cluster it runs on.

## Prerequisites

{{< tabs >}}
{{< tab "EKS" >}}
- An AWS account with permissions to create EKS clusters, VPCs, and IAM roles
- AWS access key ID and secret access key
{{< /tab >}}
{{< tab "GKE" >}}
- A GCP account with permissions to create GKE clusters, VPCs, and IAM roles
- A GCP service account JSON key
{{< /tab >}}
{{< /tabs >}}

## Set up the InferenceGateway

<!-- vale ai-tells.EmptyPadding = NO -->
The `InferenceGateway` installs Traefik Proxy and MetalLB on the control plane.
Traefik routes inference traffic to model replicas. MetalLB assigns Traefik's
`LoadBalancer` service an external IP on kind, which doesn't have a cloud load
balancer. You need exactly one `InferenceGateway` per control plane, named `default`.
<!-- vale ai-tells.EmptyPadding = YES -->

If you run the control plane on a cloud cluster with native `LoadBalancer`
support, omit the `loadBalancer` field.

{{< manifests "getting-started/inference-gateway.yaml" >}}

Wait until the gateway is ready:

```bash
kubectl wait --for=condition=Ready ig/default --timeout=5m
```

## Configure cloud credentials

Give the control plane credentials so it can provision clusters in your cloud
account.

{{< tabs >}}
{{< tab "EKS" >}}
Create an AWS credentials file:

{{< editCode >}}
```ini
[default]
aws_access_key_id = $@<aws_access_key>$@
aws_secret_access_key = $@<aws_secret_key>$@
```
{{< /editCode >}}

Create a Kubernetes secret:

{{< editCode >}}
```bash
kubectl create secret generic aws-creds \
  --from-file=credentials=$@</path/to/aws-credentials>$@ \
  -n crossplane-system
```
{{< /editCode >}}

Apply the `ClusterProviderConfig` referencing your secret:

{{< manifests "getting-started/clusterproviderconfig-aws.yaml" >}}
{{< /tab >}}

{{< tab "GKE" >}}
Create a Kubernetes secret:

{{< editCode >}}
```bash
kubectl create secret generic gcp-creds \
  --from-file=credentials=$@<path/to/gcp-key>$@.json \
  -n crossplane-system
```
{{< /editCode >}}

Apply the `ClusterProviderConfig`, setting `projectID` to your GCP project:

{{< manifests path="getting-started/clusterproviderconfig-gke.yaml" apply="false" >}}

{{< editCode >}}
```bash
curl -fsSL {{< manifest-url "getting-started/clusterproviderconfig-gke.yaml" >}} \
  | sed 's/my-gcp-project/$@<your-gcp-project>$@/' \
  | kubectl apply -f -
```
{{< /editCode >}}
{{< /tab >}}
{{< /tabs >}}

## Publish hardware and register the cluster

The `InferenceClass` describes a hardware profile and how to provision it. The
`InferenceCluster` registers a cluster that offers it. Apply both:

{{< tabs >}}
{{< tab "EKS" >}}
{{< manifests "getting-started/eks/platform.yaml" >}}

Modelplane provisions the cluster. This takes about 15 minutes:

```bash
kubectl wait --for=condition=Ready ic/eks-us-east --timeout=20m
```
{{< /tab >}}

{{< tab "GKE" >}}
Apply the manifest, setting the cluster's `project` to your GCP project:

{{< manifests path="getting-started/gke/platform.yaml" apply="false" >}}

{{< editCode >}}
```bash
curl -fsSL {{< manifest-url "getting-started/gke/platform.yaml" >}} \
  | sed 's/my-gcp-project/$@<your-gcp-project>$@/' \
  | kubectl apply -f -
```
{{< /editCode >}}

Modelplane provisions the cluster. This takes about 15 minutes:

```bash
kubectl wait --for=condition=Ready ic/starter --timeout=20m
```
{{< /tab >}}
{{< /tabs >}}

{{< hint "note" >}}
Modelplane is reconciling the infrastructure against the source of truth, the
manifest you just applied.

While you wait, Modelplane is creating the EKS or GKE cluster and its GPU node
pool, then installing the inference stack: LeaderWorkerSet for multi-node
serving, llm-d for inference-aware routing, Envoy Gateway for traffic
management, and the storage class for model weights. This is the same
reconciliation loop Crossplane uses to configure other infrastructure, extended
to the inference layer.
{{< /hint >}}

Once the cluster is `Ready`, the ML team can deploy a model on it.

{{< hint "note" >}}
A cloud GPU cluster costs money while it runs. To stop the tour and resume
later, follow [Clean up]({{< ref "getting-started/clean-up.md" >}}).
{{< /hint >}}

## Next step

Now that the platform is provisioned, the ML team can [deploy a model]({{< ref
"getting-started/deploying-a-model.md" >}}) by describing what the model needs, not the infrastructure.


---

# Define Hardware Classes

Source: https://docs.modelplane.ai/platform/inference-class/

**API:** [`modelplane.ai/v1alpha1` · InferenceClass]({{< ref "/reference/inferenceclasses" >}})

<!-- vale write-good.Passive = NO -->
An `InferenceClass` is a tested recipe for a GPU node pool. It bundles:


- **Devices**: the node's hardware as a list of Dynamic Resource Allocation (DRA)
  style devices, each with a driver, count, typed attributes, and capacity. The
  scheduler matches a member's `nodeSelector` against these devices, and GPUs
  bind to pods through DRA.
- **Provisioning** (optional): how to create a node pool of this class on a
  specific cloud. Classes without provisioning are for existing clusters where
  the pool already exists.

Different clouds and GPU types imply different classes. A GKE L4 pool is
`gke-l4-1x-g2`. A bare-metal H100 pool is `h100-8x-byo` (no provisioning).

## Describing devices

A class's `devices` follow Kubernetes
[Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
(DRA), the mechanism modern Kubernetes uses to match GPUs to pods. Each device
has a `driver` (the vendor that owns it, such as `gpu.nvidia.com`), a `count`
(how many a node has), typed `attributes` (such as `architecture`), and
`capacity` (quantities, such as `memory`). This mirrors the shape the GPU's DRA
driver publishes on a real node, so what you declare here is what an ML team's
`nodeSelector` matches against and what DRA binds at runtime.

You author the attribute and capacity keys, and there's no fixed list. Pick the
ones an ML team would reasonably select on, such as GPU memory, architecture, and
compute capability, and use the same names the driver reports.

## DRA and synthetic devices

Each device sets a `claim` discriminator:

- **`DRA`** (the default) is hardware a real DRA driver exposes, today GPUs.
  Modelplane both schedules against it and binds it to pods.
- **`Synthetic`** is described for scheduling only, never claimed. Use it for
  hardware that matters for placement but has no DRA driver yet, like an
  InfiniBand fabric.

## The device contract

The `driver`, attribute keys, and capacity keys a class declares are a contract
with the ML team: a `ModelDeployment`'s `nodeSelector` matches a pool only if the
class publishes the attributes and capacity it asks for. ML teams write those
matches as [CEL](https://cel.dev/) selectors over the keys you publish here. For
GPUs, these keys should mirror what the DRA driver reports, so the same selector
that places a deployment on the pool also binds the right device.

Publish a device's real usable capacity, not its nominal spec. An `80GB` H100
reports about `81559Mi` of usable memory, which is less than `80Gi` (`81920Mi`).
A class that declares `80Gi` would let a `nodeSelector` asking for `>= 80Gi`
match the pool, but the GPU would then fail to bind.
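
The arithmetic behind that mismatch is easy to check. The sketch below is only an illustration: `to_bytes` is a hypothetical helper that handles the binary suffixes used on this page, not a full Kubernetes quantity parser.

```python
# Binary-unit multipliers, in bytes, for the suffixes used on this page.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '80Gi' or '81559Mi' to bytes."""
    number, suffix = quantity[:-2], quantity[-2:]
    return int(number) * UNITS[suffix]


requested = to_bytes("80Gi")     # what the ML team's selector asks for
nominal = to_bytes("80Gi")       # what a class copied from the spec sheet declares
usable = to_bytes("81559Mi")     # what the device reports, per the text above

print(requested // 2**20)        # 81920 (Mi)
print(nominal >= requested)      # True: the selector matches the declared class
print(usable >= requested)       # False: the real device is 361Mi short
```

Declaring `81559Mi` instead keeps the `>= 80Gi` selector from matching a pool whose GPUs can't satisfy it.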

## Examples

{{< tabs >}}
{{< tab "GKE L4" >}}
{{< manifests "concepts/inference-class-gke-l4.yaml" >}}
{{< /tab >}}
{{< tab "EKS L4" >}}
{{< manifests "concepts/inference-class-eks-l4.yaml" >}}
{{< /tab >}}
{{< tab "H100 bare-metal" >}}
{{< manifests "concepts/inference-class-h100-byo.yaml" >}}
{{< /tab >}}
{{< /tabs >}}
<!-- vale write-good.Passive = YES -->


---

# Expose a Model

Source: https://docs.modelplane.ai/models/model-service/

**API:** [`modelplane.ai/v1alpha1` · ModelService]({{< ref "/reference/modelservices" >}})
<!-- vale write-good.Passive = NO -->
A [`ModelDeployment`]({{< ref "model-deployment.md" >}}) serves a model, but its
replicas are scattered across the fleet with no single address. A `ModelService`
gives them one: a stable, unified, OpenAI-compatible URL that load-balances
across every replica, wherever it runs.

A service selects what to route to by label. Behind the scenes, Modelplane
creates one `ModelEndpoint`, a single reachable backend, for each replica of a
deployment and labels it. Two of those labels carry routing intent:

- `modelplane.ai/deployment`: the deployment the replica belongs to.
- `modelplane.ai/cluster`: the cluster the replica runs on.

Modelplane creates an endpoint only once its replica is Ready, serving and
reachable, and withdraws it if the replica later goes unhealthy. A service only
ever routes to replicas that can actually answer, so a deployment that's still
starting or scaling up has fewer endpoints behind its URL until those replicas
come up. You don't create endpoints yourself. You point a service at them.

`spec.endpoints` is a list, and the entries combine: the service routes to every
endpoint that any entry matches. The patterns below build on that.

## Route to a whole deployment

The common case: one selector matching a deployment's name reaches every replica,
wherever in the fleet they run.

```yaml {nocopy=true}
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b   # every replica of this deployment
```

## Route to part of a deployment

Add a second label to narrow within a deployment. A selector matches an endpoint
only when all its labels match, so pairing the deployment with a cluster routes to
just that cluster's replicas. This is how you take a cluster out of service
without redeploying: point the service at the clusters you want and leave one out,
and traffic drains to the rest.

```yaml {nocopy=true}
spec:
  endpoints:
  # Only the replicas on prod-us-east, e.g. while draining another cluster.
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b
        modelplane.ai/cluster: prod-us-east
```
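
The two selector rules described so far (every label in an entry must match, and entries combine) can be modeled in a few lines. This is a sketch of the semantics only, not Modelplane's code; the endpoint names and the `prod-us-west` cluster are hypothetical.

```python
def entry_matches(match_labels: dict, endpoint_labels: dict) -> bool:
    """One entry matches only when every label it lists has the same value on the endpoint."""
    return all(endpoint_labels.get(k) == v for k, v in match_labels.items())


def routed_endpoints(entries: list, endpoints: dict) -> list:
    """A service routes to every endpoint that at least one entry matches."""
    return [
        name
        for name, labels in endpoints.items()
        if any(entry_matches(entry, labels) for entry in entries)
    ]


# Hypothetical endpoints: two replicas of one deployment on different clusters.
endpoints = {
    "qwen3-8b-a": {"modelplane.ai/deployment": "qwen3-8b", "modelplane.ai/cluster": "prod-us-east"},
    "qwen3-8b-b": {"modelplane.ai/deployment": "qwen3-8b", "modelplane.ai/cluster": "prod-us-west"},
}

whole = [{"modelplane.ai/deployment": "qwen3-8b"}]
part = [{"modelplane.ai/deployment": "qwen3-8b", "modelplane.ai/cluster": "prod-us-east"}]

print(routed_endpoints(whole, endpoints))  # ['qwen3-8b-a', 'qwen3-8b-b']
print(routed_endpoints(part, endpoints))   # ['qwen3-8b-a']
```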

## Route across several deployments

Give more than one entry to front several deployments behind the same URL. Each
entry contributes its matched endpoints, and traffic spreads evenly across every
one.

```yaml {nocopy=true}
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-8b-v2
```

This is the shape an A/B test or a canary rollout would take, but note traffic is
split **evenly** across the matched endpoints today. Weighting one entry over
another, to send, say, 5% of traffic to a canary, is tracked in
[#90](https://github.com/modelplaneai/modelplane/issues/90). Until then the split
follows endpoint counts, not a ratio you set.
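
To see what "follows endpoint counts" means in practice, here is a small worked example. It assumes each matched endpoint receives an equal slice of traffic, as described above; the replica counts are hypothetical.

```python
from fractions import Fraction


def traffic_share(endpoint_counts: dict) -> dict:
    """Share of traffic per deployment when every endpoint gets an equal slice."""
    total = sum(endpoint_counts.values())
    return {name: Fraction(count, total) for name, count in endpoint_counts.items()}


# Hypothetical: four Ready replicas of the stable deployment, one of the canary.
shares = traffic_share({"qwen3-8b": 4, "qwen3-8b-v2": 1})
for name, share in shares.items():
    print(f"{name}: {float(share):.0%}")
# qwen3-8b: 80%
# qwen3-8b-v2: 20%
```

Under the same assumption, approximating a 5% canary would take 19 stable endpoints for each canary endpoint, which is why a weight you set directly is the better long-term answer.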

The entries don't have to be deployments. One can select a manually created
[ModelEndpoint]({{< ref "model-endpoint.md" >}}) that points at an external
provider, so a service can send overflow or break-glass traffic to a SaaS
endpoint alongside your own replicas:

```yaml {nocopy=true}
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: kimi-k2
  - selector:
      matchLabels:
        modelplane.ai/external-provider: together
```

Endpoints with different path layouts coexist behind the same URL.

## Sending a request

The service's public address is on `status.address`, in the form
`http://<gateway>/<namespace>/<service-name>`:

```bash
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
```

Append the OpenAI path and send a request. The `model` field is the name the
engine serves (its `--served-model-name`, or the model's Hugging Face id if you
didn't set one):

```bash
curl "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

## Alternate APIs

We call the endpoint OpenAI-compatible because the engines are, not because
Modelplane imposes it. The route matches the `/<namespace>/<service>/` prefix and
preserves the path below it on the way to the engine, so any API the engine serves
is reachable on the same URL.

Take a vLLM replica that also serves the Anthropic Messages API. It answers on
`.../v1/messages`, so a client that speaks it (including Claude Code, via
`ANTHROPIC_BASE_URL`) talks to it directly. The engine's operational paths come
through the same way: `.../health` and the Prometheus `.../metrics` are reachable
on the service URL.
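
A minimal model of that path handling follows. It assumes the route strips the `/<namespace>/<service>` prefix and forwards the remainder unchanged; `engine_path` is a hypothetical name used for illustration, not part of Modelplane.

```python
from typing import Optional


def engine_path(request_path: str, namespace: str, service: str) -> Optional[str]:
    """Return the path the engine would see, or None if the route doesn't match."""
    prefix = f"/{namespace}/{service}/"
    if not request_path.startswith(prefix):
        return None
    # Everything below the prefix is preserved as-is.
    return "/" + request_path[len(prefix):]


for path in (
    "/ml-team/qwen/v1/chat/completions",   # OpenAI chat completions
    "/ml-team/qwen/v1/messages",           # Anthropic Messages, if the engine serves it
    "/ml-team/qwen/metrics",               # Prometheus metrics
):
    print(engine_path(path, "ml-team", "qwen"))
# /v1/chat/completions
# /v1/messages
# /metrics
```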

There's one exception, and it's set by the deployment rather than the service.
[Disaggregated serving]({{< ref "model-deployment.md#disaggregated-serving" >}})
reads OpenAI-format request bodies to pick a prefill and decode worker, so a
request in another API shape still reaches the engine but skips that
cache-aware routing. Unified serving forwards every API shape the same way.

## Example

{{< manifests "concepts/model-service.yaml" >}}
<!-- vale write-good.Passive = YES -->


---

# How it schedules

Source: https://docs.modelplane.ai/architecture/scheduling/

**API:** [`modelplane.ai/v1alpha1` · ModelDeployment]({{< ref "/reference/modeldeployments" >}})
<!-- vale write-good.Passive = NO -->
When an ML team creates a [ModelDeployment]({{< ref "/models/model-deployment.md" >}}),
the fleet scheduler decides which cluster each replica runs on and which node
pool each engine uses. Platform teams don't drive it directly, but what they
publish (the clusters, their labels, and each pool's
[InferenceClass]({{< ref "/platform/inference-class.md" >}})) is exactly what the
scheduler matches against. This page explains how it places work and where it
deliberately stops short, so you can reason about why a deployment landed where it
did.

## A pure function of observed state

The scheduler recomputes the whole placement from scratch on every reconcile. It
reads the deployment, every `InferenceCluster` with its published capacity, and
every existing `ModelReplica`, and returns a placement. Given the same inputs it
returns the same placement, so it's safe to run continuously.

The key consequence is stability. Existing replicas are *inputs*, not decisions.
A healthy replica is never moved to improve the global picture, even if a better
cluster appears later. This keeps placement from churning underneath a running
deployment.

## Two-level matching

The scheduler picks a `(cluster, pool)` for each replica in two stages, matching
against what the platform team published.

1. **Clusters** are filtered by `clusterSelector.matchLabels` against the
   standard Kubernetes labels on each `InferenceCluster`: tier, region, provider,
   compliance posture. This is organizational metadata, so string equality is
   enough. An unset selector matches every cluster.
2. **Pools** are filtered by matching each device request in a member's
   `nodeSelector.devices` against the devices a pool's `InferenceClass` publishes.
   A request is a real DRA request: a `count` and CEL selectors over a device's
   attributes and capacity, such as "a GPU with at least 141Gi of memory." A pool
   fits a member when it has devices satisfying every request, in sufficient
   `count` to cover them.

The CEL is the same expression an ML engineer would write in a DRA
`ResourceClaim`, evaluated against the devices the `InferenceClass` declares. The
keys a platform team puts on a class are the contract: a `nodeSelector` matches a
pool only if the class publishes the attributes and capacity it asks for.
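
The two stages can be sketched as plain filters. This is an illustrative model, not the scheduler's implementation: a Python callable stands in for the CEL selector, the class and function names are hypothetical, and the capacities are made-up round numbers.

```python
from dataclasses import dataclass, field
from typing import Callable

GI = 2**30


@dataclass
class Device:
    driver: str
    count: int                                      # how many a node has
    attributes: dict = field(default_factory=dict)
    capacity: dict = field(default_factory=dict)    # values in bytes


@dataclass
class Request:
    count: int
    selector: Callable[[Device], bool]              # stand-in for a CEL expression


@dataclass
class Cluster:
    name: str
    labels: dict
    pools: dict                                     # pool name -> devices its class publishes


def cluster_matches(match_labels: dict, cluster: Cluster) -> bool:
    # Stage 1: string equality on labels; an empty selector matches every cluster.
    return all(cluster.labels.get(k) == v for k, v in match_labels.items())


def pool_fits(requests: list, devices: list) -> bool:
    # Stage 2: every request needs a matching device with enough count.
    return all(
        any(r.selector(d) and d.count >= r.count for d in devices)
        for r in requests
    )


def candidates(match_labels: dict, requests: list, clusters: list) -> list:
    return [
        (c.name, pool)
        for c in clusters
        if cluster_matches(match_labels, c)
        for pool, devices in c.pools.items()
        if pool_fits(requests, devices)
    ]


small = Device("gpu.nvidia.com", count=1, capacity={"memory": 22 * GI})
large = Device("gpu.nvidia.com", count=8, capacity={"memory": 141 * GI})
clusters = [
    Cluster("eks-us-east", {"tier": "prod"}, {"small-pool": [small], "large-pool": [large]}),
    Cluster("gke-dev", {"tier": "dev"}, {"large-pool": [large]}),
]
# "Eight GPUs with at least 141Gi of memory each."
want = [Request(count=8, selector=lambda d: d.capacity.get("memory", 0) >= 141 * GI)]

print(candidates({"tier": "prod"}, want, clusters))  # [('eks-us-east', 'large-pool')]
print(candidates({}, want, clusters))
# [('eks-us-east', 'large-pool'), ('gke-dev', 'large-pool')]
```

The sketch checks each request independently, so it doesn't model two requests competing for the same devices.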

## Co-scheduling and pools

A replica is a set of engines placed together on one cluster. Within a replica,
every member of a single engine is placed on **one** pool: each member carries
its own `nodeSelector`, but the scheduler requires a single pool that satisfies
them all.

It works this way because a gang's members coordinate over their pool's
interconnect fabric, and the scheduler can't reason about fabric. Pool identity
is the finest grain it has. An engine split across pools risks landing its
members on different fabrics. The collective then never forms, and the gang hangs
with no clear error. To avoid that, the scheduler never splits an engine: an engine that no
single pool satisfies isn't scheduled on that cluster. Different engines of the same replica
can use different pools, but all on the same cluster.

```mermaid
graph TD
    subgraph cluster ["One InferenceCluster"]
        subgraph pool1 ["Pool A"]
            L["prefill engine\nLeader + Worker\n(whole gang, one pool)"]
        end
        subgraph pool2 ["Pool B"]
            D["decode engine\nStandalone"]
        end
    end

    R["ModelReplica"] --> L
    R --> D
```

A member with no `nodeSelector` claims no devices. It matches the engine's pool
at no node cost and rides along on the gang's nodes, packed there by the
cluster's own scheduler.

## Counting capacity in nodes

Capacity is gated on **nodes**, not on individual GPUs. The only number the
scheduler reads from a member is its node cost:

```text
nodes = pods × copies
pods  = 1 for a Standalone or Leader, or worker.nodes for a Worker
```

A member that resolves no `claim: DRA` device, because it carried no
`nodeSelector` or matched only synthetic devices, costs zero nodes. The scheduler
sums the cost of a replica's members and places the replica only where every
engine's pool has enough free nodes, tracking a running ledger so it never
overcommits a cluster.
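
The node-cost rule above can be written out directly. The sketch below is a model under assumptions: `Member`, `claims_dra`, and `replica_fits` are hypothetical names, and the running ledger is reduced to a dictionary of free nodes per pool.

```python
from dataclasses import dataclass


@dataclass
class Member:
    role: str                 # "Standalone", "Leader", or "Worker"
    copies: int = 1
    worker_nodes: int = 1     # stands in for worker.nodes; only read for a Worker
    claims_dra: bool = True   # False: no nodeSelector, or only synthetic devices matched


def node_cost(member: Member) -> int:
    """nodes = pods x copies, or zero when no DRA device is claimed."""
    if not member.claims_dra:
        return 0
    pods = member.worker_nodes if member.role == "Worker" else 1
    return pods * member.copies


def replica_fits(engines: list, free_nodes: dict) -> bool:
    """engines is a list of (pool, members); every pool must cover its engines' total cost."""
    needed = {}
    for pool, members in engines:
        needed[pool] = needed.get(pool, 0) + sum(node_cost(m) for m in members)
    return all(cost <= free_nodes.get(pool, 0) for pool, cost in needed.items())


gang = [Member("Leader"), Member("Worker", worker_nodes=3)]   # 1 + 3 = 4 nodes
decode = [Member("Standalone", copies=2)]                     # 2 nodes
replica = [("pool-a", gang), ("pool-b", decode)]

print(node_cost(Member("Worker", worker_nodes=3)))            # 3
print(node_cost(Member("Standalone", claims_dra=False)))      # 0
print(replica_fits(replica, {"pool-a": 4, "pool-b": 2}))      # True
print(replica_fits(replica, {"pool-a": 3, "pool-b": 2}))      # False
```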

This accounting is deliberately coarse. The control-plane scheduler answers
"could this cluster plausibly host this replica," not "exactly which GPU does
each pod get." Device-level contention between deployments is left to DRA
admission on the workload cluster, which is authoritative: it rejects a pod whose
`ResourceClaim` can't be satisfied, and the next reconcile sees the updated
state.

## Pinning placement to a pool

The scheduler's pool choice is enforced, not advisory. Each scheduled pod carries
a Kubernetes `nodeSelector` on the `modelplane.ai/pool` node label, so it can only
land on the pool the scheduler chose. Without it, the cluster's scheduler could
place a pod on any pool whose devices match its DRA claim, and the fleet's
per-pool accounting would drift from where pods actually run.

Modelplane labels the nodes of every pool it provisions. On a BYO
(`source: Existing`) cluster it doesn't provision the nodes, so the operator must
label each pool's nodes `modelplane.ai/pool=<nodePools[].name>` themselves, or
worker pods for that pool stay `Pending`.

## Scaling, retention, and re-placement

Scheduling runs in two phases each reconcile:

<!-- vale write-good.TooWordy = NO -->
- **Retain.** Each existing replica keeps its cluster if the cluster still exists
  and every member's pinned pool still matches its (possibly edited)
  `nodeSelector`. A degraded cluster, one that's not Ready or has no gateway
  address, is still retained; transient outages surface through the deployment's
  conditions, not re-placement.
- **Fill.** If the deployment wants more replicas than were retained, the
  shortfall is placed one at a time, each onto the eligible cluster hosting the
  fewest of this deployment's replicas, spreading before packing. If it wants
  fewer, the highest-index replicas are dropped first.
<!-- vale write-good.TooWordy = YES -->

A replica never changes cluster. If its cluster is deleted, the replica stops
being emitted, Crossplane garbage-collects it, and the fill phase mints a fresh
replica elsewhere. Moving is always delete-plus-create, mirroring how Kubernetes
treats a pod whose node is gone.
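
The retain and fill phases can be modeled as one function over observed state. This is a sketch, not the real scheduler: it assumes replicas carry integer indices, folds matching and capacity into the `eligible` list, and breaks ties by cluster name. All names are hypothetical.

```python
def schedule(desired: int, existing: dict, eligible: list, still_fits) -> dict:
    """Return replica index -> cluster.

    existing:   replica index -> cluster, as observed on this reconcile.
    eligible:   clusters that pass matching and have room for a new replica.
    still_fits: callable(cluster) -> bool; True if the cluster still exists
                and the replica's pinned pools still match.
    """
    # Retain: existing replicas are inputs. They keep their cluster or are dropped.
    placement = {i: c for i, c in existing.items() if still_fits(c)}

    # Scale down: drop the highest-index replicas first.
    while len(placement) > desired:
        del placement[max(placement)]

    # Fill: place the shortfall one at a time, spreading before packing.
    next_index = max(existing, default=-1) + 1
    while len(placement) < desired and eligible:
        load = {c: 0 for c in eligible}
        for c in placement.values():
            if c in load:
                load[c] += 1
        placement[next_index] = min(eligible, key=lambda c: (load[c], c))
        next_index += 1
    return placement


always = lambda cluster: True

# Scale up from 2 to 4: the healthy replicas on "a" stay put, new ones spread to "b".
print(schedule(4, {0: "a", 1: "a"}, ["a", "b"], always))
# {0: 'a', 1: 'a', 2: 'b', 3: 'b'}

# Cluster "a" is deleted: its replica is dropped and a fresh one is minted elsewhere.
print(schedule(2, {0: "a", 1: "b"}, ["b", "c"], lambda c: c != "a"))
# {1: 'b', 2: 'c'}
```

Calling the function twice with the same inputs gives the same placement, which is the property that makes it safe to run on every reconcile.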

## Known limitations

The scheduler is built to be conservative and predictable rather than optimal.
Two limits follow from that, both tracked for future work:

- **A whole node is charged per pod**
  ([#172](https://github.com/modelplaneai/modelplane/issues/172)). A pod that
  claims one GPU of an eight-GPU node still charges the whole node in the
  scheduler's accounting. This is safe: it can only under-count a pool's
  capacity, never overcommit it. But it can strand GPUs on deployments of
  sub-node engines.
- **An engine can't span pools, even on one fabric**
  ([#149](https://github.com/modelplaneai/modelplane/issues/149)). Because the
  scheduler has no concept of fabric, it refuses to split a gang across pools at
  all. That forecloses a legitimate case, GPU workers on one pool and a no-GPU
  coordinator on another within the same fabric, until fabric-aware placement
  lands.
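
To put a number on the first limit, the following worked example counts the GPUs that the fleet accounting treats as used but that no pod claims. It assumes one node is charged per pod, as described above; the function name and the pod counts are illustrative.

```python
def stranded_gpus(pods: int, gpus_per_pod: int, gpus_per_node: int) -> int:
    """GPUs charged by the whole-node accounting but never claimed by the pods."""
    charged = pods * gpus_per_node    # a whole node is charged per pod
    claimed = pods * gpus_per_pod     # what the pods actually bind through DRA
    return charged - claimed


print(stranded_gpus(pods=1, gpus_per_pod=1, gpus_per_node=8))   # 7
print(stranded_gpus(pods=4, gpus_per_pod=2, gpus_per_node=8))   # 24
```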
<!-- vale write-good.Passive = YES -->


---

# How Modelplane works

Source: https://docs.modelplane.ai/overview/how-it-works/

<!-- vale write-good.Passive = NO -->
Modelplane runs as a control plane on its own cluster, the **control cluster**,
above the **inference clusters** that actually serve models. It's built on
[Crossplane](https://crossplane.io): platform teams and developers describe what
they want as Kubernetes resources, and Modelplane continuously reconciles the
fleet to match, composing the clusters, scheduling replicas, and exposing
endpoints. This page is the full tour. It covers the architecture and resources, then walks through what happens when you deploy a model.

## Modelplane API

Modelplane's API is two sets of resources, one per team, with everything in
between filled in for you. Platform teams describe the fleet, ML teams describe a
model, and Modelplane composes the rest.

<div class="mp-lanes">
  <div class="mp-lane mp-lane--platform">
    <div class="mp-lane-title">Platform team creates</div>
    <a class="mp-chip" href="{{< ref "/platform/inference-gateway" >}}">
      <span class="mp-chip-name">InferenceGateway</span>
      <span class="mp-chip-desc">The unified, OpenAI-compatible entry point on the control cluster.</span>
    </a>
    <a class="mp-chip" href="{{< ref "/platform/inference-class" >}}">
      <span class="mp-chip-name">InferenceClass</span>
      <span class="mp-chip-desc">A tested hardware recipe for a node pool: the devices it offers and how to provision it.</span>
    </a>
    <a class="mp-chip" href="{{< ref "/platform/inference-cluster" >}}">
      <span class="mp-chip-name">InferenceCluster</span>
      <span class="mp-chip-desc">A Kubernetes cluster in the fleet, provisioned by Modelplane or brought as-is.</span>
    </a>
  </div>
  <div class="mp-lane mp-lane--ml">
    <div class="mp-lane-title">ML team creates</div>
    <a class="mp-chip" href="{{< ref "/models/model-deployment" >}}">
      <span class="mp-chip-name">ModelDeployment</span>
      <span class="mp-chip-desc">A model to serve, with engines, replica count, and an optional cache.</span>
    </a>
    <a class="mp-chip" href="{{< ref "/models/model-service" >}}">
      <span class="mp-chip-name">ModelService</span>
      <span class="mp-chip-desc">One OpenAI-compatible endpoint, load-balanced across the endpoints it selects.</span>
    </a>
    <a class="mp-chip" href="{{< ref "/models/model-cache" >}}">
      <span class="mp-chip-name">ModelCache</span>
      <span class="mp-chip-desc">Model weights staged once per cluster on shared storage.</span>
    </a>
  </div>
  <div class="mp-lane mp-lane--composed">
    <div class="mp-lane-title">Modelplane composes</div>
    <a class="mp-chip" href="{{< ref "/models/model-deployment" >}}">
      <span class="mp-chip-name">ModelReplica</span>
      <span class="mp-chip-desc">One complete serving instance on a specific cluster.</span>
    </a>
    <a class="mp-chip" href="{{< ref "/models/model-endpoint" >}}">
      <span class="mp-chip-name">ModelEndpoint</span>
      <span class="mp-chip-desc">A reachable endpoint, one per replica or set manually for an external provider.</span>
    </a>
  </div>
</div>

The hierarchy mirrors Kubernetes core one scope up: `ModelDeployment` →
`ModelReplica` → `ModelService` → `ModelEndpoint` parallels `Deployment` → `Pod` → `Service` →
`Endpoint`, across a fleet instead of within a single cluster.

## What the control plane reconciles

Once the resources exist, Modelplane keeps the fleet matching them. Five concerns
run continuously:

1. **Provisioning.** From an `InferenceCluster`, Modelplane creates a full cluster
   and its GPU node pools, or brings in a cluster you already run on any
   Kubernetes distribution, and installs the serving stack on each.
2. **Scheduling.** A two-level scheduler places work: it pins each `ModelReplica`
   to a cluster and pool whose hardware meets the model's requirements, then the
   cluster's own scheduler binds the GPUs to the serving pods through DRA.
3. **Autoscaling.** Replicas are the scaling axis. Scaling a `ModelDeployment`'s
   `spec.replicas` adds or removes whole serving instances through the standard
   Kubernetes scale subresource, so `kubectl scale` and a KEDA `ScaledObject`
   both work out of the box.
4. **Routing.** A `ModelService` exposes one OpenAI-compatible endpoint through
   the gateway and load-balances across the deployment's `ModelEndpoints`,
   wherever their replicas run. `ModelEndpoints` can also point at external
   inference services.
5. **Caching.** A `ModelCache` stages model weights on cluster storage once, so
   serving pods read them locally instead of re-downloading on every start.

## Universal compatibility

Modelplane is deliberately unopinionated about the engine. A `ModelDeployment`
describes the *shape* of a deployment (how many pods, on how many nodes, with
which devices) and nothing about how the engine runs internally. The engine flags
you write carry parallelism (tensor, pipeline, data, expert), quantization, and KV
transfer; Modelplane never injects them.

This is what lets one API serve any container-based engine and any topology
without special cases. Modelplane composes the engine onto the right cluster
resource and injects almost nothing, just the address a multi-node leader is
reachable at, so a worker can join it. New engines and new parallelism strategies
work without a change to Modelplane. The community publishes recipes (worked, copyable
manifests) to bridge the gap that flexibility leaves, rather than hard-coding
choices into the API.

## Fleet scheduler

For each replica, the scheduler picks a `(cluster, pool)` in two steps:

1. **Filter clusters** by `clusterSelector.matchLabels` against the standard
   Kubernetes labels on each `InferenceCluster`, the organizational metadata:
   tier, region, provider, compliance posture.
2. **Filter pools** by matching each device request in the deployment's
   `nodeSelector.devices` against the pool's `InferenceClass`. A request is based
   on DRA: a `count` and CEL selectors over a device's attributes and capacity, like
   "a GPU with at least 141Gi of memory." A pool fits when it has the devices the
   model asks for and enough free nodes to hold a replica.

Capacity is accounted at the node level across the fleet, so Modelplane never
overcommits a pool. Replicas are pinned to their cluster once placed and stay
there across reconciles; if a cluster is deleted, the scheduler re-places its
replicas elsewhere. [How it schedules]({{< ref "/architecture/scheduling.md" >}})
covers the placement rules and their limits in full.

## Deploying a model

Creating a `ModelDeployment` kicks off the loop end to end. The scheduler
discovers the ready clusters (filtered by your label selector if you set one),
matches each engine's device requests against their pools, and pins each replica
to a cluster that fits. Modelplane composes a `ModelReplica` on each chosen
cluster, turns it into the right serving workload there, creates a `ModelEndpoint`
per replica, and your `ModelService` routes traffic across them through one stable
endpoint on the gateway. Scale the deployment up or down and the same loop
re-converges.

## Serving topologies

A single-node deployment composes to a Kubernetes Deployment fronted by a
service. When a model is too large for one node, an engine becomes a gang: a
`Leader` member and one or more `Worker` members that Modelplane composes into a
LeaderWorkerSet, serving the model together across nodes. Gang deployments
should stage their weights through a `ModelCache`, so the pods share one copy
instead of each pulling the same model.

Disaggregated serving splits prefill and decode into separate engines
(`serving.mode: PrefillDecode`) that run on the same cluster and hand off the KV
cache between them. Modelplane wires up the cluster-edge routing that pairs each
request's prefill and decode; the engines carry the KV-transfer flags. Both are
described in full in the [model deployment docs]({{< ref "/models/model-deployment" >}}).

## Next steps

{{< cardgroup cols="2" >}}
{{< card title="FAQ" href="/overview/faq/" >}}
Quick answers on how Modelplane compares and what it requires.
{{< /card >}}
{{< card title="Get started" href="/getting-started/" >}}
Put it together: deploy Modelplane and serve a model.
{{< /card >}}
{{< /cardgroup >}}
<!-- vale write-good.Passive = YES -->


---

# Qwen3-Coder-480B

Source: https://docs.modelplane.ai/examples/qwen3-coder/

<!-- vale write-good.Passive = NO -->
A 480B code MoE (35B active). Two validated shapes: the BF16 weights span two
H200 nodes as a gang over EFA, served from a `ModelCache`; the FP8 checkpoint
fits one node, so it runs as a single `Standalone` engine on SGLang with no
cache.

Both shapes were run end to end; the `InferenceClass` and `ModelDeployment` are
the exact manifests from those runs. Apply the platform side first, then the ML
side. The `InferenceCluster` carries an EC2 capacity reservation placeholder to
edit before applying.

## Platform

{{< tabs >}}
{{< tab "Multi-node (BF16)" >}}
{{< manifests "examples/qwen3-coder/inference-class.yaml" >}}

{{< manifests path="examples/qwen3-coder/inference-cluster.yaml" apply="false" >}}

{{< editCode >}}
```bash
curl -fsSL {{< manifest-url "examples/qwen3-coder/inference-cluster.yaml" >}} \
  | sed 's/cr-0123456789abcdef0/$@<your-reservation-id>$@/' \
  | kubectl apply -f -
```
{{< /editCode >}}
{{< /tab >}}
{{< tab "Single-node (FP8)" >}}
{{< manifests "examples/qwen3-coder/inference-class-fp8.yaml" >}}
{{< /tab >}}
{{< /tabs >}}

## Deployment

{{< tabs >}}
{{< tab "Multi-node (BF16)" >}}
{{< manifests "examples/qwen3-coder/model-cache.yaml" >}}

{{< manifests "examples/qwen3-coder/model-deployment.yaml" >}}

{{< manifests "examples/qwen3-coder/model-service.yaml" >}}
{{< /tab >}}
{{< tab "Single-node (FP8)" >}}
{{< manifests "examples/qwen3-coder/model-deployment-fp8.yaml" >}}

{{< manifests "examples/qwen3-coder/model-service-fp8.yaml" >}}
{{< /tab >}}
{{< /tabs >}}
<!-- vale write-good.Passive = YES -->


---

# Cache Model Weights

Source: https://docs.modelplane.ai/models/model-cache/

<!-- vale write-good.Passive = NO -->
**API:** [`modelplane.ai/v1alpha1` · ModelCache]({{< ref "/reference/modelcaches" >}})

A `ModelCache` stages a model's weights on shared workload-cluster storage,
fetched once from the configured source rather than downloaded again on every pod
start. `ModelDeployments` reference a cache via `spec.modelCacheRef.name`, and
Modelplane mounts it at `/mnt/models` in every serving pod, shared across the
pods of a multi-node engine. The engine reads weights locally from the mount.

`ModelCache` is recommended for multi-node deployments and optional for
single-node cold-start optimization.

## What to cache

The required `source` enum names the kind, with the matching source object set
alongside it. Setting `source: HuggingFace` selects `spec.huggingFace`, which
carries the `repo` to fetch, an optional `revision` (branch, tag, or commit), and
`sizeGiB`, how much storage the weights get on each cluster. Size it to the
model, since a value below the model's size leaves no room to stage the weights.
`HuggingFace` is the only source today.
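
A rough way to pick `sizeGiB` is to estimate the weights from the parameter count and precision, then add headroom. The sketch below is a rule of thumb under assumptions (the 20% headroom factor and both function names are illustrative); the repository's actual file sizes are the authoritative figure.

```python
import math


def weights_gib(parameters: float, bytes_per_parameter: float) -> float:
    """Approximate size of the weights in GiB."""
    return parameters * bytes_per_parameter / 2**30


def suggested_size_gib(parameters: float, bytes_per_parameter: float, headroom: float = 1.2) -> int:
    """Round the estimate up after applying an assumed headroom factor."""
    return math.ceil(weights_gib(parameters, bytes_per_parameter) * headroom)


# A 480B-parameter model in BF16 (2 bytes per parameter).
print(round(weights_gib(480e9, 2)))     # 894
print(suggested_size_gib(480e9, 2))     # 1073
```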

The cache mounts at `/mnt/models` on every consuming pod, so the engine's args
reference that path (`--model=/mnt/models` for vLLM) rather than the source.

## Authenticating

A gated or private model needs a credential to fetch. When a cache stages the
weights, the credential lives on the cache: set `authSecret` to name a Secret in
the cache's namespace, and Modelplane propagates it to every cluster the cache
stages to, so that hydration there can read it.

Create the Secret once on the control plane, then reference it:

```bash
kubectl create secret generic hf-token \
  --namespace ml-team \
  --from-literal=HF_TOKEN=hf_xxxxxxxx
```

```yaml {nocopy=true}
spec:
  source: HuggingFace
  huggingFace:
    repo: Qwen/Qwen3-Coder-480B-A35B-Instruct
    authSecret:
      name: hf-token         # a Secret in this ModelCache's namespace
      key: HF_TOKEN          # defaults to HF_TOKEN
    sizeGiB: 1100
```

Without a cache, the engine fetches the model itself at startup, so the
credential goes on the `ModelDeployment` instead, as `HF_TOKEN` in the engine
container's `env`.

## Where to cache

An optional `clusterSelector` scopes where the cache is staged. Omitting it
stages the cache on every cluster in the fleet; setting `matchLabels` restricts
it to clusters carrying those labels. A `ModelDeployment` that references the cache
places *new* replicas only onto clusters within this footprint, so narrowing the
selector also narrows where replicas can land: a replica never schedules to a
cluster the cache didn't stage to. Replicas already running are left where they
are.

## Loading from cache

A cache only pays off if the engine reads from it quickly. With its default
loader an engine can read a large model from shared storage slowly enough that
the cache makes cold starts *worse* than fetching the model directly, since you
pay to hydrate the cache and then wait on a slow read. Choose a fast loader with
your engine flags.

For vLLM on EKS, `--load-format=runai_streamer` reads from the EFS-backed cache
dramatically faster than the default loader (minutes rather than tens of
minutes for a large model), tuned further with `--model-loader-extra-config`:

```yaml {nocopy=true}
args:
- --model=/mnt/models
- --load-format=runai_streamer
- --model-loader-extra-config={"concurrency":16,"distributed":true}
```

The right loader and settings depend on the engine and the storage backend, so
treat these as a starting point and measure your own cold-start time. The
[Kimi-K2 example]({{< ref "/examples/kimi-k2" >}}) uses this configuration end to
end.

## Storage prerequisites

<!-- vale Google.Acronyms = NO -->
The cache PVC needs a `ReadWriteMany` (RWX) StorageClass on the workload cluster.
What the platform admin must set up depends on the cloud:
<!-- vale Google.Acronyms = YES -->

- **GKE** and **EKS:** auto-provisioned. Nothing for the admin to do.
- **Existing:** the admin sets up a `ReadWriteMany` StorageClass on the cluster.

Either way, your `ModelCache` and `ModelDeployment` specs are the same. How
storage is provided on each cluster source, and how to bring your own backend, is
covered in [Register a Cluster]({{< ref "/platform/inference-cluster.md#cache-storage" >}}).

## Example

{{< manifests "concepts/model-cache.yaml" >}}
<!-- vale write-good.Passive = YES -->


---

# Deploying a model

Source: https://docs.modelplane.ai/getting-started/deploying-a-model/


Now that the platform is provisioned, the ML team can declare what a model needs
with a `ModelDeployment`. Describe the hardware requirements, and the scheduler
places the model against the capacity the platform team published.

## Create a deployment

Create a namespace for the model:

```bash
kubectl create namespace ml-team
```

The device selector matches against the capacity declared in the
`InferenceClass`, not the pod's resource requests. Any L4 node satisfies
`>= 20Gi`, so this deployment runs on the cluster you just added:

{{< tabs >}}
{{< tab "EKS" >}}
{{< manifests "getting-started/eks/model-deployment.yaml" >}}
{{< /tab >}}
{{< tab "GKE" >}}
{{< manifests "getting-started/gke/model-deployment.yaml" >}}
{{< /tab >}}
{{< /tabs >}}

Wait until `REPLICAS` shows `1`:

```bash
kubectl get md -n ml-team --watch
```

To see which cluster the scheduler chose:

```bash
kubectl get modelreplica -n ml-team
```

```shell {nocopy=true}
NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   12m
```

The ML team never named a cluster. The scheduler matched the GPU requirement
(`>= 20Gi`) against the `InferenceClass` the platform team published and made
the placement. 

## Expose the model

A `ModelService` selects `ModelEndpoints` by label and creates a Gateway API
`HTTPRoute` that routes to them. Modelplane creates one `ModelEndpoint` per
replica, labeled with the deployment name:

{{< manifests "getting-started/model-service.yaml" >}}

The request path is `/<namespace>/<modelservice-name>/...` (`/ml-team/qwen/` in
this example), from the `ModelService` named `qwen`. The `model` field in the
request body is the Hugging Face id `Qwen/Qwen2.5-0.5B-Instruct`, since this
deployment doesn't set `--served-model-name`.

## Send a request

Read the endpoint's public address from the `ModelService` status:

```bash
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
```

Send a request to it:

```bash
kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'
```

The request routes to the replica on the cluster Modelplane placed it on.
You should get a response in a few seconds:

```json {nocopy=true}
{
  "id": "chatcmpl-c88b1429-067d-40a5-971c-ab9c54153c26",
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Kubernetes (K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides scalable orchestration capabilities that enable developers to deploy complex applications quickly and efficiently across various environments."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 48,
    "total_tokens": 85
  }
}

```

## Next step

The platform team declared capacity, and in this guide the ML team deployed a
model behind a stable endpoint. Neither team needed to know what the other was
doing. Modelplane matched them.

In the next step, the platform team grows the fleet. [Scale the platform]({{< ref "getting-started/scale-the-platform.md" >}}) to add more clusters across regions.


---

# Kimi-K2

Source: https://docs.modelplane.ai/examples/kimi-k2/

<!-- vale write-good.Passive = NO -->
A 1T MoE (1 trillion parameters) served prefill/decode disaggregated across two
H200 nodes: two engines, one per phase, with Modelplane composing the llm-d
routing layer between them. This recipe serves an INT4 quantization of the
model; the native FP8 weights need four such nodes.

This recipe was run end to end; the `InferenceClass` and `ModelDeployment` are
the exact manifests from that run. Apply the platform side first, then the ML
side. The `InferenceCluster` carries an EC2 capacity reservation placeholder to
edit before applying.

## Platform

{{< manifests "examples/kimi-k2/inference-class.yaml" >}}

{{< manifests path="examples/kimi-k2/inference-cluster.yaml" apply="false" >}}

{{< editCode >}}
```bash
curl -fsSL {{< manifest-url "examples/kimi-k2/inference-cluster.yaml" >}} \
  | sed 's/cr-0123456789abcdef0/$@<your-reservation-id>$@/' \
  | kubectl apply -f -
```
{{< /editCode >}}

## Deployment

{{< manifests "examples/kimi-k2/model-cache.yaml" >}}

{{< manifests "examples/kimi-k2/model-deployment.yaml" >}}

{{< manifests "examples/kimi-k2/model-service.yaml" >}}
<!-- vale write-good.Passive = YES -->


---

# Register a Cluster

Source: https://docs.modelplane.ai/platform/inference-cluster/

**API:** [`modelplane.ai/v1alpha1` · InferenceCluster]({{< ref "/reference/inferenceclusters" >}})
<!-- vale write-good.Passive = NO -->
An `InferenceCluster` represents a Kubernetes cluster configured for model
serving. Platform teams create these to provide GPU capacity.


Each cluster has:

- A **cluster source**: `GKE` or `EKS` (Modelplane provisions the full cluster)
  or `Existing` (bring a cluster you manage yourself). See
  [Supported Providers]({{< ref "platform/providers.md" >}}) for the clouds and
  neoclouds Modelplane runs on.
- One or more **node pools**, each referencing an `InferenceClass` for its
  hardware capabilities and provisioning recipe.
- **Labels** for organizational metadata: tier, region, provider. These are the
  matching surface for `ModelDeployment.clusterSelector`.

Modelplane installs the serving stack it needs on every cluster it manages,
including existing clusters, which it assumes are solely for its use.

## Ownership and requirements

Modelplane assumes exclusive ownership of every `InferenceCluster`. The fleet
scheduler's capacity accounting relies on Modelplane being the only thing placing
GPU workloads on the cluster, so dedicate each cluster to Modelplane rather than
sharing it with other workloads.

Modelplane also has opinions about how a cluster is set up: its Kubernetes
version, the components it installs, and required features like DRA for binding
GPUs to pods. On provisioned clusters Modelplane handles this for you. On an
existing cluster the platform team must meet the requirements.

## Provisioned and existing clusters

The `cluster.source` discriminator picks one of two models:

- **Provisioned (`GKE`, `EKS`).** Modelplane creates the cluster and its GPU node
  pools from each pool's `InferenceClass`, labels the pool's nodes so the
  scheduler's placement is enforced, and provisions the storage class for model
  weights. It also injects a non-GPU **system pool** with opinionated defaults to
  run the inference stack, so you only declare the GPU pools you want.
- **Existing (`Existing`).** A kubeconfig `Secret` provides access to a cluster
  you run yourself. Modelplane installs the serving stack it needs but doesn't
  provision infrastructure, and each pool's `InferenceClass` provides hardware
  capabilities for scheduling only. You're responsible for the cluster meeting
  Modelplane's requirements, including labeling each pool's nodes
  `modelplane.ai/pool=<pool-name>` (see
  [how scheduling pins placement]({{< ref "/architecture/scheduling.md#pinning-placement-to-a-pool" >}})).

## Examples

{{< tabs >}}
{{< tab "GKE" >}}
{{< manifests path="concepts/inference-cluster-gke.yaml" apply="false" >}}
{{< /tab >}}
{{< tab "EKS" >}}
{{< manifests path="concepts/inference-cluster-eks.yaml" apply="false" >}}
{{< /tab >}}
{{< tab "Existing" >}}
{{< manifests path="concepts/inference-cluster-existing.yaml" apply="false" >}}
{{< /tab >}}
{{< /tabs >}}

## Cache storage

A [ModelCache]({{< ref "/models/model-cache.md" >}}) stages model weights on a
`ReadWriteMany` (RWX) StorageClass on the workload cluster. Where that comes from
depends on the source:

<!-- vale Google.Acronyms = NO -->
- **`GKE`** (Filestore Enterprise) and **`EKS`** (EFS): auto-provisioned. Those
  classes are fixed; nothing for the admin to do.
- **`Existing`**: bring your own. Create an RWX StorageClass on the cluster, with
  any backend that supports automatic PVC provisioning (WekaIO, NetApp Trident,
  `FSx` for NetApp, and similar), and name it in
  `cluster.existing.cache.storageClassName`.
<!-- vale Google.Acronyms = YES -->

The ML team's `ModelCache` and `ModelDeployment` specs are the same regardless of
which backing storage a cluster uses.
<!-- vale write-good.Passive = YES -->


---

# FAQ

Source: https://docs.modelplane.ai/overview/faq/

<!-- vale write-good.TooWordy = NO -->
<!-- vale write-good.Passive = NO -->
Short answers to the questions that come up first, with links to the full
treatment. If you're new here, read the [Introduction]({{< ref "/overview" >}})
and [How Modelplane works]({{< ref "/overview/how-it-works" >}}) first.

## What Modelplane is

{{< qa "Is Modelplane a serving engine like vLLM?" >}}
No, Modelplane is the control plane *above* the engine. It composes serving
engines like vLLM, SGLang, and NVIDIA TensorRT-LLM, and operates them across a
fleet of clusters. It doesn't serve tokens itself. You bring the engine; Modelplane schedules
it, routes to it, scales it, and caches its weights across your inference fleet.
{{< /qa >}}

{{< qa "Does Modelplane replace vLLM or SGLang?" >}}
No, they run the model; Modelplane runs the fleet. A `ModelDeployment` carries
your engine container and its flags, and Modelplane composes it onto the right
cluster. Switching or upgrading engines is a change to your deployment, not to
Modelplane.
{{< /qa >}}

{{< qa "How is Modelplane different from KServe or NVIDIA Dynamo?" >}}
Scope. KServe and Dynamo are cluster orchestrators: they schedule, scale, route,
and cache within a single Kubernetes cluster. Modelplane runs its operations across a
fleet of clusters, clouds, and regions. Modelplane uses llm-d for multi-node serving
and KV-cache management, as do KServe and Dynamo. Modelplane is planning deeper integrations
with NVIDIA Dynamo in future releases.
{{< /qa >}}

{{< qa "How is Modelplane different from a managed provider like Baseten or Fireworks?" >}}
Managed providers run fleet-scale serving inside their own closed platform.
Modelplane is the open equivalent that runs in infrastructure you own. The
difference isn't scope: Modelplane is open, runs in your own infrastructure, is
community-driven, and is neutral across the stack. You can still route to a
managed provider from Modelplane.
{{< /qa >}}

## What it supports

{{< qa "What models does Modelplane support?" >}}
Modelplane supports any model, including open weights, custom models, and just about
anything that can be downloaded from Hugging Face, NVIDIA NGC, and other registries.
{{< /qa >}}

{{< qa "Does Modelplane support NVIDIA?" >}}
Yes, across the stack. NVIDIA is the most widely available accelerator on the
clouds Modelplane runs on and the primary target today. Modelplane binds NVIDIA
GPUs to pods through Dynamic Resource Allocation (DRA), matching devices by
attributes such as GPU memory and architecture with CEL selectors.

The software stack rides on the engine-agnostic API. NVIDIA NIM microservices and
the TensorRT-LLM engine run as engine containers like any other. Modelplane stages
weights and NIM-style artifacts from NVIDIA NGC alongside Hugging Face and other
registries, and the inference stack it installs includes NVIDIA Dynamo and llm-d,
with deeper Dynamo integration on the roadmap.
{{< /qa >}}

{{< qa "Which engines and accelerators are supported?" >}}
The API is engine-agnostic: any engine that runs as a container works, and its
flags are yours to write. Multiple accelerators are supported as long as they
can be bound through DRA, and the device model (DRA plus CEL selectors) is built to
extend to other accelerators and fabrics.
{{< /qa >}}

{{< qa "Which clouds or neoclouds does Modelplane support?" >}}
Today Modelplane provisions clusters on a few hyperscalers and neoclouds, and supports
bringing your own Kubernetes cluster anywhere. More provisioners are on the roadmap; the
bring-your-own path means you can run on any Kubernetes now. See
[Supported Providers]({{< ref "platform/providers.md" >}}) for the full matrix of clouds,
neoclouds, and their Crossplane providers.
{{< /qa >}}

{{< qa "Can I bring my own cluster, or run on a neocloud or on-premise?" >}}
Yes, an `InferenceCluster` with `source: Existing` registers a cluster you already
run, through its kubeconfig. Modelplane installs the serving stack it needs but
doesn't provision the infrastructure. This is how you run on neoclouds and
on-premise today.
{{< /qa >}}

## What it requires

{{< qa "Where does Modelplane run?" >}}
Modelplane runs as a control plane on a control cluster: an ordinary Kubernetes
cluster with Crossplane installed, with no GPUs of its own. The inference clusters
it manages do the serving, and each needs Dynamic Resource Allocation (DRA,
Kubernetes v1.35+) to bind GPUs to pods. Modelplane assumes exclusive ownership of
every inference cluster, so dedicate each one to Modelplane rather than sharing it
with other workloads.
{{< /qa >}}

<!-- vale ai-tells.FormalRegister = NO -->
{{< qa "Do I need Crossplane?" >}}
Yes, Modelplane is built on [Crossplane](https://crossplane.io) and requires it. If your
platform team already runs Crossplane to manage cloud infrastructure, Modelplane is the
same pattern applied to inference. Modelplane uses Crossplane's function framework and
shares its infrastructure providers.
{{< /qa >}}
<!-- vale ai-tells.FormalRegister  = YES -->


## What it can do

{{< qa "How does Modelplane decide where a model runs?" >}}
Two-level matching. First it filters clusters by their labels (tier, region,
provider) against your `clusterSelector`. Then it filters node pools by matching
your device requests (real DRA requests with CEL selectors over GPU memory,
architecture, and other attributes) against each pool's `InferenceClass`. It
places each replica on a cluster and pool that fits and has free capacity.
{{< /qa >}}
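
The two-level match described above can be sketched in a few lines of Python. This is an illustration of the idea, not Modelplane's scheduler: the `Pool` and `Cluster` shapes, the `place` function, and the first-fit order are all assumptions, and a plain Python predicate stands in for the CEL selectors a real device request carries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Pool:
    # Hypothetical stand-in for what a pool's InferenceClass advertises.
    name: str
    gpu_memory_gi: int
    architecture: str
    free_gpus: int


@dataclass
class Cluster:
    name: str
    labels: Dict[str, str]
    pools: List[Pool] = field(default_factory=list)


def labels_match(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Level 1: every key/value in the selector must be present on the cluster."""
    return all(labels.get(key) == value for key, value in selector.items())


def place(
    clusters: List[Cluster],
    cluster_selector: Dict[str, str],
    device_fits: Callable[[Pool], bool],
    gpus_needed: int,
) -> Optional[Tuple[str, str]]:
    """Return the first (cluster, pool) that passes both levels and has room."""
    for cluster in clusters:
        if not labels_match(cluster_selector, cluster.labels):
            continue
        for pool in cluster.pools:
            # Level 2: the device predicate (CEL in the real API) plus capacity.
            if device_fits(pool) and pool.free_gpus >= gpus_needed:
                return cluster.name, pool.name
    return None
```

For example, `place(fleet, {"region": "us-west"}, lambda p: p.gpu_memory_gi >= 40, 1)` skips clusters outside `us-west`, then skips pools whose GPUs have less than 40 Gi or no free capacity.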

{{< qa "Can I serve across regions and clusters behind one endpoint?" >}}
Yes, that's the point. A `ModelService` exposes one OpenAI-compatible endpoint and
load-balances across every replica of a deployment, wherever they run.
{{< /qa >}}

{{< qa "Can I route to a managed provider?" >}}
Yes, a `ModelService` can include a manually created `ModelEndpoint` that points at
an external SaaS endpoint like Together or Baseten alongside your self-hosted
replicas, and load-balances across all of them.
{{< /qa >}}

{{< qa "How do large or multi-node models work?" >}}
An engine can be a gang: a leader and one or more workers that Modelplane composes
into a LeaderWorkerSet across nodes. You write the coordination (like Ray or
vLLM's data-parallel coordinator) in the engine flags, and Modelplane injects
the leader's address so the workers can join it. Multi-node deployments stage
weights through a `ModelCache`.
{{< /qa >}}

{{< qa "What about disaggregated prefill/decode?" >}}
Set `serving.mode: PrefillDecode` and define separate prefill and decode engines.
Both run on the same cluster and hand off the KV cache over a fast fabric, and
Modelplane configures the cluster-edge routing that pairs each request. The
KV-transfer flags live in your engine config.
{{< /qa >}}

{{< qa "How does scaling work?" >}}
Replicas are the only scaling axis. Each replica is a complete serving instance;
scaling `spec.replicas` adds or removes whole instances across the fleet. Because
a `ModelDeployment` exposes the Kubernetes scale subresource, `kubectl scale` and
KEDA work without anything extra. There's no per-pod autoscaling inside a cluster.
{{< /qa >}}

{{< qa "How are model weights handled?" >}}
A `ModelCache` stages weights once per cluster on shared (ReadWriteMany) storage,
and every pod reads them locally. Pods don't re-download on each start, and
concurrent starts don't race. It hydrates from Hugging Face today, is optional for
single-node deployments, and is recommended for multi-node ones.
{{< /qa >}}

## The project

{{< qa "Why did you pick Modelplane as a name for the project?" >}}
It's a fusion of AI Model and Control Plane. We also like that it implies that AI models
are their own layer (or plane) in the stack.
{{< /qa >}}

{{< qa "What does the logo signify?" >}}
Three popsicle sticks assembled to make a model plane. Balsa wood planes were the inspiration.
{{< /qa >}}

{{< qa "Is Modelplane production-ready?" >}}
Modelplane is in early development and moving fast. Treat it as early software. The
[platform docs]({{< ref "/platform" >}}) are specific about what's available today
versus what's planned. We are building it in the open.
{{< /qa >}}

{{< qa "What's the license and governance?" >}}
Modelplane is [Apache 2.0](https://github.com/modelplaneai/modelplane/blob/main/LICENSE),
with no usage caps or token metering, and is developed in the open. It's neutral
across models, engines, accelerators, and clouds, and is intended for donation to
a neutral open source foundation. It's a project from Upbound, the team behind Rook
and Crossplane, both CNCF Graduated and widely adopted projects.
{{< /qa >}}

{{< qa "How do I get involved?" >}}
Issues, discussions, and contributions are welcome on
[GitHub](https://github.com/modelplaneai/modelplane). See `CONTRIBUTING.md` for
development setup and the project's conventions.
{{< /qa >}}

## Next steps

{{< cardgroup cols="2" >}}
{{< card title="Get started" href="/getting-started/" >}}
Deploy Modelplane and serve your first model.
{{< /card >}}
{{< card title="How Modelplane works" href="/overview/how-it-works/" >}}
The architecture and the control loop, in one page.
{{< /card >}}
{{< /cardgroup >}}
<!-- vale write-good.TooWordy = YES -->
<!-- vale write-good.Passive = YES -->


---

# Glossary

Source: https://docs.modelplane.ai/overview/glossary/


## Modelplane

The open source control plane software. You install Modelplane on a Kubernetes
cluster (the **control cluster**). Modelplane never serves tokens itself; it
orchestrates the clusters and engines that do.

## Control cluster

The Kubernetes cluster where Modelplane runs. It needs no GPUs. It holds
Modelplane's Crossplane-based components and the API resources you apply to
declare your fleet.

## Inference cluster

A GPU cluster in the fleet where serving engines run and tokens are produced.
Modelplane can provision inference clusters on EKS, GKE, and other providers, or
you can bring your own through an `InferenceCluster` with `source: Existing`.

## Fleet

All inference clusters managed by a single Modelplane control cluster.

## Platform

The inference infrastructure the platform team provisions using
`InferenceGateway`, `InferenceClass`, and `InferenceCluster` resources. This is
distinct from Modelplane itself, which runs on the control cluster above the
fleet.

## Platform team

The infrastructure team responsible for GPU capacity. They create
`InferenceCluster`, `InferenceClass`, and `InferenceGateway` resources,
provisioning the fleet that ML teams deploy against.

<!-- vale Google.Headings = NO -->
<!-- vale Microsoft.HeadingAcronyms = NO -->
## ML team
<!-- vale Google.Headings = YES -->
<!-- vale Microsoft.HeadingAcronyms = YES -->

The development team deploying models. They create `ModelDeployment`,
`ModelService`, and `ModelCache` resources, declaring what a model needs without
knowing which cluster it runs on.


---

# AI tools

Source: https://docs.modelplane.ai/overview/ai-tools/

<!-- vale write-good.TooWordy = NO -->
The Modelplane docs are built to be read by AI assistants as well as people. You
can connect a coding agent directly to this site, pull any page as Markdown, or
point a model at a single index file that lists the whole documentation set.
Every page also carries a **Copy page** menu next to its title with the same
shortcuts.

## Connect to the MCP server

The documentation MCP server lets an assistant search these docs and read any
page in real time, so its answers track the current content instead of its
training data. It exposes two tools:

- `search_modelplane_docs`: search the docs and get back the most relevant sections with their titles, URLs, and snippets.
- `get_modelplane_doc`: fetch the full Markdown of a single page.

The server URL is:

```plaintext
https://docs.modelplane.ai/mcp
```

{{< tabs >}}
{{< tab "Claude Code" >}}
```bash
claude mcp add --transport http modelplane-docs https://docs.modelplane.ai/mcp
```
{{< /tab >}}
{{< tab "Claude Desktop" >}}
Open Settings, go to Connectors, and choose **Add custom connector**. Name it `modelplane-docs`, enter the server URL above, and enable the connector when you start a conversation.
{{< /tab >}}
{{< tab "Cursor" >}}
<!-- vale Google.Colons = NO -->
Open the command palette, run **Cursor Settings: MCP**, and add a server to `mcp.json`:
<!-- vale Google.Colons = YES -->

```json
{
  "mcpServers": {
    "modelplane-docs": {
      "url": "https://docs.modelplane.ai/mcp"
    }
  }
}
```
{{< /tab >}}
{{< tab "VS Code" >}}
Create `.vscode/mcp.json` in your workspace:

```json
{
  "servers": {
    "modelplane-docs": {
      "type": "http",
      "url": "https://docs.modelplane.ai/mcp"
    }
  }
}
```
{{< /tab >}}
{{< tab "Other" >}}
Any MCP client that speaks the streamable HTTP transport can connect to the server URL directly. No authentication is required.
{{< /tab >}}
{{< /tabs >}}

The **Copy page** menu on every page also has **Connect to Cursor** and **Connect to VS Code** shortcuts that install the server in one click.

## Read pages as Markdown

Every page is also published as raw Markdown. Add `index.md` to any page URL:

```plaintext
https://docs.modelplane.ai/models/model-deployment/index.md
```

The **Copy page** control next to each title copies that Markdown to your clipboard, and **View as Markdown** opens it in the browser. Paste it into any assistant when you want to ground a question in a specific page.
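
If a script needs the Markdown URL for a page, the rule above is easy to apply in code. The sketch below is a small helper under that rule; the function name `markdown_url` is illustrative, and dropping any query string or fragment is a choice made here, not something the site requires.

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_url(page_url: str) -> str:
    """Return the raw-Markdown URL for a docs page by appending index.md."""
    parts = urlsplit(page_url)
    # Normalize the trailing slash so both /page and /page/ map to /page/index.md.
    path = parts.path.rstrip("/") + "/index.md"
    # Query and fragment are dropped; they don't apply to the Markdown file.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Passing `https://docs.modelplane.ai/models/model-deployment/` returns the URL shown above, which you can then fetch with any HTTP client.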

## llms.txt

For tools that index a whole site, the docs publish the [`llms.txt`](https://llmstxt.org) format:

- [`llms.txt`](/llms.txt): a short index of every page with links and descriptions.
- [`llms-full.txt`](/llms-full.txt): every page concatenated into one Markdown file.

## Page menu reference

The **Copy page** menu next to each title has these actions:

{{< table >}}
| Action | What it does |
|---|---|
| Copy page | Copies the page as Markdown to your clipboard. |
| View as Markdown | Opens the page as raw Markdown. |
| Copy MCP Server | Copies the MCP server URL to your clipboard. |
| Connect to Cursor | Installs the MCP server in Cursor. |
| Connect to VS Code | Installs the MCP server in VS Code. |
{{< /table >}}

<!-- vale write-good.TooWordy = YES -->


---

# Architecture

Source: https://docs.modelplane.ai/architecture/

<!-- vale write-good.Passive = NO -->
Modelplane's central design choice is to build the control plane on
[Crossplane](https://crossplane.io) rather than as a bespoke set of Kubernetes
controllers. Everything else here follows from that. This section assumes you're
comfortable with Kubernetes; the rest of the Crossplane vocabulary you need is
below.

## Crossplane in brief

[Crossplane](https://crossplane.io) extends Kubernetes to manage things beyond
the cluster (cloud infrastructure, SaaS, and in Modelplane's case inference
fleets) through the same declarative, reconciled API model. Three of its concepts
matter here:

- **Composite Resources (XRs)** are custom resources whose controller, instead of
  talking to an external API directly, declares a set of other resources that
  should exist. Every Modelplane API, including `InferenceCluster`,
  `ModelDeployment`, and `ModelService`, is an XR.
- **Composition functions** are that controller logic. A function is a small gRPC
  service handed the observed XR and the resources it depends on, which returns
  the desired child resources. An XR runs a pipeline of one or more functions
  every reconcile; in Modelplane each is typically a single function, so the rest
  of this section says "the function" for short.
- **Providers** are controllers that manage external systems through their own
  managed resources: `provider-gcp` and `provider-aws` for cloud APIs,
  `provider-helm` for Helm releases, `provider-kubernetes` for arbitrary objects
  on any cluster. A composition function composes these like any other resource.

Put together: a Modelplane API is an XR, its logic is a composition function, and
the function composes a mix of plain Kubernetes objects, other Modelplane XRs, and
provider resources.

The resource model mirrors Kubernetes core, one scope up:
`ModelDeployment` → `ModelReplica` → `ModelService` → `ModelEndpoint` parallels
`Deployment` → `Pod` → `Service` → `Endpoint`, but across a fleet of clusters
rather than within one. A `ModelDeployment` composes a `ModelReplica` per replica,
a `ModelReplica` composes the serving workload on its target cluster, and a
`ModelService` routes across the `ModelEndpoint`s. If you know how those core
objects relate, you already know the shape of Modelplane's.

## Why Crossplane?

Modelplane is, at its core, a system that turns declarative resources into
composed infrastructure spanning cloud accounts, many Kubernetes clusters, and
the workloads on them. That's the problem Crossplane solves, and it helps in two
ways: providers and functions.

**Providers** give us reach. Modelplane has to provision Kubernetes clusters and
all the infrastructure they need across different clouds, then install software
onto them. That's an enormous surface, and providers cover it without us rolling
our own controllers for each cloud API and Helm release.

**Functions** are where Modelplane's own logic lives, and writing it as
composition functions buys several things:

- **Business logic, not controller plumbing.** A function computes desired state
  from observed state. Crossplane handles the fiddly Kubernetes controller
  details (the watches, requeues, finalizers, and drift correction) that a
  hand-written controller gets wrong in a dozen subtle ways. Less plumbing to
  write and maintain means we move faster.
- **Testability.** A function is a pure function of its inputs, so you can test
  it as a black box: feed it an XR and its dependencies, assert on the resources
  it returns. The whole test runs in process, with no API server to stand up.
- **The right language for each job.** Functions can be written in any language.
  Modelplane's are Python, for fast iteration on the serving and scheduling logic
  and because Python is the common language of the ML world, which lowers the bar
  for contributors. The performance-sensitive distributed-systems core stays in
  Go, where Crossplane and its providers already are.

The bet underneath both is that inference infrastructure is the same shape of
problem as cloud infrastructure, which Crossplane manages well. Building on it
lets Modelplane spend its effort on the part that's actually inference-specific.

## The control cluster and the fleet

Modelplane runs on a **control cluster** and manages a fleet of **workload
clusters**, the `InferenceCluster`s. The split is deliberate: the control plane
holds no GPUs and serves no tokens. It schedules, composes, and routes; the
workload clusters do the serving.

The control cluster runs Crossplane, the Modelplane composition functions (one
per resource, each a pod Crossplane calls per reconcile), the providers, and the
control-plane gateway. It also holds every Modelplane resource and the
`ProviderConfig`s that let the providers reach each workload cluster, built from
that cluster's kubeconfig.

Crossplane core drives everything. On each reconcile it asks a function what a
resource should compose and gets back the desired resources. Core then reconciles
them, applying the provider resources that the providers act on. A function only
computes desired state. It never reaches a provider or a cluster itself.

```mermaid
flowchart TB
    subgraph control["Control cluster"]
        cp["Crossplane core"]
        fns["Modelplane functions\n(one pod per resource)"]
        prov["Providers\ngcp · aws · helm · kubernetes"]
        gw["Control-plane gateway"]
    end
    subgraph fleet["Fleet"]
        wc1["Workload cluster A"]
        wc2["Workload cluster B"]
    end
    cp <-->|"desired state (gRPC)"| fns
    cp -->|composes| prov
    cp -->|composes| gw
    prov -->|provision + install via kubeconfig| wc1
    prov -->|provision + install via kubeconfig| wc2
```

Modelplane installs a serving stack on each workload cluster: the components a
cluster needs to serve models, providing inference-aware routing through Gateway
API, multi-node serving, GPU binding through DRA, and observability, among others.
The exact components evolve, but Modelplane composes and owns all of them. For
provisioned clusters the providers also create the cluster and its node pools
first.

## How a deployment is composed

A resource composes others, which compose others, until the tree bottoms out in
provider resources and plain Kubernetes objects. A `ModelDeployment` is the
clearest example. Its function schedules the replicas, then composes a
`ModelReplica` for each, and a `ModelEndpoint` for each replica that's ready to
serve. Each `ModelReplica` function composes the serving workload, a Deployment or
a LeaderWorkerSet, onto its target workload cluster through provider-kubernetes.

```mermaid
flowchart TD
    md["ModelDeployment"]
    mr1["ModelReplica\n(cluster A)"]
    mr2["ModelReplica\n(cluster B)"]
    me1["ModelEndpoint\n(cluster A)"]
    me2["ModelEndpoint\n(cluster B)"]
    wl1["Deployment / LeaderWorkerSet\non workload cluster A"]
    wl2["Deployment / LeaderWorkerSet\non workload cluster B"]

    md --> mr1
    md --> mr2
    md --> me1
    md --> me2
    mr1 --> wl1
    mr2 --> wl2
```
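
To make the shape of that composition concrete, here is a deliberately simplified sketch in Python. It is not Modelplane's function and doesn't use Crossplane's function SDK or its gRPC request and response types. The `compose_model_deployment` name, the dictionary resource shapes, and the `<name>-<index>` naming scheme are assumptions for illustration. What it does show is the property the design relies on: desired children are computed purely from observed inputs.

```python
from typing import Dict, List


def compose_model_deployment(
    name: str,
    placements: List[str],
    ready: Dict[str, bool],
) -> List[dict]:
    """Compute desired children for a deployment from observed state only.

    placements: the target cluster chosen for each replica, in replica order.
    ready: observed readiness per replica name (missing means not ready).
    """
    desired = []
    for index, cluster in enumerate(placements):
        replica_name = f"{name}-{index}"  # hypothetical naming scheme
        desired.append(
            {"kind": "ModelReplica", "name": replica_name, "cluster": cluster}
        )
        # An endpoint is composed only once its replica is ready to serve.
        if ready.get(replica_name, False):
            desired.append(
                {"kind": "ModelEndpoint", "name": replica_name, "cluster": cluster}
            )
    return desired
```

Because the function has no side effects, a test can call it directly with a made-up deployment and assert on the returned list, with no API server involved.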

The platform resources compose the same way. An `InferenceCluster` composes a
`GKECluster` or `EKSCluster` (the cloud infrastructure, via the cloud providers)
and a `ServingStack` (the per-cluster software install, via provider-helm and
provider-kubernetes). Engines bind GPUs through DRA: each `claim: DRA` device in a
member's `nodeSelector` becomes a request in the `ResourceClaim` the serving pods
claim through.

## The request path

A served request crosses two gateways, both built on Gateway API. The
**control-plane gateway** is the front door: a `ModelService` composes an
`HTTPRoute` on it that matches the service's path prefix and forwards to the
matched `ModelEndpoint`s, each of which is a `Service` pointing at a workload
cluster's gateway address. The **workload-cluster gateway** then routes from the
cluster edge to the engine pods.

```mermaid
flowchart LR
    client["Client"]
    cpgw["Control-plane gateway"]
    wcgw["Workload-cluster gateway"]
    engine["Engine pods\n(vLLM, SGLang, ...)"]

    client -->|service path| cpgw
    cpgw -->|per-replica path| wcgw
    wcgw -->|engine path| engine
```

Each hop rewrites the path: the control plane rewrites the public prefix to the
replica's path, and the workload gateway strips that down to what the engine
serves. This per-backend path rewriting is the main thing the control-plane
gateway has to support, and it narrows which Gateway API implementations can fill
the role.
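
The two rewrites can be pictured as the same prefix swap applied twice. The Python sketch below only illustrates that idea. The gateways do this through Gateway API configuration, not application code, and the `/replicas/qwen-0` per-replica path is a made-up placeholder, since the actual internal path layout isn't part of the API.

```python
def rewrite_prefix(path: str, match_prefix: str, replacement: str) -> str:
    """Swap a leading path prefix for another, matching on whole segments."""
    prefix = match_prefix.rstrip("/")
    # Match only at a segment boundary so /ml-team/qwen doesn't match /ml-team/qwen2.
    if path != prefix and not path.startswith(prefix + "/"):
        raise ValueError(f"{path!r} is not under {match_prefix!r}")
    remainder = path[len(prefix):].lstrip("/")
    return replacement.rstrip("/") + "/" + remainder


# Hop 1: the control-plane gateway rewrites the public prefix to the replica's path.
public_path = "/ml-team/qwen/v1/chat/completions"
replica_path = rewrite_prefix(public_path, "/ml-team/qwen", "/replicas/qwen-0")

# Hop 2: the workload gateway strips the replica prefix down to what the engine serves.
engine_path = rewrite_prefix(replica_path, "/replicas/qwen-0", "/")
```

The first hop has to choose a different replacement per backend, which is the per-backend rewriting the control-plane gateway must support.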

Which gateway sits at each layer is internal, not part of the API. The
[`InferenceGateway`]({{< ref "/platform/inference-gateway.md" >}}) `backend` field
is an enum precisely so the control-plane gateway can grow other options over
time. Target the `ModelService` URL rather than either gateway directly.
<!-- vale write-good.Passive = YES -->


---

# Llama-3.1-8B

Source: https://docs.modelplane.ai/examples/llama-3.1-8b/

<!-- vale write-good.Passive = NO -->
An 8B dense chat model on a single NVIDIA L4. The entry recipe: one `Standalone`
engine, no cache, public weights from a Hugging Face mirror. It carries no
`clusterSelector`, so device capacity alone matches it to any compatible L4 in
the fleet.

This recipe was run end to end on GKE; the `InferenceClass`, `InferenceCluster`,
and `ModelDeployment` are the exact manifests from that run. The EKS platform
shape is the standard single-L4 recipe. It passes server validation but was not
served in this run. Apply the platform side first, then the ML side. The GKE
`InferenceCluster` carries a GCP project placeholder to edit before applying.

## Platform

{{< tabs >}}
{{< tab "EKS" >}}
{{< manifests "examples/llama-3.1-8b/inference-class-eks.yaml" >}}

{{< manifests "examples/llama-3.1-8b/inference-cluster-eks.yaml" >}}
{{< /tab >}}
{{< tab "GKE" >}}
{{< manifests "examples/llama-3.1-8b/inference-class-gke.yaml" >}}

{{< manifests path="examples/llama-3.1-8b/inference-cluster-gke.yaml" apply="false" >}}

{{< editCode >}}
```bash
curl -fsSL {{< manifest-url "examples/llama-3.1-8b/inference-cluster-gke.yaml" >}} \
  | sed 's/my-gcp-project/$@<your-gcp-project-id>$@/' \
  | kubectl apply -f -
```
{{< /editCode >}}
{{< /tab >}}
{{< /tabs >}}

## Deployment

{{< manifests "examples/llama-3.1-8b/model-deployment.yaml" >}}

{{< manifests "examples/llama-3.1-8b/model-service.yaml" >}}
<!-- vale write-good.Passive = YES -->


---

# Route to External Providers

Source: https://docs.modelplane.ai/models/model-endpoint/

**API:** [`modelplane.ai/v1alpha1` · ModelEndpoint]({{< ref "/reference/modelendpoints" >}})
<!-- vale write-good.Passive = NO -->
A `ModelEndpoint` is a single reachable inference endpoint that a
[`ModelService`]({{< ref "model-service.md" >}}) can route to. Modelplane creates
one for each of your replicas automatically, but you can also create one by hand
to point at an inference endpoint Modelplane doesn't run, most often a SaaS
provider like Together or Baseten. A service treats both the same, so you can
front your own replicas and an external provider behind one URL: send overflow to
the provider when your fleet is busy, or fail over to it as a break-glass option.

## Routing to an external provider

Create a `ModelEndpoint` with three things:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelEndpoint
metadata:
  name: kimi-k2-together
  namespace: ml-team
  labels:
    # 1. A label of your own for a ModelService to select on. Any label
    #    works; modelplane.ai/external-provider is a readable convention.
    modelplane.ai/external-provider: together
spec:
  # 2. The provider's base URL.
  url: https://api.together.xyz/
  # 3. The path to rewrite requests to. A ModelService receives requests at
  #    /<namespace>/<service>/v1/... and rewrites them to this prefix, so an
  #    OpenAI-compatible provider that serves /v1/... takes /v1/.
  rewritePath: /v1/
```

Then point a [`ModelService`]({{< ref "model-service.md" >}}) at it. Selecting
`modelplane.ai/external-provider: together` routes to the provider; adding a
second entry for a deployment fronts both behind one URL, so traffic can spill
over to the provider alongside your own replicas:

```yaml {nocopy=true}
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: kimi-k2
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: kimi-k2          # your own replicas
  - selector:
      matchLabels:
        modelplane.ai/external-provider: together  # the endpoint above
```

The provider must speak the OpenAI API, since that's the contract a
`ModelService` exposes. Anything OpenAI-compatible works; `url` and `rewritePath`
are all that change between providers.
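
The selection semantics of the `ModelService` above can be sketched in Python. This assumes `matchLabels` behaves as it does elsewhere in Kubernetes (every listed label must match) and that the service routes to the union of what each `endpoints` entry selects. The `select_endpoints` helper and the endpoint names are illustrative, not part of Modelplane.

```python
from typing import Dict, List


def select_endpoints(
    endpoints: List[dict],
    selectors: List[Dict[str, str]],
) -> List[str]:
    """Return names of endpoints matched by at least one matchLabels selector."""
    selected = []
    for endpoint in endpoints:
        labels = endpoint["labels"]
        # An endpoint matches a selector when every selector label is present
        # with the same value; it is routed to if any selector matches.
        if any(
            all(labels.get(key) == value for key, value in selector.items())
            for selector in selectors
        ):
            selected.append(endpoint["name"])
    return selected


endpoints = [
    # Hypothetical endpoint created for one of your own replicas.
    {"name": "kimi-k2-0", "labels": {"modelplane.ai/deployment": "kimi-k2"}},
    # The hand-written external endpoint from the example above.
    {"name": "kimi-k2-together", "labels": {"modelplane.ai/external-provider": "together"}},
    # An unrelated endpoint that neither selector should pick up.
    {"name": "qwen-0", "labels": {"modelplane.ai/deployment": "qwen"}},
]
selectors = [
    {"modelplane.ai/deployment": "kimi-k2"},
    {"modelplane.ai/external-provider": "together"},
]
routed = select_endpoints(endpoints, selectors)
```

With both selector entries, `routed` contains your own replica's endpoint and the Together endpoint. Dropping the first entry leaves only the external provider.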
<!-- vale write-good.Passive = YES -->

## Example

{{< manifests "concepts/model-endpoint.yaml" >}}


---

# Scale the platform

Source: https://docs.modelplane.ai/getting-started/scale-the-platform/


You have one L4 cluster with a running model. In this guide, you'll add two
larger-GPU clusters in different regions to grow the fleet available to the ML team.

Provisioning two more clusters takes about 10 to 15 minutes.

## Register more clusters

{{< tabs >}}
{{< tab "EKS" >}}
Register two more clusters with a bigger hardware class: `L40S` (`48 Gi`) in
`us-west` and `eu-central`:

{{< manifests "getting-started/eks/platform-scale.yaml" >}}

{{< hint "note" >}}
`g6e.xlarge` runs ~$2/hr on demand. Two of them plus the `L4` from earlier is a
few dollars for this tour. Clean up when you're done (see [Clean
up]({{< ref "getting-started/clean-up.md" >}})).
{{< /hint >}}
{{< /tab >}}
{{< tab "GKE" >}}
Register two more clusters with a bigger hardware class: `A100` (`40 Gi`) in
`us-west` and `us-east`. Apply the manifest, setting each cluster's `project` to
your GCP project:

{{< manifests path="getting-started/gke/platform-scale.yaml" apply="false" >}}

{{< editCode >}}
```bash
curl -fsSL {{< manifest-url "getting-started/gke/platform-scale.yaml" >}} \
  | sed 's/my-gcp-project/$@<your-gcp-project>$@/g' \
  | kubectl apply -f -
```
{{< /editCode >}}

{{< hint "note" >}}
`a2-highgpu-1g` runs ~$3.50/hr on demand. Two of them plus the `L4` from earlier
is a few dollars for this tour. Clean up when you're done (see [Clean
up]({{< ref "getting-started/clean-up.md" >}})).
{{< /hint >}}
{{< /tab >}}
{{< /tabs >}}

Modelplane provisions both clusters in parallel. Wait until every cluster reports
`Ready`:

```bash
kubectl wait --for=condition=Ready ic --all --timeout=20m
```

## Your model keeps running

Growing the fleet doesn't disturb anything already deployed. `qwen-demo` stays
on its original cluster, and the two new clusters add capacity the moment
they're `Ready`, with no interruption for the ML team. A replica only moves if
its deployment changes in a way that no longer fits where it runs.

## Next step

The fleet now spans three clusters across three regions. The ML team is next. [Scale the model]({{< ref "getting-started/scale-the-model.md" >}}) to serve it from two regions behind a single endpoint.


---

# Supported Providers

Source: https://docs.modelplane.ai/platform/providers/

Modelplane is built on [Crossplane](https://crossplane.io) and shares its
infrastructure providers, so the set of clouds and neoclouds it reaches grows
alongside Crossplane itself. This page shows where Modelplane runs today and
where it's headed.

A provider can show up here in three ways:

{{< hint "note" >}}
- **Provisioning supported.** Modelplane creates and manages the whole cluster
  from an `InferenceCluster`, selected through `provisioning.provider`. GKE and
  EKS work this way today.
- **Bring your own supported.** Register a cluster you already run with
  `source: Existing`. This works on any provider whose Kubernetes meets
  Modelplane's requirements (Dynamic Resource Allocation and a recent Kubernetes
  version), so you can run on the providers below now, ahead of native
  provisioning.
- **Crossplane provider exists.** A Crossplane provider is published for the
  cloud. That provider is the path by which native provisioning lands, so it
  marks where Modelplane can grow next.
{{< /hint >}}
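
Before registering an existing cluster, you can check for the first of the
bring-your-own requirements. The sketch below is one way to do it, assuming
`kubectl` points at the workload cluster. `has_dra` is a hypothetical helper, and
it only tests whether the cluster serves the `resource.k8s.io` API group that
Dynamic Resource Allocation uses. It doesn't check the Kubernetes version or
anything else Modelplane may need.

```bash
# Hypothetical helper: succeeds if the cluster in the current kubectl context
# serves ResourceClaims from the Dynamic Resource Allocation API group.
has_dra() {
  kubectl api-resources --api-group=resource.k8s.io -o name 2>/dev/null \
    | grep -q '^resourceclaims\.'
}

if has_dra; then
  echo "DRA API is served by this cluster"
else
  echo "DRA API not found (or kubectl could not reach the cluster)"
fi
```

A passing check is a necessary signal rather than a sufficient one: a cluster can
serve the API group and still lack the device drivers that publish GPUs through
it.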

## Clouds and neoclouds

Listed alphabetically, spanning hyperscalers and GPU-specialist neoclouds. Each
runs a managed Kubernetes service with GPU node pools, so the bring-your-own path
covers them all today. Where a Crossplane provider exists, it's the path to
native provisioning.

{{< table >}}
| Provider / service | Accelerators | Provisioning | BYO | Crossplane |
|---|---|---|---|---|
| Alibaba Cloud (ACK) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-alibabacloud" "provider-upjet-alibabacloud" "community" >}} |
| AWS (EKS) | {{< accel nvidia >}} {{< accel trainium >}} | ✓ | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-aws" "provider-upjet-aws" "community" >}} |
| Civo (K3s) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-civo" "provider-civo" "community" >}} |
| CoreWeave (CKS) | {{< accel nvidia >}} | Planned | ✓ | none yet |
| Crusoe (CMK) | {{< accel nvidia >}} {{< accel amd >}} | Planned | ✓ | none yet |
| DigitalOcean (DOKS) | {{< accel nvidia >}} {{< accel amd >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-digitalocean" "provider-upjet-digitalocean" "community" >}} |
| Fluidstack | {{< accel nvidia >}} | Planned | ✓ | none yet |
| Google Cloud (GKE) | {{< accel nvidia >}} {{< accel tpu >}} | ✓ | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-gcp" "provider-upjet-gcp" "community" >}} |
| Huawei Cloud (CCE) | {{< accel nvidia >}} {{< accel ascend >}} | Planned | ✓ | {{< repolink "https://github.com/huaweicloud/provider-huaweicloud" "provider-huaweicloud" "alpha" >}} |
| IBM Cloud (IKS) | {{< accel nvidia >}} | Planned | ✓ | none active |
| Lambda | {{< accel nvidia >}} | Planned | ✓ | none yet |
| Linode / Akamai (LKE) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/linode/provider-linode" "provider-linode" "official" >}} |
| Microsoft Azure (AKS) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-upjet-azure" "provider-upjet-azure" "community" >}} |
| Nebius | {{< accel nvidia >}} | Planned | ✓ | none yet |
| Oracle Cloud (OKE) | {{< accel nvidia >}} {{< accel amd >}} | Planned | ✓ | {{< repolink "https://github.com/oracle/crossplane-provider-oci" "crossplane-provider-oci" "official" >}} |
| OVHcloud | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/edixos/provider-ovh" "edixos/provider-ovh" "community" >}} |
| Scaleway (Kapsule) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/scaleway/crossplane-provider-scaleway" "crossplane-provider-scaleway" "official" >}} |
| Tencent Cloud (TKE) | {{< accel nvidia >}} | Planned | ✓ | {{< repolink "https://github.com/crossplane-contrib/provider-tencentcloud" "provider-tencentcloud" "community" >}} |
| Voltage Park | {{< accel nvidia >}} | Planned | ✓ | none yet |
| Vultr (VKE) | {{< accel nvidia >}} {{< accel amd >}} | Planned | ✓ | {{< repolink "https://github.com/vultr/crossplane-provider-vultr" "crossplane-provider-vultr" "official" >}} |
{{< /table >}}

{{< hint "note" >}}
**On-premises and bare metal.** Bring an on-prem cluster the same way as any
other: stand up Kubernetes on your own hardware (like NVIDIA DGX BasePOD
or SuperPOD) with NVIDIA Base Command Manager, Run:ai, or your own tooling, then
register it with `source: Existing`. Provisioning it for you is on the roadmap
too: Modelplane can drive NVIDIA Base Command Manager or other bare-metal
Kubernetes provisioners through Crossplane, the same pattern it uses in the
cloud.
<!-- vale ai-tells.ShipOveruse = NO -->

{{< /hint >}}
Native provisioning expands as more Crossplane providers ship; until then, the
bring-your-own path runs Modelplane on any conformant Kubernetes cluster today.

{{< hint "tip" >}}
<!-- vale ai-tells.ShipOveruse = YES -->

Don't see your cloud or neocloud, or want to be added?
[Open an issue](https://github.com/modelplaneai/modelplane/issues/new) and we'll
track it.
{{< /hint >}}

{{< cardgroup cols="2" >}}
{{< card title="Register a Cluster" href="/platform/inference-cluster/" >}}
Add a cluster to Modelplane, provisioned or bring-your-own.
{{< /card >}}
{{< card title="Define Hardware Classes" href="/platform/inference-class/" >}}
Describe the GPUs and provisioning recipe each node pool uses.
{{< /card >}}
{{< /cardgroup >}}


---

# API Reference

Source: https://docs.modelplane.ai/reference/


Modelplane's API is a set of Kubernetes custom resources. Each type below has
its own page with the full spec and status schema, a runnable example, and
fields you can link to directly. For release history, see the
[GitHub releases page](https://github.com/modelplaneai/modelplane/releases).


---

# Scale the model

Source: https://docs.modelplane.ai/getting-started/scale-the-model/

A `ModelService` can front more than one `ModelDeployment`. Here you add a second
deployment, pinned to a different region, and point the same service at both. The
endpoint you already curled stays the same. Behind it, traffic now load-balances
across two regions.

```mermaid
graph LR
    subgraph fleet ["Fleet"]
        IC1["us-east\nL4"]
        IC2["us-west\nlarger GPU"]
    end

    subgraph ml ["ML team"]
        MD1["ModelDeployment\nqwen-demo"]
        MD2["ModelDeployment\nqwen-west\nclusterSelector: us-west"]
        MS["ModelService qwen\n/ml-team/qwen/v1/..."]
    end

    IC1 --> MD1
    IC2 --> MD2
    MD1 --> MS
    MD2 --> MS
```

## Deploy to a second region

The new deployment uses a `clusterSelector` to pin its replica to the `us-west`
cluster you added in the last step, and selects the larger GPU there:

{{< tabs >}}
{{< tab "EKS" >}}
{{< manifests "getting-started/eks/model-deployment-west.yaml" >}}
{{< /tab >}}
{{< tab "GKE" >}}
{{< manifests "getting-started/gke/model-deployment-west.yaml" >}}
{{< /tab >}}
{{< /tabs >}}

Wait until its replica is `Ready`, then check placement. You now have one replica
per region:

```bash
kubectl get modelreplica -n ml-team
```

```shell {nocopy=true}
NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   42m
qwen-west-92535   eks-us-west   True     True    modelreplicas.modelplane.ai   8m
```
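
If you'd like to confirm the one-replica-per-region placement in a script rather
than by eye, a small filter over that table is enough. This is a minimal sketch:
`ready_per_cluster` is a hypothetical helper, and it assumes the column order
shown above (`CLUSTER` second, `READY` fourth). The heredoc reuses the sample
output; against a live fleet you would pipe in `kubectl get modelreplica -n
ml-team` instead.

```bash
# Hypothetical helper: reads `kubectl get modelreplica` table output on stdin
# and prints "<cluster> <count of Ready replicas>" per cluster, sorted by name.
ready_per_cluster() {
  awk 'NR > 1 && $4 == "True" { n[$2]++ } END { for (c in n) print c, n[c] }' | sort
}

ready_per_cluster <<'EOF'
NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   42m
qwen-west-92535   eks-us-west   True     True    modelreplicas.modelplane.ai   8m
EOF
# prints:
# eks-us-east 1
# eks-us-west 1
```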

## Front both with one service

Update the `ModelService` to select both deployments. Each entry in
`spec.endpoints` adds its matching replicas to the same endpoint:

{{< manifests "getting-started/model-service-multi.yaml" >}}

The endpoint URL doesn't change. Clients that had this URL before still have it;
they don't know the fleet changed. The gateway load-balances across both regions,
and if one region goes down, the other keeps serving. Send the same request as before:

```bash
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
```

```bash
kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'
```

## That's the tour

You stood up a control plane, built a multi-region GPU fleet, deployed a model
across it, and ended with one stable endpoint serving requests. The platform
team published hardware. The ML team described what the model needs. Modelplane
placed the deployments and served them behind a single endpoint.

[Clean up]({{< ref "getting-started/clean-up.md" >}}) tears everything down
when you're done.

For more on the resources you used:

* [InferenceClass]({{< ref "platform/inference-class.md" >}})
* [InferenceCluster]({{< ref "platform/inference-cluster.md" >}})
* [ModelDeployment]({{< ref "models/model-deployment.md" >}})
* [ModelService]({{< ref "models/model-service.md" >}})

Modelplane is in active development and we're building in the open. If you're
running your own inference fleet and want to shape where this goes, we'd love to
hear from you. Star the [repository](https://github.com/modelplaneai/modelplane),
join us in [Slack](https://slack.crossplane.io), or read the
[manifesto](https://modelplane.ai).


---

# Clean up

Source: https://docs.modelplane.ai/getting-started/clean-up/

Delete the model resources first, then the clusters, and finally the control plane.

## Delete model resources

Delete model resources before clusters. Deleting a cluster first leaves the
deployments reconciling against infrastructure that no longer exists.

```bash
kubectl delete md --all -n ml-team
kubectl delete ms --all -n ml-team
```

Wait until all model replicas are deleted:

```bash
kubectl get modelreplica -n ml-team --watch
```
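
The watch runs until you interrupt it. If you'd rather block in a script,
`kubectl wait --for=delete` is an alternative. A minimal sketch, where
`wait_for_replicas_gone` is a hypothetical helper and the 10-minute default
timeout is an assumption, not a documented figure:

```bash
# Hypothetical helper: blocks until every ModelReplica in a namespace is
# deleted, or until the timeout expires.
# Usage: wait_for_replicas_gone NAMESPACE [TIMEOUT]
wait_for_replicas_gone() {
  ns=$1
  timeout=${2:-10m}   # assumed default; pick a value that fits your fleet
  kubectl wait --for=delete modelreplica --all -n "$ns" --timeout="$timeout"
}
```

Call it as `wait_for_replicas_gone ml-team`. If the replicas are already gone,
`kubectl wait` may report that no matching resources were found and exit
non-zero, so treat that case as success before moving on to the clusters.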

## Delete the clusters

Delete all clusters with foreground cascading deletion. The serving stack on each
workload cluster must uninstall while that cluster's API server is still
reachable. Foreground deletion holds each cluster object until its stack
finishes. Background deletion can orphan cloud resources.

```bash
kubectl delete ic --all --cascade=foreground
```

Wait until all clusters are deleted:

```bash
kubectl get ic --watch
```

## Delete the control plane

Delete the kind cluster:

```bash
kind delete cluster --name modelplane
```


---

# EKSCluster

Source: https://docs.modelplane.ai/reference/eksclusters/

An EKSCluster provisions an EKS cluster with dedicated node groups for GPU inference and system workloads. It outputs a Secret containing the cluster kubeconfig that consumers use to target the cluster. The kubeconfig embeds a static bearer token that the AWS provider refreshes.

---

# GKECluster

Source: https://docs.modelplane.ai/reference/gkeclusters/

A GKECluster provisions a GKE cluster with dedicated node pools for GPU inference and system workloads. It outputs secrets containing the cluster kubeconfig and a GCP service account key that consumers can use to target the cluster.

---

# ServingStack

Source: https://docs.modelplane.ai/reference/servingstacks/

A ServingStack installs the serving substrate (LeaderWorkerSet, Gateway API, cert-manager, Prometheus) on a Kubernetes cluster.