Overview on Modelplane Docs

Why Modelplane

Open-weight models are becoming the preferred choice for organizations: they can be post-trained, including with reinforcement learning, to compete with frontier models, and they put cost, governance, and data sovereignty back under the organization’s control. As adoption grows, platform teams are increasingly asked to provide GPU inference to their ML and development teams the same way they already provide cloud infrastructure.

How Modelplane works

Modelplane runs as a control plane on its own cluster, the control cluster, above the inference clusters that actually serve models. It’s built on Crossplane: platform teams and developers describe what they want as Kubernetes resources, and Modelplane continuously reconciles the fleet to match, composing the clusters, scheduling replicas, and exposing endpoints. This page is the full tour. It covers the architecture and resources, then walks through what happens when you deploy a model.
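The reconcile idea described above can be illustrated with a minimal Python sketch. It is not Modelplane's or Crossplane's code: the `Action` type, the `reconcile` function, and the use of replica counts keyed by deployment name are all hypothetical, chosen only to show how a control plane compares declared state with observed state and derives the work to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One step toward the desired state (hypothetical type for illustration)."""
    kind: str      # "create", "scale", or "delete"
    name: str      # deployment name
    replicas: int  # target replica count after the action


def reconcile(desired: dict[str, int], observed: dict[str, int]) -> list[Action]:
    """Return the actions that move `observed` toward `desired`.

    Both arguments map a deployment name to a replica count.
    """
    actions = []
    for name, want in sorted(desired.items()):
        have = observed.get(name)
        if have is None:
            # Declared but not running anywhere yet.
            actions.append(Action("create", name, want))
        elif have != want:
            # Running, but with the wrong replica count.
            actions.append(Action("scale", name, want))
    for name in sorted(observed.keys() - desired.keys()):
        # Running but no longer declared.
        actions.append(Action("delete", name, 0))
    return actions
```

A real control plane runs a step like this repeatedly, so once the observed state matches the declared state the function returns an empty list and nothing further happens.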

FAQ

Short answers to the questions that come up first, with links to the full treatment. If you’re new here, read the Introduction and How Modelplane works first.

What Modelplane is

Is Modelplane a serving engine like vLLM?

No. Modelplane is the control plane above the engine. It composes serving engines like vLLM, SGLang, and NVIDIA TensorRT-LLM, and operates them across a fleet of clusters. It doesn’t serve tokens itself. You bring the engine; Modelplane schedules it, routes to it, scales it, and caches its weights across your inference fleet.

Does Modelplane replace vLLM or SGLang?

No. They run the model; Modelplane runs the fleet. A ModelDeployment carries your engine container and its flags, and Modelplane composes it onto the right cluster. Switching or upgrading engines is a change to your deployment, not to Modelplane.

How is Modelplane different from KServe or NVIDIA Dynamo?

Scope. KServe and Dynamo are cluster orchestrators: they schedule, scale, route, and cache within a single Kubernetes cluster. Modelplane performs those operations across a fleet of clusters, clouds, and regions. Modelplane uses llm-d for multi-node serving and KV-cache management, as do KServe and Dynamo. Deeper integrations with NVIDIA Dynamo are planned for future releases.

How is Modelplane different from a managed provider like Baseten or Fireworks?

Managed providers run fleet-scale serving inside their own closed platform. Modelplane is the open equivalent that runs in infrastructure you own. The difference is not scope but openness: Modelplane is open source, runs in your own infrastructure, is community-driven, and is neutral across the stack. You can still route to a managed provider from Modelplane.

What it supports

What models does Modelplane support?

Modelplane supports any model, including open-weight models, custom models, and just about anything that can be downloaded from Hugging Face, NVIDIA NGC, and other registries.

Does Modelplane support NVIDIA?

Yes, across the stack. NVIDIA is the most widely available accelerator on the clouds Modelplane runs on and the primary target today. Modelplane binds NVIDIA GPUs to pods through Dynamic Resource Allocation (DRA), matching devices by attributes such as GPU memory and architecture with CEL selectors.
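To make the CEL matching concrete, here is a hedged Python sketch that builds a selector expression and wraps it in a Kubernetes ResourceClaimTemplate. It shows the general DRA shape, not the claims Modelplane itself generates. The helper names are hypothetical, and the sketch assumes the `resource.k8s.io/v1` request layout, a device class named `gpu.nvidia.com`, and `architecture` and `memory` as the attribute and capacity names published by the NVIDIA DRA driver. Check all of these against your cluster and driver version.

```python
def gpu_selector(driver: str, architecture: str, min_memory: str) -> str:
    """Build a CEL expression matching devices by architecture and minimum memory.

    `driver` is the DRA driver domain that namespaces the attributes. The
    attribute name `architecture` and capacity name `memory` are assumptions.
    """
    return (
        f'device.attributes["{driver}"].architecture == "{architecture}" && '
        f'device.capacity["{driver}"].memory.compareTo(quantity("{min_memory}")) >= 0'
    )


def gpu_claim_template(name: str, expression: str) -> dict:
    """Wrap a selector in a ResourceClaimTemplate manifest.

    Assumes the resource.k8s.io/v1 layout, where a request nests its device
    class and selectors under `exactly`. Older beta versions differ.
    """
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {"spec": {"devices": {"requests": [{
            "name": "gpu",
            "exactly": {
                "deviceClassName": "gpu.nvidia.com",  # assumed class name
                "selectors": [{"cel": {"expression": expression}}],
            },
        }]}}},
    }
```

Calling `gpu_selector("gpu.nvidia.com", "Hopper", "80Gi")` yields an expression that only devices with the Hopper architecture and at least 80Gi of memory would satisfy, and the resulting dictionary can be serialized to YAML or JSON for inspection.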

Glossary

Modelplane

The open source control plane software. You install Modelplane on a Kubernetes cluster (the control cluster). Modelplane never serves tokens itself; it orchestrates the clusters and engines that do.

Control cluster

The Kubernetes cluster where Modelplane runs. It needs no GPUs. It holds Modelplane’s Crossplane-based components and the API resources you apply to declare your fleet.

Inference cluster

A GPU cluster in the fleet where serving engines run and tokens are produced. Modelplane can provision inference clusters on EKS, GKE, and other providers, or you can bring your own through an InferenceCluster with source: Existing.

AI tools

The Modelplane docs are built to be read by AI assistants as well as people. You can connect a coding agent directly to this site, pull any page as Markdown, or point a model at a single index file that lists the whole documentation set. Every page also carries a Copy page menu next to its title with the same shortcuts.

Connect to the MCP server

The documentation MCP server lets an assistant search these docs and read any page in real time, so its answers track the current content instead of its training data. It exposes two tools: