Deploy Models on Modelplane Docs

Deploy a Model

API: modelplane.ai/v1alpha1 · ModelDeployment

A ModelDeployment is the ML team’s primary interface. You describe the model you want served, the hardware it needs, and how many copies to run; Modelplane schedules it onto matching clusters and keeps it running. You never name a cluster.

Modelplane is unopinionated about the engine itself. You bring the container and its flags, and Modelplane shapes a serving topology around it. The engine flags you write carry parallelism, quantization, and KV transfer, never injected by Modelplane.

Expose a Model

API: modelplane.ai/v1alpha1 · ModelService

A ModelDeployment serves a model, but its replicas are scattered across the fleet with no single address. A ModelService gives them one: a stable, unified, OpenAI-compatible URL that load-balances across every replica, wherever it runs.

A service selects what to route to by label. Behind the scenes, Modelplane creates one ModelEndpoint, a single reachable backend, for each replica of a deployment and labels it. Two of those labels carry routing intent:

Cache Model Weights

API: modelplane.ai/v1alpha1 · ModelCache

A ModelCache stages a model’s weights on shared workload-cluster storage, fetched once from the configured source rather than downloaded again on every pod start. ModelDeployments reference a cache via spec.modelCacheRef.name, and Modelplane mounts it at /mnt/models in every serving pod, shared across the pods of a multi-node engine. The engine reads weights locally from the mount.

ModelCache is recommended for multi-node deployments and optional for single-node cold-start optimization.

Route to External Providers

API: modelplane.ai/v1alpha1 · ModelEndpoint

A ModelEndpoint is a single reachable inference endpoint that a ModelService can route to. Modelplane creates one for each of your replicas automatically, but you can also create one by hand to point at an inference endpoint Modelplane doesn’t run, most often a SaaS provider like Together or Baseten. A service treats both the same, so you can front your own replicas and an external provider behind one URL: send overflow to the provider when your fleet is busy, or fail over to it as a break-glass option.