<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Deploy Models on Modelplane Docs</title><link>https://docs.modelplane.ai/models/</link><description>Recent content in Deploy Models on Modelplane Docs</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Mon, 01 Jan 0001 00:00:00 +0000</lastBuildDate><atom:link href="https://docs.modelplane.ai/models/index.xml" rel="self" type="application/rss+xml"/><item><title>Deploy a Model</title><link>https://docs.modelplane.ai/models/model-deployment/</link><pubDate/><guid>https://docs.modelplane.ai/models/model-deployment/</guid><description>&lt;p&gt;&lt;strong&gt;API:&lt;/strong&gt; &lt;a href="https://docs.modelplane.ai/reference/modeldeployments/"&gt;&lt;code&gt;modelplane.ai/v1alpha1&lt;/code&gt; · ModelDeployment&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;ModelDeployment&lt;/code&gt; is the ML team&amp;rsquo;s primary interface. You describe the model
you want served, the hardware it needs, and how many copies to run; Modelplane
schedules it onto matching clusters and keeps it running. You never name a
cluster.&lt;/p&gt;
&lt;p&gt;Modelplane is unopinionated about the engine itself. You bring the container and
its flags, and Modelplane shapes a serving topology around it. Parallelism,
quantization, and KV transfer all come from the engine flags you write; none are injected by
Modelplane.&lt;/p&gt;</description></item><item><title>Expose a Model</title><link>https://docs.modelplane.ai/models/model-service/</link><pubDate/><guid>https://docs.modelplane.ai/models/model-service/</guid><description>&lt;p&gt;&lt;strong&gt;API:&lt;/strong&gt; &lt;a href="https://docs.modelplane.ai/reference/modelservices/"&gt;&lt;code&gt;modelplane.ai/v1alpha1&lt;/code&gt; · ModelService&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A &lt;a href="https://docs.modelplane.ai/models/model-deployment/"&gt;&lt;code&gt;ModelDeployment&lt;/code&gt;&lt;/a&gt; serves a model, but its
replicas are scattered across the fleet with no single address. A &lt;code&gt;ModelService&lt;/code&gt;
gives them one: a stable, unified, OpenAI-compatible URL that load-balances
across every replica, wherever it runs.&lt;/p&gt;
&lt;p&gt;A service selects what to route to by label. Behind the scenes, Modelplane
creates one &lt;code&gt;ModelEndpoint&lt;/code&gt;, a single reachable backend, for each replica of a
deployment and labels it. Two of those labels carry routing intent:&lt;/p&gt;</description></item><item><title>Cache Model Weights</title><link>https://docs.modelplane.ai/models/model-cache/</link><pubDate/><guid>https://docs.modelplane.ai/models/model-cache/</guid><description>
&lt;p&gt;&lt;strong&gt;API:&lt;/strong&gt; &lt;a href="https://docs.modelplane.ai/reference/modelcaches/"&gt;&lt;code&gt;modelplane.ai/v1alpha1&lt;/code&gt; · ModelCache&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;ModelCache&lt;/code&gt; stages a model&amp;rsquo;s weights on shared workload-cluster storage.
The weights are fetched once from the configured source rather than downloaded again on every
pod start. A &lt;code&gt;ModelDeployment&lt;/code&gt; references a cache via &lt;code&gt;spec.modelCacheRef.name&lt;/code&gt;, and
Modelplane mounts it at &lt;code&gt;/mnt/models&lt;/code&gt; in every serving pod, shared across the
pods of a multi-node engine. The engine reads weights locally from the mount.&lt;/p&gt;
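&lt;p&gt;As a minimal sketch, a deployment might reference a cache like this. Only the
&lt;code&gt;spec.modelCacheRef.name&lt;/code&gt; path and the API group come from this page; both resource
names are hypothetical, and the other fields a real &lt;code&gt;ModelDeployment&lt;/code&gt; needs are
omitted.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-yaml"&gt;apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: example-model        # hypothetical deployment name
spec:
  modelCacheRef:
    name: example-weights    # hypothetical ModelCache name; mounted at /mnt/models
  # remaining ModelDeployment fields omitted (not described on this page)
&lt;/code&gt;&lt;/pre&gt;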
&lt;p&gt;&lt;code&gt;ModelCache&lt;/code&gt; is recommended for multi-node deployments and optional for
single-node cold-start optimization.&lt;/p&gt;</description></item><item><title>Route to External Providers</title><link>https://docs.modelplane.ai/models/model-endpoint/</link><pubDate/><guid>https://docs.modelplane.ai/models/model-endpoint/</guid><description>&lt;p&gt;&lt;strong&gt;API:&lt;/strong&gt; &lt;a href="https://docs.modelplane.ai/reference/modelendpoints/"&gt;&lt;code&gt;modelplane.ai/v1alpha1&lt;/code&gt; · ModelEndpoint&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;ModelEndpoint&lt;/code&gt; is a single reachable inference endpoint that a
&lt;a href="https://docs.modelplane.ai/models/model-service/"&gt;&lt;code&gt;ModelService&lt;/code&gt;&lt;/a&gt; can route to. Modelplane creates
one for each of your replicas automatically, but you can also create one by hand
to point at an inference endpoint Modelplane doesn&amp;rsquo;t run, most often a SaaS
provider like Together or Baseten. A service treats both the same, so you can
front your own replicas and an external provider behind one URL: send overflow to
the provider when your fleet is busy, or fail over to it as a break-glass option.&lt;/p&gt;</description></item></channel></rss>