InferenceCluster Custom Resource

An InferenceCluster is a Kubernetes cluster registered with Modelplane for model serving.

#Metadata

API version: modelplane.ai/v1alpha1
Kind: InferenceCluster
Scope: Cluster
Short names: ic

#Example

Manifest

apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: west-gke
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project
      region: us-central1
  nodePools:
    - name: h100-pool
      className: h100-8x-byo
      nodeCount: 2
      maxNodeCount: 10
      zones: [us-central1-a]

#Spec

EKS cluster configuration. Required when source is EKS.

EKS cluster Kubernetes version. Defaults to a version where Dynamic Resource Allocation (how GPUs bind to pods) is generally available.

AWS region for the cluster (e.g. us-west-2).
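
For comparison with the GKE example, an EKS cluster block might look like the fragment below. This is a sketch only: the `eks` key and its `region` field are assumed by analogy with the `gke` block in the example manifest and are not confirmed by this page. The Kubernetes version field is left out because its name is not given here.

```yaml
spec:
  cluster:
    source: EKS
    # Assumption: the EKS block mirrors the gke block's layout.
    eks:
      region: us-west-2   # example region taken from the field description
```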

Bring-your-own cluster configuration. Required when source is Existing. Modelplane manages the inference stack on the cluster but does not provision the cluster itself.

ModelCache configuration for this cluster.

Name of an existing ReadWriteMany StorageClass for ModelCache PVCs. Modelplane doesn’t provision storage on an existing cluster, so the admin must create the StorageClass (it must support ReadWriteMany dynamic provisioning).

Optional reference to a Secret containing cloud provider credentials for IAM-based authentication.

Reference to a Secret containing a kubeconfig for the existing cluster. The Secret must exist in the modelplane-system namespace.

GKE cluster configuration. Required when source is GKE.

Cluster provisioning method.

Capacity Block reservation backing this node pool. EKS only. Large GPU instances (e.g. p5en.48xlarge) are rarely available on demand; AWS allocates them via Capacity Blocks for ML. Set this to back the pool with a Capacity Block you have purchased. The pool’s zones must match the reservation’s Availability Zone, and nodeCount must not exceed the reserved instance count. Omit for on-demand pools.

The ID of the Capacity Reservation backing the Capacity Block (e.g. cr-0123456789abcdef0). Purchasing a Capacity Block yields this ID.

pattern: ^cr-[0-9a-f]+$

Name of the InferenceClass describing this pool’s hardware.

High-performance node-to-node fabric for multi-node engines. EKS only. None uses standard VPC networking (ENA/TCP). EFA attaches Elastic Fabric Adapter interfaces to each node for GPUDirect RDMA across nodes, so a gang’s tensor-parallel traffic isn’t capped by TCP. EFA is only useful on EFA-capable instance types (e.g. p5en.48xlarge). When any pool sets EFA, Modelplane installs the EFA DRA driver on the cluster, and the gang’s pods claim EFA devices alongside their GPUs.

Maximum node count for autoscaling. Omit for fixed-size pools.
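
The two sizing modes can be sketched with the field names from the example manifest. Pool names, counts, and zones are illustrative. How nodeCount behaves on an autoscaling pool (for instance, as a starting size) is not stated on this page, so the fragment makes no claim about it.

```yaml
spec:
  nodePools:
    # Fixed-size pool: maxNodeCount is omitted.
    - name: h100-fixed
      className: h100-8x-byo
      nodeCount: 2
      zones: [us-central1-a]
    # Autoscaling pool: maxNodeCount sets the upper bound.
    - name: h100-autoscaled
      className: h100-8x-byo
      nodeCount: 2
      maxNodeCount: 10
      zones: [us-central1-a]
```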

#Status

Observed ModelCache RWX storage state.

Effective ReadWriteMany StorageClass name for ModelCache PVCs on this cluster. ModelCache reads this to target the cache PVC.

External IP of the inference gateway on the remote cluster. Used by ModelDeployment for unified endpoint routing.

Node pool name, matching spec.nodePools[].name. Used to pin a ModelReplica to a specific pool via spec.nodePoolName.
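
A minimal sketch of pinning a replica to a pool follows. Only `spec.nodePoolName` comes from this page. The fragment assumes ModelReplica is served from the same modelplane.ai/v1alpha1 group and version as InferenceCluster, uses a hypothetical replica name, and omits the rest of the ModelReplica spec.

```yaml
# Assumption: ModelReplica shares the modelplane.ai/v1alpha1 group/version.
apiVersion: modelplane.ai/v1alpha1
kind: ModelReplica
metadata:
  name: example-replica    # hypothetical name
spec:
  nodePoolName: h100-pool  # a pool name from the InferenceCluster's spec.nodePools[].name
  # remaining ModelReplica fields omitted
```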

Number of nodes in this pool. Derived from maxNodeCount when the pool autoscales, otherwise from nodeCount.

Namespace where the internal composite resources (XRs) for the cluster and backend were created.

Name of the ProviderConfig targeting the remote cluster. Used by ModelReplica to create resources on the cluster.