InferenceCluster Custom Resource
A Kubernetes cluster registered with Modelplane for model serving.
Concept guide: Register a Cluster →
#Metadata
#Example
Manifest
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: west-gke
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project
      region: us-central1
  nodePools:
    - name: h100-pool
      className: h100-8x-byo
      nodeCount: 2
      maxNodeCount: 10
      zones: [us-central1-a]
#Spec
EKS cluster configuration. Required when source is EKS.
Bring-your-own cluster configuration. Required when source is Existing. Modelplane manages the inference stack on the cluster but does not provision the cluster itself.
ModelCache configuration for this cluster.
Name of an existing ReadWriteMany StorageClass for ModelCache PVCs. Modelplane doesn’t provision storage on an existing cluster, so the admin must create the StorageClass (it must support ReadWriteMany dynamic provisioning).
Optional reference to a Secret containing cloud provider credentials for IAM-based authentication.
GKE cluster configuration. Required when source is GKE.
Cluster provisioning method.
Capacity Block reservation backing this node pool. EKS only. Large GPU instances (e.g. p5en.48xlarge) are rarely available on demand; AWS allocates them via Capacity Blocks for ML. Set this to back the pool with a Capacity Block you have purchased. The pool’s zones must match the reservation’s Availability Zone, and nodeCount must not exceed the reserved instance count. Omit for on-demand pools.
The ID of the Capacity Reservation backing the Capacity Block (e.g. cr-0123456789abcdef0). Purchasing a Capacity Block yields this ID.
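The Capacity Block constraints can be sketched as a node pool fragment. This is only an illustration: the pool name, InferenceClass name, and Availability Zone are hypothetical, and the key names for the Capacity Block field and its reservation ID do not appear on this page, so they are left as commented placeholders rather than guessed.

```yaml
spec:
  nodePools:
    - name: p5en-reserved        # hypothetical pool name
      className: p5en-class      # hypothetical InferenceClass name
      nodeCount: 2               # must not exceed the reserved instance count
      zones: [us-east-1a]        # hypothetical AZ; must match the reservation's AZ
      # The Capacity Block field goes here. Its key names are not shown on
      # this page, so the lines below are placeholders, not the real schema:
      # <capacity-block-field>:
      #   <reservation-id-field>: cr-0123456789abcdef0
```

For an on-demand pool, the Capacity Block field is simply left out.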
Name of the InferenceClass describing this pool’s hardware.
High-performance node-to-node fabric for multi-node engines. None uses standard VPC networking (ENA/TCP). EFA attaches Elastic Fabric Adapter interfaces to each node for GPUDirect RDMA across nodes, so a gang’s tensor-parallel traffic isn’t capped by TCP. EKS only. Only useful on EFA-capable instance types (e.g. p5en.48xlarge). When any pool sets EFA, Modelplane installs the EFA DRA driver on the cluster and the gang’s pods claim EFA devices alongside their GPUs.
Maximum node count for autoscaling. Omit for fixed-size pools.
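As a sketch of the two sizing modes, the fragment below reuses the field names from the example manifest (name, className, nodeCount, maxNodeCount, zones). The pool names are illustrative, and treating nodeCount as the starting size of an autoscaling pool is an assumption rather than something this reference states.

```yaml
spec:
  nodePools:
    # Fixed-size pool: maxNodeCount is omitted, so the pool stays at nodeCount.
    - name: h100-fixed           # hypothetical pool name
      className: h100-8x-byo
      nodeCount: 2
      zones: [us-central1-a]
    # Autoscaling pool: maxNodeCount sets the upper bound.
    - name: h100-autoscale       # hypothetical pool name
      className: h100-8x-byo
      nodeCount: 2               # assumed to be the initial size
      maxNodeCount: 10
      zones: [us-central1-a]
```

Going by the Status description of the per-pool node count, these two pools would be expected to report 2 and 10 nodes respectively.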
#Status
Observed ModelCache RWX storage state.
Effective ReadWriteMany StorageClass name for ModelCache PVCs on this cluster. ModelCache reads this to target the cache PVC.
External IP of the inference gateway on the remote cluster. Used by ModelDeployment for unified endpoint routing.
Node pool name, matching spec.nodePools[].name. Used to pin a ModelReplica to a specific pool via spec.nodePoolName.
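A minimal sketch of pinning a replica to a pool, assuming ModelReplica lives in the same modelplane.ai/v1alpha1 API group as InferenceCluster. The resource name is hypothetical, and every ModelReplica field other than spec.nodePoolName is omitted because it is not described on this page.

```yaml
apiVersion: modelplane.ai/v1alpha1   # assumed: same group/version as InferenceCluster
kind: ModelReplica
metadata:
  name: example-replica              # hypothetical name
spec:
  # Must match a spec.nodePools[].name on the target InferenceCluster.
  nodePoolName: h100-pool
  # Other ModelReplica fields are omitted; they are not covered on this page.
```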
Number of nodes in this pool. Equal to maxNodeCount when the pool autoscales, otherwise nodeCount.
Namespace where the internal composite resources (XRs) for the cluster and backend were created.