Scale the platform
You have one L4 cluster with a running model. In this guide, you’ll add two larger-GPU clusters in different regions to grow the fleet available to the ML team.
Provisioning two more clusters takes about 10 to 15 minutes.
Register more clusters
On EKS, register two more clusters with a larger hardware class, L40S (48 Gi), in
us-west and eu-central:
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: l40s-1x-g6e
spec:
  description: "EKS g6e.xlarge, 1x NVIDIA L40S"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6e.xlarge
      diskSizeGb: 100
      accelerator:
        type: nvidia-l40s
        count: 1
  devices:
    - name: gpu
      claim: DRA
      driver: gpu.nvidia.com
      deviceClassName: gpu.nvidia.com
      count: 1
      attributes:
        architecture: { string: Ada Lovelace }
      capacity:
        memory: { value: "46068Mi" }
---
# g6e.xlarge is available in us-east-1, us-west-2, and eu-central-1.
# eu-west-1 does NOT have g6e.xlarge.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-us-west
  labels:
    modelplane.ai/region: us-west
spec:
  cluster:
    source: EKS
    eks:
      region: us-west-2
      nodePools:
        - name: gpu-l40s
          className: l40s-1x-g6e
          nodeCount: 1
          minNodeCount: 1
          maxNodeCount: 1
          zones:
            - us-west-2a
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-eu-central
  labels:
    modelplane.ai/region: eu-central
spec:
  cluster:
    source: EKS
    eks:
      region: eu-central-1
      nodePools:
        - name: gpu-l40s
          className: l40s-1x-g6e
          nodeCount: 1
          minNodeCount: 1
          maxNodeCount: 1
          zones:
            - eu-central-1a
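The comment in the manifest notes that g6e.xlarge is not offered in every region, and each node pool pins a single zone. One way to check a zone before registering a cluster is to ask EC2 which zones offer the instance type. This is a minimal sketch, assuming the AWS CLI is installed and credentials are configured; the function names `zones_offering` and `zone_offers` are illustrative and not part of Modelplane.

```shell
#!/usr/bin/env bash
# Sketch: check whether an Availability Zone offers an EC2 instance type.
# Assumes the AWS CLI is installed and credentials are configured.
# Function names are illustrative, not part of Modelplane.

# zones_offering INSTANCE_TYPE REGION
# Prints the zone names EC2 returns for the instance type, whitespace-separated.
zones_offering() {
  local instance_type="$1" region="$2"
  aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=${instance_type}" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}

# zone_offers INSTANCE_TYPE REGION ZONE
# Returns 0 if ZONE is in the offering list, non-zero otherwise.
zone_offers() {
  local zone="$3"
  zones_offering "$1" "$2" | tr -s '[:space:]' '\n' | grep -qx "$zone"
}

# Example (not run here):
#   zone_offers g6e.xlarge us-west-2 us-west-2a && echo "us-west-2a offers g6e.xlarge"
```

If the check fails for the zone in the manifest, pick another zone from the `zones_offering` output and update the node pool's `zones` list to match.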
g6e.xlarge runs ~$2/hr on demand. Two of them plus the L4 from earlier is a
few dollars for this tour. Clean up when you’re done (see Clean
up).
On GKE, register two more clusters with a larger hardware class, A100 (40 Gi), in
us-west and us-east. Apply the manifest, setting each cluster’s project to
your GCP project:
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-a100-40-1x
spec:
  description: "GKE a2-highgpu-1g, 1x NVIDIA A100 40GB"
  provisioning:
    provider: GKE
    gke:
      machineType: a2-highgpu-1g
      diskSizeGb: 200
      accelerator:
        type: nvidia-tesla-a100
        count: 1
  devices:
    - name: gpu
      claim: DRA
      driver: gpu.nvidia.com
      deviceClassName: gpu.nvidia.com
      count: 1
      attributes:
        architecture: { string: Ampere }
        cudaComputeCapability: { version: "8.0.0" }
      capacity:
        # A100 40GB real reported VRAM. Keep the selector at >= 35Gi (not >= 40Gi)
        # so it reliably clears the L4 (24Gi) without hitting the boundary.
        memory: { value: "40960Mi" }
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gpu-us-west
  labels:
    modelplane.ai/region: us-west
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project
      region: us-west1
      nodePools:
        - name: gpu-a100
          className: gke-a100-40-1x
          nodeCount: 1
          minNodeCount: 1 # keep >=1; the autoscaler can't scale a GPU pool up from 0 for DRA pods
          maxNodeCount: 2
          zones:
            - us-west1-b
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gpu-us-east
  labels:
    modelplane.ai/region: us-east
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project
      region: us-east1
      nodePools:
        - name: gpu-a100
          className: gke-a100-40-1x
          nodeCount: 1
          minNodeCount: 1 # keep >=1; the autoscaler can't scale a GPU pool up from 0 for DRA pods
          maxNodeCount: 2
          zones:
            - us-east1-b
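The capacity comment in the class above recommends a memory selector of 35Gi rather than 40Gi. A quick unit conversion shows why: the declared 40960Mi is exactly 40Gi, so a 40Gi threshold leaves no headroom. The sketch below is only arithmetic on the values in this guide. The helper names are illustrative, the L4 figure uses the nominal 24Gi (24576Mi) rather than a measured value, and it assumes selectors compare against the declared capacity.

```shell
#!/usr/bin/env bash
# Sketch: compare declared GPU memory (in Mi) against selector thresholds (in Gi).
# Helper names are illustrative; the L4 value is the nominal 24Gi, not a measurement.

# mi_to_gi MI: convert a Mi quantity (digits only) to Gi with two decimals.
mi_to_gi() {
  awk -v mi="$1" 'BEGIN { printf "%.2f\n", mi / 1024 }'
}

# clears_threshold MI GI: succeed if MI mebibytes is at least GI gibibytes.
clears_threshold() {
  [ "$1" -ge $(( $2 * 1024 )) ]
}

mi_to_gi 40960   # A100 as declared above: prints 40.00
mi_to_gi 46068   # L40S as declared in the EKS class: prints 44.99

# A 35Gi threshold admits the A100 and the L40S but not a 24Gi L4.
clears_threshold 40960 35 && echo "A100 clears 35Gi"
clears_threshold 46068 35 && echo "L40S clears 35Gi"
clears_threshold 24576 35 || echo "L4 does not clear 35Gi"

# Headroom against a 40Gi threshold is zero for the A100.
echo "A100 headroom at 40Gi: $(( 40960 - 40 * 1024 ))Mi"
```

The same arithmetic shows that a threshold written against the L40S's nominal 48Gi would not match its declared 46068Mi, which is another reason to set thresholds below the nominal size.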
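# Fetch the manifest above, swap in your project ID, and apply it.
# YOUR_PROJECT_ID is a placeholder; replace it with your GCP project ID.
curl -fsSL https://docs.modelplane.ai/examples/getting-started/gke/platform-scale.yaml \
  | sed 's/my-gcp-project/YOUR_PROJECT_ID/g' \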
  | kubectl apply -f -
a2-highgpu-1g runs ~$3.50/hr on demand. Two of them plus the L4 from earlier
is a few dollars for this tour. Clean up when you’re done (see Clean
up).
Modelplane provisions both clusters in parallel:
kubectl wait --for=condition=Ready ic --all --timeout=20m
Your model keeps running
Growing the fleet doesn’t disturb anything already deployed. qwen-demo stays
on its original cluster and the two new clusters add capacity the moment
they’re Ready with no interruption for the ML team. A replica only moves if
its deployment changes in a way that no longer fits where it runs.
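To see when each region's capacity comes online, one option is to query the fleet by the modelplane.ai/region label set in the manifests above. This is a minimal sketch, assuming kubectl points at the cluster where Modelplane runs and that `ic` resolves to InferenceCluster as in the wait command earlier. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: inspect the fleet by region label.
# Assumes kubectl targets the cluster running Modelplane and that the short
# name "ic" resolves to InferenceCluster. Function names are illustrative.

# fleet_by_region: list every InferenceCluster with its region label as a column.
fleet_by_region() {
  kubectl get ic -L modelplane.ai/region
}

# wait_region_ready REGION [TIMEOUT]
# Blocks until every InferenceCluster labeled REGION reports Ready.
# TIMEOUT defaults to 20m, matching the wait command earlier in this guide.
wait_region_ready() {
  local region="$1" timeout="${2:-20m}"
  kubectl wait --for=condition=Ready ic \
    -l "modelplane.ai/region=${region}" \
    --timeout="${timeout}"
}

# Example (not run here):
#   fleet_by_region
#   wait_region_ready us-west
```

Waiting per region is useful when only one of the new clusters matters for the next step. Note that kubectl wait reports an error if the selector matches no resources.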
Next step
The fleet now spans three clusters across three regions. The ML team is next. Scale the model to serve it from two regions behind a single endpoint.