Scale the platform

You have one L4 cluster with a running model. In this guide, you’ll add two larger-GPU clusters in different regions to grow the fleet available to the ML team.

Provisioning two more clusters takes about 10 to 15 minutes.

Register more clusters

On EKS, register two more clusters with a bigger hardware class: L40S in us-west-2 and eu-central-1. The manifest defines the class and both clusters:

apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: l40s-1x-g6e
spec:
  description: "EKS g6e.xlarge, 1x NVIDIA L40S"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6e.xlarge
      diskSizeGb: 100
      accelerator:
        type: nvidia-l40s
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      memory: { value: "46068Mi" }
---
# g6e.xlarge is available in us-east-1, us-west-2, and eu-central-1.
# eu-west-1 does NOT have g6e.xlarge.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-us-west
  labels:
    modelplane.ai/region: us-west
spec:
  cluster:
    source: EKS
    eks:
      region: us-west-2
  nodePools:
  - name: gpu-l40s
    className: l40s-1x-g6e
    nodeCount: 1
    minNodeCount: 1
    maxNodeCount: 1
    zones:
    - us-west-2a
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-eu-central
  labels:
    modelplane.ai/region: eu-central
spec:
  cluster:
    source: EKS
    eks:
      region: eu-central-1
  nodePools:
  - name: gpu-l40s
    className: l40s-1x-g6e
    nodeCount: 1
    minNodeCount: 1
    maxNodeCount: 1
    zones:
    - eu-central-1a

Note

g6e.xlarge runs ~$2/hr on demand. Two of them plus the L4 from earlier is a few dollars for this tour. Clean up when you’re done (see Clean up).

On GKE, register two more clusters with a bigger hardware class: A100 40GB in us-west and us-east. Apply the manifest, replacing the my-gcp-project placeholder in each cluster’s project field with your GCP project ID:

apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-a100-40-1x
spec:
  description: "GKE a2-highgpu-1g, 1x NVIDIA A100 40GB"
  provisioning:
    provider: GKE
    gke:
      machineType: a2-highgpu-1g
      diskSizeGb: 200
      accelerator:
        type: nvidia-tesla-a100
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ampere }
      cudaComputeCapability: { version: "8.0.0" }
    capacity:
      # Reported VRAM of the A100 40GB. Keep the selector at >= 35Gi (not >= 40Gi)
      # so it reliably clears the L4 (24Gi) without sitting on the 40Gi boundary.
      memory: { value: "40960Mi" }
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gpu-us-west
  labels:
    modelplane.ai/region: us-west
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project
      region: us-west1
  nodePools:
  - name: gpu-a100
    className: gke-a100-40-1x
    nodeCount: 1
    minNodeCount: 1   # keep >=1; the autoscaler can't scale a GPU pool up from 0 for DRA pods
    maxNodeCount: 2
    zones:
    - us-west1-b
---
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gpu-us-east
  labels:
    modelplane.ai/region: us-east
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project
      region: us-east1
  nodePools:
  - name: gpu-a100
    className: gke-a100-40-1x
    nodeCount: 1
    minNodeCount: 1   # keep >=1; the autoscaler can't scale a GPU pool up from 0 for DRA pods
    maxNodeCount: 2
    zones:
    - us-east1-b

# Replace YOUR_PROJECT_ID with your GCP project ID before running.
curl -fsSL https://docs.modelplane.ai/examples/getting-started/gke/platform-scale.yaml \
  | sed 's/my-gcp-project/YOUR_PROJECT_ID/g' \
  | kubectl apply -f -
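
Before piping the manifest to kubectl apply, it can help to confirm that the substitution does what you expect. The following is a minimal sketch on a two-line sample rather than the full manifest. The helper name set_project and the project ID example-project-123 are hypothetical and are not part of Modelplane.

```shell
# Hypothetical helper: replace the placeholder project ID in a manifest read from stdin.
set_project() {
  # $1 is your GCP project ID; my-gcp-project is the placeholder used in the manifest.
  sed "s/my-gcp-project/$1/g"
}

# Dry run on a small sample instead of the downloaded manifest.
printf '      project: my-gcp-project\n      region: us-west1\n' \
  | set_project example-project-123
```

If the project line shows your ID and the region line is unchanged, the same sed expression is safe to use in the pipeline above.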

Note

a2-highgpu-1g runs ~$3.50/hr on demand. Two of them plus the L4 from earlier is a few dollars for this tour. Clean up when you’re done (see Clean up).
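
The selector advice in the A100 class comment comes down to simple arithmetic, sketched below. The A100 and L40S capacities are taken from the manifests on this page. The L4 value of 24576Mi is an assumption based on the 24Gi figure in that comment, and the clears helper is hypothetical.

```shell
# Hypothetical helper: succeeds when a capacity in Mi meets a threshold in Gi.
clears() {
  # clears <capacity_mi> <threshold_gi>
  [ "$1" -ge $(( $2 * 1024 )) ]
}

# A100 and L40S values come from the manifests; the L4 value is assumed (24Gi).
for entry in l4:24576 a100-40:40960 l40s:46068; do
  name=${entry%%:*}
  mi=${entry##*:}
  if clears "$mi" 35; then verdict="matches"; else verdict="does not match"; fi
  echo "$name ${mi}Mi >= 35Gi: $verdict"
done
```

With a 40Gi threshold, the A100 passes only because 40960Mi equals the threshold exactly, so any lower reported value would drop out. A 35Gi threshold leaves headroom while still excluding the L4.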

Modelplane provisions both clusters in parallel. Wait for them to become Ready:

kubectl wait --for=condition=Ready ic --all --timeout=20m

Your model keeps running

Growing the fleet doesn’t disturb anything already deployed. qwen-demo stays on its original cluster, and the two new clusters add capacity the moment they’re Ready, with no interruption for the ML team. A replica only moves if its deployment changes in a way that no longer fits the cluster it runs on.

Next step

The fleet now spans three clusters across three regions. The ML team is next. Scale the model to serve it from two regions behind a single endpoint.