Qwen3-Coder-480B

On this page

A 480B code MoE (35B active). Two validated shapes: the BF16 weights span two H200 nodes as a gang over EFA, served from a ModelCache; the FP8 checkpoint fits one node, so it runs as a single Standalone engine on SGLang with no cache.

Both shapes were run end to end; the InferenceClass and ModelDeployment are the exact manifests from those runs. Apply the platform side first, then the ML side. The InferenceCluster carries an EC2 capacity reservation placeholder to edit before applying.

Platform

# InferenceClass for the H200 shape, validated serving Qwen3-Coder-480B
# multi-node on EKS. 8x NVIDIA H200 on an EKS p5en.48xlarge, with EFA.
#
# Both the GPU and the EFA fabric are claim: DRA devices. A multi-node gang's
# nodeSelector requests both, so the scheduler co-schedules the whole gang on a
# pool that has them and DRA binds 8 GPUs + 16 EFA interfaces per pod. The EFA
# device is installed by the EFA DRA driver (DRANET) in the serving stack.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-h200-8x-p5en
spec:
  description: "EKS p5en.48xlarge, 8x NVIDIA H200, EFA"
  provisioning:
    provider: EKS
    eks:
      instanceType: p5en.48xlarge
      diskSizeGb: 1024
      accelerator:
        type: nvidia-h200
        count: 8
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 8
    attributes:
      architecture: { string: Hopper }
    capacity:
      memory: { value: "140Gi" }   # advertised below the ~141 GiB the driver reports
  - name: efa
    claim: DRA
    driver: dra.net
    deviceClassName: efa.networking.k8s.aws
    count: 16

# An EKS InferenceCluster with a two-node H200 pool over EFA, validated serving
# Qwen3-Coder-480B as a multi-node gang. The H200 nodes come from an EC2
# Capacity Block reserved for ML.
#
# fabric: EFA turns on Elastic Fabric Adapter for the gang's cross-node traffic;
# without it multi-node NCCL falls back to TCP, which is slow and unstable.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-coder
  labels:
    modelplane.ai/region: us
spec:
  cluster:
    source: EKS
    eks:
      region: us-east-2
  nodePools:
  - name: gpu-h200
    className: eks-h200-8x-p5en
    nodeCount: 2
    minNodeCount: 2
    maxNodeCount: 2
    zones:
    - us-east-2b
    fabric: EFA
    capacityBlock:
      capacityReservationId: cr-0123456789abcdef0  # replace with your reservation ID

curl -fsSL https://docs.modelplane.ai/examples/examples/qwen3-coder/inference-cluster.yaml \
  | sed 's/cr-0123456789abcdef0//' \
  | kubectl apply -f -

# InferenceClass for the H200 shape without EFA, validated serving the FP8
# Qwen3-Coder-480B checkpoint single-node on SGLang.
#
# The FP8 weights (~480 GB) fit on one 8x H200 node, so this needs no second
# node, no fabric, and no ModelCache - the GPU is the only device.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-h200-8x-p5en
spec:
  description: "EKS p5en.48xlarge, 8x NVIDIA H200"
  provisioning:
    provider: EKS
    eks:
      instanceType: p5en.48xlarge
      diskSizeGb: 1024
      accelerator:
        type: nvidia-h200
        count: 8
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 8
    attributes:
      architecture: { string: Hopper }
    capacity:
      memory: { value: "140Gi" }   # advertised below the ~141 GiB the driver reports

Deployment

# The shared, read-write-many cache the multi-node gang serves from. Hydrated
# once per matched cluster from the gated Hugging Face repo; every gang pod
# mounts it at /mnt/models. ~960 GB of BF16 weights, so sizeGiB leaves headroom.
#
# The repo is gated, so it needs a Hugging Face token. Create the authSecret once
# in the ModelCache's namespace on the control plane; Modelplane propagates it to
# each matched cluster.
apiVersion: modelplane.ai/v1alpha1
kind: ModelCache
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  source: HuggingFace
  huggingFace:
    repo: Qwen/Qwen3-Coder-480B-A35B-Instruct
    authSecret:
      name: hf-token
      key: HF_TOKEN
    sizeGiB: 1100

# Qwen3-Coder-480B served BF16 across two H200 nodes, validated end to end on
# EKS over EFA. A 480B MoE doesn't fit one node, so the engine is a Leader +
# Worker gang spanning two nodes via LeaderWorkerSet, both pods mounting the
# shared ModelCache at /mnt/models.
#
# Each member requests 8 GPUs + 16 EFA interfaces per node; the scheduler
# co-schedules the gang on the H200 pool. The worker joins the leader through
# $(MODELPLANE_LEADER_ADDRESS), which Modelplane injects.
#
# Notes on the engine flags:
#   --distributed-executor-backend=mp with --nnodes/--node-rank/--master-addr/
#     --headless is vLLM's native multiprocessing multi-node path.
#     vllm/vllm-openai:v0.23.0 no longer ships Ray, so the Ray-based
#     multi-node-serving.sh helper doesn't work on this image; the MP backend
#     needs nothing extra.
#   TP8 x PP2: tensor-parallel within a node over NVLink, pipeline-parallel
#     across the two nodes. tensor-parallel-size = GPUs per node,
#     pipeline-parallel-size = nodes.
#   --tool-call-parser=qwen3_xml is the parser for Qwen3-Coder specifically
#     (the dense Qwen3 models use hermes). The model is non-thinking, so there's
#     no reasoning parser.
#   --max-model-len=32768 caps context to fit; the native 256K isn't needed.
#   FI_PROVIDER=efa / NCCL_DEBUG=INFO point NCCL at the EFA fabric.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  replicas: 1
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us
  modelCacheRef:
    name: qwen3-coder
  engines:
  - name: qwen3-coder
    members:
    - role: Leader
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.driver == "gpu.nvidia.com" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
        - name: efa
          count: 16
          selectors:
          - cel: |
              device.driver == "dra.net"
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: FI_PROVIDER
              value: "efa"
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2 --node-rank=0
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --max-model-len=32768
              --gpu-memory-utilization=0.92
              --enable-auto-tool-choice
              --tool-call-parser=qwen3_xml
              --port=8000
    - role: Worker
      worker:
        nodes: 1
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.driver == "gpu.nvidia.com" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
        - name: efa
          count: 16
          selectors:
          - cel: |
              device.driver == "dra.net"
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: FI_PROVIDER
              value: "efa"
            command:
            - /bin/sh
            - -c
            - >-
              exec vllm serve /mnt/models
              --served-model-name=qwen3-coder
              --tensor-parallel-size=8
              --pipeline-parallel-size=2
              --distributed-executor-backend=mp
              --nnodes=2 --node-rank=1
              --master-addr=$(MODELPLANE_LEADER_ADDRESS)
              --headless
              --max-model-len=32768
              --gpu-memory-utilization=0.92

# Exposes the multi-node BF16 qwen3-coder deployment as a single
# OpenAI-compatible URL. Read the public address from status.address:
#   kubectl get ms qwen3-coder -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-coder
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-coder

# Qwen3-Coder-480B served FP8 on a single 8x H200 node with SGLang, validated
# end to end on EKS. The FP8 checkpoint (~480 GB) fits one node, so this is a
# single Standalone engine: no second node, no EFA, no ModelCache. The engine
# pulls the public FP8 repo straight to the node's local disk.
#
# SGLang-specific notes:
#   --ep-size 8 is required, not optional. Pure --tp-size 8 fails at FP8 weight
#     creation ("output_size ... not divisible by ... block_n = 128"): the
#     block-FP8 MoE doesn't shard evenly across 8 tensor-parallel ranks. Expert
#     parallelism shards whole experts and gets past it.
#   --tool-call-parser qwen3_coder is SGLang's parser name for this model
#     (vLLM's is qwen3_xml). The model is non-thinking, so no reasoning parser.
#   Image tag matters: lmsysorg/sglang v0.5.11-v0.5.13(.post1) -runtime images
#     are broken (ModuleNotFoundError: distro). v0.5.10.post1-runtime is the
#     most recent clean tag with Qwen3-Coder support.
#   --host 0.0.0.0 --port 8000: SGLang defaults to 127.0.0.1:30000, but
#     Modelplane's contract is 0.0.0.0:8000 with a /health probe. Args pass
#     through verbatim - Modelplane injects nothing for a non-vLLM engine.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen3-coder-sgl
  namespace: ml-team
spec:
  replicas: 1
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us
  engines:
  - name: qwen3-coder-sgl
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.driver == "gpu.nvidia.com" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: lmsysorg/sglang:v0.5.10.post1-runtime
            command:
            - /bin/sh
            - -c
            - >-
              exec python3 -m sglang.launch_server
              --model-path Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
              --served-model-name qwen3-coder
              --tp-size 8
              --ep-size 8
              --context-length 32768
              --page-size 32
              --trust-remote-code
              --tool-call-parser qwen3_coder
              --host 0.0.0.0
              --port 8000

# Exposes the single-node FP8 qwen3-coder-sgl deployment as a single
# OpenAI-compatible URL. Read the public address from status.address:
#   kubectl get ms qwen3-coder-sgl -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-coder-sgl
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen3-coder-sgl