Qwen3-Coder-480B
On this page
A 480B code MoE (35B active). Two validated shapes: the BF16 weights span two
H200 nodes as a gang over EFA, served from a ModelCache; the FP8 checkpoint
fits one node, so it runs as a single Standalone engine on SGLang with no
cache.
Both shapes were run end to end; the InferenceClass and ModelDeployment are
the exact manifests from those runs. Apply the platform side first, then the ML
side. The InferenceCluster carries an EC2 capacity reservation placeholder to
edit before applying.
Platform
inference-class.yaml
# InferenceClass for the H200 shape, validated serving Qwen3-Coder-480B
# multi-node on EKS. 8x NVIDIA H200 on an EKS p5en.48xlarge, with EFA.
#
# Both the GPU and the EFA fabric are claim: DRA devices. A multi-node gang's
# nodeSelector requests both, so the scheduler co-schedules the whole gang on a
# pool that has them and DRA binds 8 GPUs + 16 EFA interfaces per pod. The EFA
# device is installed by the EFA DRA driver (DRANET) in the serving stack.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
name: eks-h200-8x-p5en
spec:
description: "EKS p5en.48xlarge, 8x NVIDIA H200, EFA"
provisioning:
provider: EKS
eks:
instanceType: p5en.48xlarge
diskSizeGb: 1024
accelerator:
type: nvidia-h200
count: 8
devices:
- name: gpu
claim: DRA
driver: gpu.nvidia.com
deviceClassName: gpu.nvidia.com
count: 8
attributes:
architecture: { string: Hopper }
capacity:
memory: { value: "140Gi" } # advertised below the ~141 GiB the driver reports
- name: efa
claim: DRA
driver: dra.net
deviceClassName: efa.networking.k8s.aws
count: 16
inference-cluster.yaml
# An EKS InferenceCluster with a two-node H200 pool over EFA, validated serving
# Qwen3-Coder-480B as a multi-node gang. The H200 nodes come from an EC2
# Capacity Block reserved for ML.
#
# fabric: EFA turns on Elastic Fabric Adapter for the gang's cross-node traffic;
# without it multi-node NCCL falls back to TCP, which is slow and unstable.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
name: eks-coder
labels:
modelplane.ai/region: us
spec:
cluster:
source: EKS
eks:
region: us-east-2
nodePools:
- name: gpu-h200
className: eks-h200-8x-p5en
nodeCount: 2
minNodeCount: 2
maxNodeCount: 2
zones:
- us-east-2b
fabric: EFA
capacityBlock:
capacityReservationId: cr-0123456789abcdef0 # replace with your reservation ID
bash
curl -fsSL https://docs.modelplane.ai/examples/examples/qwen3-coder/inference-cluster.yaml \
| sed 's/cr-0123456789abcdef0//' \
| kubectl apply -f -inference-class-fp8.yaml
# InferenceClass for the H200 shape without EFA, validated serving the FP8
# Qwen3-Coder-480B checkpoint single-node on SGLang.
#
# The FP8 weights (~480 GB) fit on one 8x H200 node, so this needs no second
# node, no fabric, and no ModelCache - the GPU is the only device.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
name: eks-h200-8x-p5en
spec:
description: "EKS p5en.48xlarge, 8x NVIDIA H200"
provisioning:
provider: EKS
eks:
instanceType: p5en.48xlarge
diskSizeGb: 1024
accelerator:
type: nvidia-h200
count: 8
devices:
- name: gpu
claim: DRA
driver: gpu.nvidia.com
deviceClassName: gpu.nvidia.com
count: 8
attributes:
architecture: { string: Hopper }
capacity:
memory: { value: "140Gi" } # advertised below the ~141 GiB the driver reports
Deployment
model-cache.yaml
# The shared, read-write-many cache the multi-node gang serves from. Hydrated
# once per matched cluster from the gated Hugging Face repo; every gang pod
# mounts it at /mnt/models. ~960 GB of BF16 weights, so sizeGiB leaves headroom.
#
# The repo is gated, so it needs a Hugging Face token. Create the authSecret once
# in the ModelCache's namespace on the control plane; Modelplane propagates it to
# each matched cluster.
apiVersion: modelplane.ai/v1alpha1
kind: ModelCache
metadata:
name: qwen3-coder
namespace: ml-team
spec:
source: HuggingFace
huggingFace:
repo: Qwen/Qwen3-Coder-480B-A35B-Instruct
authSecret:
name: hf-token
key: HF_TOKEN
sizeGiB: 1100
model-deployment.yaml
# Qwen3-Coder-480B served BF16 across two H200 nodes, validated end to end on
# EKS over EFA. A 480B MoE doesn't fit one node, so the engine is a Leader +
# Worker gang spanning two nodes via LeaderWorkerSet, both pods mounting the
# shared ModelCache at /mnt/models.
#
# Each member requests 8 GPUs + 16 EFA interfaces per node; the scheduler
# co-schedules the gang on the H200 pool. The worker joins the leader through
# $(MODELPLANE_LEADER_ADDRESS), which Modelplane injects.
#
# Notes on the engine flags:
# --distributed-executor-backend=mp with --nnodes/--node-rank/--master-addr/
# --headless is vLLM's native multiprocessing multi-node path.
# vllm/vllm-openai:v0.23.0 no longer ships Ray, so the Ray-based
# multi-node-serving.sh helper doesn't work on this image; the MP backend
# needs nothing extra.
# TP8 x PP2: tensor-parallel within a node over NVLink, pipeline-parallel
# across the two nodes. tensor-parallel-size = GPUs per node,
# pipeline-parallel-size = nodes.
# --tool-call-parser=qwen3_xml is the parser for Qwen3-Coder specifically
# (the dense Qwen3 models use hermes). The model is non-thinking, so there's
# no reasoning parser.
# --max-model-len=32768 caps context to fit; the native 256K isn't needed.
# FI_PROVIDER=efa / NCCL_DEBUG=INFO point NCCL at the EFA fabric.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
name: qwen3-coder
namespace: ml-team
spec:
replicas: 1
clusterSelector:
matchLabels:
modelplane.ai/region: us
modelCacheRef:
name: qwen3-coder
engines:
- name: qwen3-coder
members:
- role: Leader
nodeSelector:
devices:
- name: gpu
count: 8
selectors:
- cel: |
device.driver == "gpu.nvidia.com" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
- name: efa
count: 16
selectors:
- cel: |
device.driver == "dra.net"
template:
spec:
containers:
- name: engine
image: vllm/vllm-openai:v0.23.0
env:
- name: NCCL_DEBUG
value: "INFO"
- name: FI_PROVIDER
value: "efa"
command:
- /bin/sh
- -c
- >-
exec vllm serve /mnt/models
--served-model-name=qwen3-coder
--tensor-parallel-size=8
--pipeline-parallel-size=2
--distributed-executor-backend=mp
--nnodes=2 --node-rank=0
--master-addr=$(MODELPLANE_LEADER_ADDRESS)
--max-model-len=32768
--gpu-memory-utilization=0.92
--enable-auto-tool-choice
--tool-call-parser=qwen3_xml
--port=8000
- role: Worker
worker:
nodes: 1
nodeSelector:
devices:
- name: gpu
count: 8
selectors:
- cel: |
device.driver == "gpu.nvidia.com" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
- name: efa
count: 16
selectors:
- cel: |
device.driver == "dra.net"
template:
spec:
containers:
- name: engine
image: vllm/vllm-openai:v0.23.0
env:
- name: NCCL_DEBUG
value: "INFO"
- name: FI_PROVIDER
value: "efa"
command:
- /bin/sh
- -c
- >-
exec vllm serve /mnt/models
--served-model-name=qwen3-coder
--tensor-parallel-size=8
--pipeline-parallel-size=2
--distributed-executor-backend=mp
--nnodes=2 --node-rank=1
--master-addr=$(MODELPLANE_LEADER_ADDRESS)
--headless
--max-model-len=32768
--gpu-memory-utilization=0.92
model-service.yaml
# Exposes the multi-node BF16 qwen3-coder deployment as a single
# OpenAI-compatible URL. Read the public address from status.address:
# kubectl get ms qwen3-coder -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
name: qwen3-coder
namespace: ml-team
spec:
endpoints:
- selector:
matchLabels:
modelplane.ai/deployment: qwen3-coder
model-deployment-fp8.yaml
# Qwen3-Coder-480B served FP8 on a single 8x H200 node with SGLang, validated
# end to end on EKS. The FP8 checkpoint (~480 GB) fits one node, so this is a
# single Standalone engine: no second node, no EFA, no ModelCache. The engine
# pulls the public FP8 repo straight to the node's local disk.
#
# SGLang-specific notes:
# --ep-size 8 is required, not optional. Pure --tp-size 8 fails at FP8 weight
# creation ("output_size ... not divisible by ... block_n = 128"): the
# block-FP8 MoE doesn't shard evenly across 8 tensor-parallel ranks. Expert
# parallelism shards whole experts and gets past it.
# --tool-call-parser qwen3_coder is SGLang's parser name for this model
# (vLLM's is qwen3_xml). The model is non-thinking, so no reasoning parser.
# Image tag matters: lmsysorg/sglang v0.5.11-v0.5.13(.post1) -runtime images
# are broken (ModuleNotFoundError: distro). v0.5.10.post1-runtime is the
# most recent clean tag with Qwen3-Coder support.
# --host 0.0.0.0 --port 8000: SGLang defaults to 127.0.0.1:30000, but
# Modelplane's contract is 0.0.0.0:8000 with a /health probe. Args pass
# through verbatim - Modelplane injects nothing for a non-vLLM engine.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
name: qwen3-coder-sgl
namespace: ml-team
spec:
replicas: 1
clusterSelector:
matchLabels:
modelplane.ai/region: us
engines:
- name: qwen3-coder-sgl
members:
- role: Standalone
nodeSelector:
devices:
- name: gpu
count: 8
selectors:
- cel: |
device.driver == "gpu.nvidia.com" && device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("120Gi")) >= 0
template:
spec:
containers:
- name: engine
image: lmsysorg/sglang:v0.5.10.post1-runtime
command:
- /bin/sh
- -c
- >-
exec python3 -m sglang.launch_server
--model-path Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
--served-model-name qwen3-coder
--tp-size 8
--ep-size 8
--context-length 32768
--page-size 32
--trust-remote-code
--tool-call-parser qwen3_coder
--host 0.0.0.0
--port 8000
model-service-fp8.yaml
# Exposes the single-node FP8 qwen3-coder-sgl deployment as a single
# OpenAI-compatible URL. Read the public address from status.address:
# kubectl get ms qwen3-coder-sgl -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
name: qwen3-coder-sgl
namespace: ml-team
spec:
endpoints:
- selector:
matchLabels:
modelplane.ai/deployment: qwen3-coder-sgl