Qwen3-8B

An 8.2B dense chat model on a single NVIDIA L4. The smallest recipe: one Standalone engine, no cache, weights pulled straight from Hugging Face.

This recipe was run end to end; the InferenceClass and ModelDeployment are the exact manifests from that run. Apply the platform side first, then the ML side.

Platform

# InferenceClass for the L4 shape, validated serving Qwen3-8B on EKS.
#
# One NVIDIA L4 on an EKS g6.xlarge. The single GPU is a claim: DRA device;
# the scheduler matches a ModelDeployment's nodeSelector against its declared
# capacity and DRA binds it to the serving pod.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
 name: eks-l4-1x-g6
spec:
 description: "EKS g6.xlarge, 1x NVIDIA L4"
 provisioning:
 provider: EKS
 eks:
 instanceType: g6.xlarge
 diskSizeGb: 100
 accelerator:
 type: nvidia-l4
 count: 1
 devices:
 - name: gpu
 claim: DRA
 driver: gpu.nvidia.com
 deviceClassName: gpu.nvidia.com
 count: 1
 attributes:
 architecture: { string: Ada Lovelace }
 capacity:
 # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
 # nominal 24GB.
 memory: { value: "23034Mi" }

# An EKS InferenceCluster with one L4 node pool, labeled for the
# ModelDeployment's clusterSelector to target.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
 name: eks-l4
 labels:
 modelplane.ai/region: us
spec:
 cluster:
 source: EKS
 eks:
 region: us-west-2
 nodePools:
 - name: gpu-l4
 className: eks-l4-1x-g6
 nodeCount: 1
 minNodeCount: 1
 maxNodeCount: 1
 zones:
 - us-west-2a

Deployment

# Qwen3-8B served on a single NVIDIA L4, validated end to end on EKS.
#
# An 8.2B dense model is a single Standalone engine: one self-contained vLLM
# pod, no ModelCache, weights pulled straight from Hugging Face. The flags carry
# real meaning beyond fit:
#
# --tool-call-parser=hermes the parser for Qwen3 dense (qwen3_xml is
# for Qwen3-Coder, not this model). Qwen3's
# tool-use template ships in the tokenizer,
# so no --chat-template is needed.
# --reasoning-parser=qwen3 with
# --default-chat-template-kwargs turns thinking off. Qwen3 thinks by
# default, burying a one-line answer under a
# <think> block and forbidding greedy decode.
# --max-model-len / --gpu-memory-utilization L4 fit, not correctness.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
 name: qwen3-8b
 namespace: ml-team
spec:
 replicas: 1
 clusterSelector:
 matchLabels:
 modelplane.ai/region: us
 engines:
 - name: qwen3-8b
 members:
 - role: Standalone
 nodeSelector:
 devices:
 - name: gpu
 count: 1
 selectors:
 - cel: |
 device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
 template:
 spec:
 containers:
 - name: engine
 image: vllm/vllm-openai:v0.23.0
 args:
 - "--model=Qwen/Qwen3-8B"
 - "--served-model-name=qwen"
 - "--max-model-len=16384"
 - "--gpu-memory-utilization=0.92"
 - "--reasoning-parser=qwen3"
 - "--default-chat-template-kwargs={\"enable_thinking\": false}"
 - "--enable-auto-tool-choice"
 - "--tool-call-parser=hermes"

# Exposes the qwen3-8b deployment's endpoints as a single OpenAI-compatible URL.
# Modelplane labels each composed ModelEndpoint with the deployment name, so this
# selector reaches every replica. Read the public address from status.address:
# kubectl get ms qwen3-8b -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
 name: qwen3-8b
 namespace: ml-team
spec:
 endpoints:
 - selector:
 matchLabels:
 modelplane.ai/deployment: qwen3-8b

Qwen3-Coder-480B

A 480B code MoE (35B active). Two validated shapes: the BF16 weights span two H200 nodes as a gang over EFA, served from a ModelCache; the FP8 checkpoint fits one node, so it runs as a single Standalone engine on SGLang with no cache.

Both shapes were run end to end; the InferenceClass and ModelDeployment are the exact manifests from those runs. Apply the platform side first, then the ML side. The InferenceCluster carries an EC2 capacity reservation placeholder to edit before applying.

Kimi-K2

A 1T MoE (1 trillion parameters) served prefill/decode disaggregated across two H200 nodes: two engines, one per phase, with Modelplane composing the llm-d routing layer between them. This recipe serves an INT4 quantization of the model; the native FP8 weights need four such nodes.

This recipe was run end to end; the InferenceClass and ModelDeployment are the exact manifests from that run. Apply the platform side first, then the ML side. The InferenceCluster carries an EC2 capacity reservation placeholder to edit before applying.

Llama-3.1-8B

An 8B dense chat model on a single NVIDIA L4. The entry recipe: one Standalone engine, no cache, public weights from a Hugging Face mirror. It carries no clusterSelector, so device capacity alone matches it to any compatible L4 in the fleet.