Collecting engine metrics
Scraping an inference engine’s Prometheus metrics, shown on the smallest serving
shape: a 0.5B Qwen chat model on one NVIDIA L4. vLLM publishes metrics at
/metrics on its serving port with no extra flag, and Modelplane runs a
Prometheus on every workload cluster with PodMonitor discovery open across
namespaces, so scraping the engine is a PodMonitor plus a port-forward. The
model is only the subject; the same wiring fits any engine, with the SGLang,
leader/worker, and prefill/decode differences noted at the end.
This was run end to end on GKE. The InferenceClass and ModelDeployment are the
exact manifests from that run, and the PodMonitor below scraped this deployment.
Apply the platform side first, then the ML side. The GKE InferenceCluster
carries a GCP project placeholder to edit before applying.
Platform
# InferenceClass for the L4 shape on GKE, validated serving Qwen2.5-0.5B.
#
# One NVIDIA L4 on a g2-standard-8. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod. A 0.5B model uses a sliver of
# this L4 - the shape is shared with the larger L4 examples, not sized for it.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
name: gke-l4-1x-g2
spec:
description: "GKE g2-standard-8, 1x NVIDIA L4"
provisioning:
provider: GKE
gke:
machineType: g2-standard-8
diskSizeGb: 100
accelerator:
type: nvidia-l4
count: 1
devices:
- name: gpu
claim: DRA
driver: gpu.nvidia.com
deviceClassName: gpu.nvidia.com
count: 1
attributes:
architecture: { string: Ada Lovelace }
capacity:
# The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
# nominal 24GB.
memory: { value: "23034Mi" }
# GKE InferenceCluster with one L4 node pool. Replace the project ID before
# applying. No clusterSelector targets it; the ModelDeployment matches on device
# capacity alone, so it lands here or on any other compatible cluster.
#
# Modelplane installs an in-cluster Prometheus on this cluster (the monitoring
# namespace) with open PodMonitor discovery - the metrics section of the example
# scrapes the engine through it.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
name: gke-l4-single
labels:
modelplane.ai/cloud: gke
modelplane.ai/region: us-central
spec:
cluster:
source: GKE
gke:
project: my-gcp-project # Replace with your GCP project ID.
region: us-central1
nodePools:
- name: gpu-l4
className: gke-l4-1x-g2
nodeCount: 1
zones:
- us-central1-a
minNodeCount: 0
maxNodeCount: 4
curl -fsSL https://main.docs.modelplane.ai/examples/examples/collecting-engine-metrics/inference-cluster.yaml \
| sed 's/my-gcp-project//' \
| kubectl apply -f -Deployment
# Qwen2.5-0.5B-Instruct served on a single NVIDIA L4 by vLLM, validated end to
# end on GKE.
#
# A 0.5B dense model is the smallest useful serving shape: one Standalone vLLM
# pod, no ModelCache, weights pulled straight from Hugging Face. It barely touches
# the L4's VRAM, so the flags here are about behavior, not fit:
#
# --max-model-len=16384 caps the context; the default 32K KV cache is wasteful
# for a model this size and a demo this small.
# --served-model-name the id clients pass as "model" in OpenAI requests.
#
# vLLM exposes Prometheus metrics at /metrics on its serving port (:8000) with no
# extra flag, which is what the example's PodMonitor scrapes.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
name: qwen2-5-0-5b
namespace: ml-team
spec:
# One replica, matched to any compatible InferenceCluster by device capacity.
replicas: 1
engines:
- name: qwen2-5-0-5b
members:
# A single self-contained vLLM pod. The container named "engine" is the
# inference server; its image and args pass through verbatim.
- role: Standalone
nodeSelector:
devices:
- name: gpu
count: 1
selectors:
# The model needs almost no VRAM; >=20Gi simply pins it to the L4 pool
# this example provisions (the L4 reports ~23Gi). DRA evaluates this CEL
# against the InferenceClass device, then against the GPU's
# ResourceSlice when it binds the claim.
- cel: |
device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
template:
spec:
containers:
- name: engine
image: vllm/vllm-openai:v0.23.0
args:
- "--model=Qwen/Qwen2.5-0.5B-Instruct"
- "--served-model-name=qwen2.5-0.5b"
- "--max-model-len=16384"
# Exposes the qwen2-5-0-5b deployment's endpoints as a single OpenAI-compatible
# URL. Modelplane labels each composed ModelEndpoint with the deployment name, so
# this selector reaches every replica. Read the public address from status.address:
# kubectl get ms qwen2-5-0-5b -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
name: qwen2-5-0-5b
namespace: ml-team
spec:
endpoints:
- selector:
matchLabels:
modelplane.ai/deployment: qwen2-5-0-5b
Scraping the metrics
The PodMonitor selects engine pods by the modelplane.ai/serving label
Modelplane stamps on them, and the monitoring namespace Prometheus discovers any
PodMonitor, so this is the whole config. The engine container port is unnamed,
so reference it by number with targetPort:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: qwen2-5-0-5b-metrics
namespace: default
spec:
selector:
matchExpressions:
- key: modelplane.ai/serving # carried by every serving pod
operator: Exists
podMetricsEndpoints:
- targetPort: 8000
path: /metrics
interval: 30sThe engine pods and the PodMonitor CRD live on the workload cluster, not the
control plane, so apply it there. Then read the metrics from the in-cluster
Prometheus over a port-forward:
kubectl apply -f podmonitor.yaml # workload cluster
kubectl -n monitoring port-forward svc/prometheus-prometheus 9090:9090
# open http://localhost:9090, Status > Targets to confirm the scrape, then query
# e.g. vllm:num_requests_running or vllm:gpu_cache_usage_percOther engine shapes
The PodMonitor above fits a single-pod vLLM engine. The selector and port shift
by shape:
- SGLang: exposes
/metricsonly when the engine runs with--enable-metrics; otherwise it’s identical (same selector,targetPort: 8000). - Leader/worker: only the leader serves the API and carries
modelplane.ai/serving, so the selector above already scrapes the leader alone; the workers expose nothing. - prefill/decode: two engines, labelled
llm-d.ai/role: prefillandllm-d.ai/role: decode. The prefill engine serves on8000; the decode engine sits behind the routing sidecar that takes8000and listens on8001, so scrape decode withtargetPort: 8001. Select each by its role label to keep them apart.