Collecting engine metrics

This page shows how to scrape an inference engine’s Prometheus metrics on the smallest serving shape: a 0.5B Qwen chat model on one NVIDIA L4. vLLM publishes metrics at /metrics on its serving port with no extra flag, and Modelplane runs a Prometheus instance on every workload cluster with PodMonitor discovery open across namespaces, so scraping the engine takes only a PodMonitor and a port-forward. The model is incidental; the same wiring fits any engine, with the SGLang, leader/worker, and prefill/decode differences noted at the end.

This was run end to end on GKE. The InferenceClass and ModelDeployment are the exact manifests from that run, and the PodMonitor below scraped this deployment. Apply the platform side first, then the ML side. The GKE InferenceCluster carries a GCP project placeholder to edit before applying.

Platform

# InferenceClass for the L4 shape on GKE, validated serving Qwen2.5-0.5B.
#
# One NVIDIA L4 on a g2-standard-8. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod. A 0.5B model uses a sliver of
# this L4 - the shape is shared with the larger L4 examples, not sized for it.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-l4-1x-g2
spec:
  description: "GKE g2-standard-8, 1x NVIDIA L4"
  provisioning:
    provider: GKE
    gke:
      machineType: g2-standard-8
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
      # nominal 24GB.
      memory: { value: "23034Mi" }

# GKE InferenceCluster with one L4 node pool. Replace the project ID before
# applying. No clusterSelector targets it; the ModelDeployment matches on device
# capacity alone, so it lands here or on any other compatible cluster.
#
# Modelplane installs an in-cluster Prometheus on this cluster (the monitoring
# namespace) with open PodMonitor discovery - the metrics section of the example
# scrapes the engine through it.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gke-l4-single
  labels:
    modelplane.ai/cloud: gke
    modelplane.ai/region: us-central
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project  # Replace with your GCP project ID.
      region: us-central1
  nodePools:
  - name: gpu-l4
    className: gke-l4-1x-g2
    nodeCount: 1
    zones:
    - us-central1-a
    minNodeCount: 0
    maxNodeCount: 4

curl -fsSL https://main.docs.modelplane.ai/examples/examples/collecting-engine-metrics/inference-cluster.yaml \
  | sed 's/my-gcp-project/YOUR_PROJECT_ID/' \
  | kubectl apply -f -

Deployment

# Qwen2.5-0.5B-Instruct served on a single NVIDIA L4 by vLLM, validated end to
# end on GKE.
#
# A 0.5B dense model is the smallest useful serving shape: one Standalone vLLM
# pod, no ModelCache, weights pulled straight from Hugging Face. It barely touches
# the L4's VRAM, so the flags here are about behavior, not fit:
#
#   --max-model-len=16384   caps the context; the default 32K KV cache is wasteful
#                           for a model this size and a demo this small.
#   --served-model-name     the id clients pass as "model" in OpenAI requests.
#
# vLLM exposes Prometheus metrics at /metrics on its serving port (:8000) with no
# extra flag, which is what the example's PodMonitor scrapes.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen2-5-0-5b
  namespace: ml-team
spec:
  # One replica, matched to any compatible InferenceCluster by device capacity.
  replicas: 1
  engines:
  - name: qwen2-5-0-5b
    members:
    # A single self-contained vLLM pod. The container named "engine" is the
    # inference server; its image and args pass through verbatim.
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # The model needs almost no VRAM; >=20Gi simply pins it to the L4 pool
          # this example provisions (the L4 reports ~23Gi). DRA evaluates this CEL
          # against the InferenceClass device, then against the GPU's
          # ResourceSlice when it binds the claim.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - "--model=Qwen/Qwen2.5-0.5B-Instruct"
            - "--served-model-name=qwen2.5-0.5b"
            - "--max-model-len=16384"

# Exposes the qwen2-5-0-5b deployment's endpoints as a single OpenAI-compatible
# URL. Modelplane labels each composed ModelEndpoint with the deployment name, so
# this selector reaches every replica. Read the public address from status.address:
#   kubectl get ms qwen2-5-0-5b -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen2-5-0-5b
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen2-5-0-5b

Scraping the metrics

The PodMonitor selects engine pods by the modelplane.ai/serving label that Modelplane stamps on them, and the Prometheus in the monitoring namespace discovers PodMonitors in any namespace, so this is the whole config. The engine's container port is unnamed, so reference it by number with targetPort:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: qwen2-5-0-5b-metrics
  namespace: default
spec:
  selector:
    matchExpressions:
    - key: modelplane.ai/serving   # carried by every serving pod
      operator: Exists
  podMetricsEndpoints:
  - targetPort: 8000
    path: /metrics
    interval: 30s

The engine pods and the PodMonitor CRD live on the workload cluster, not the control plane, so apply it there. Then read the metrics from the in-cluster Prometheus over a port-forward:

kubectl apply -f podmonitor.yaml                                  # workload cluster
kubectl -n monitoring port-forward svc/prometheus-prometheus 9090:9090
# open http://localhost:9090, Status > Targets to confirm the scrape, then query
# e.g. vllm:num_requests_running or vllm:gpu_cache_usage_perc
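
If Status > Targets shows no target for the engine, one possible cause is namespace scope. Under the Prometheus Operator, a PodMonitor with no namespaceSelector matches pods only in its own namespace. The variant below is a sketch, not part of the validated run: it assumes the engine pods on your workload cluster run in a namespace other than the PodMonitor's, which this page does not state either way. It reuses the same name so that applying it replaces the PodMonitor above instead of adding a second scrape.

```yaml
# Hypothetical variant of the PodMonitor above. Only namespaceSelector is new;
# it is a standard PodMonitor field, but whether it is needed here is an
# assumption about where the engine pods are placed.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: qwen2-5-0-5b-metrics
  namespace: default
spec:
  namespaceSelector:
    any: true                      # match serving pods in every namespace
  selector:
    matchExpressions:
    - key: modelplane.ai/serving   # carried by every serving pod
      operator: Exists
  podMetricsEndpoints:
  - targetPort: 8000
    path: /metrics
    interval: 30s
```

To scope the scrape more tightly, namespaceSelector also accepts matchNames with an explicit list of namespaces in place of any: true.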

Other engine shapes

The PodMonitor above fits a single-pod vLLM engine. The selector and port shift by shape:

SGLang: exposes /metrics only when the engine runs with --enable-metrics; otherwise the PodMonitor is identical (same selector, targetPort: 8000).
Leader/worker: only the leader serves the API and carries modelplane.ai/serving, so the selector above already scrapes the leader alone; the workers expose nothing.
Prefill/decode: two engines, labelled llm-d.ai/role: prefill and llm-d.ai/role: decode. The prefill engine serves on 8000. The decode engine sits behind a routing sidecar that takes 8000, so the engine itself listens on 8001; scrape decode with targetPort: 8001. Select each by its role label to keep them apart.
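
For the prefill/decode shape, the split described above could be written as two PodMonitors, one per role. This is a minimal sketch that was not run as part of this example: the role labels and ports are the ones given above, while the metadata names and the default namespace are placeholders chosen for illustration.

```yaml
# Prefill engine: serves the API and /metrics on 8000.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: engine-prefill-metrics     # hypothetical name
  namespace: default               # placeholder; use the pods' namespace
spec:
  selector:
    matchLabels:
      llm-d.ai/role: prefill
  podMetricsEndpoints:
  - targetPort: 8000
    path: /metrics
    interval: 30s
---
# Decode engine: the routing sidecar takes 8000, so the engine is scraped on 8001.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: engine-decode-metrics      # hypothetical name
  namespace: default               # placeholder; use the pods' namespace
spec:
  selector:
    matchLabels:
      llm-d.ai/role: decode
  podMetricsEndpoints:
  - targetPort: 8001
    path: /metrics
    interval: 30s
```

Keeping the two roles in separate PodMonitors gives each its own scrape job in Prometheus, which makes it easier to tell prefill and decode series apart when querying.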