Llama-3.1-8B
On this page
An 8B dense chat model on a single NVIDIA L4. The entry recipe: one Standalone
engine, no cache, public weights from a Hugging Face mirror. It carries no
clusterSelector, so device capacity alone matches it to any compatible L4 in
the fleet.
This recipe was run end to end on GKE; the InferenceClass, InferenceCluster,
and ModelDeployment are the exact manifests from that run. The EKS platform
shape is the standard single-L4 recipe. It passes server validation but was not
served in this run. Apply the platform side first, then the ML side. The GKE
InferenceCluster carries a GCP project placeholder to edit before applying.
Platform
inference-class-eks.yaml
# InferenceClass for the L4 shape on EKS, validated serving Llama-3.1-8B.
#
# One NVIDIA L4 on a g6.2xlarge. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
name: eks-l4-1x-g6
spec:
description: "EKS g6.2xlarge, 1x NVIDIA L4"
provisioning:
provider: EKS
eks:
instanceType: g6.2xlarge
diskSizeGb: 100
accelerator:
type: nvidia-l4
count: 1
devices:
- name: gpu
claim: DRA
driver: gpu.nvidia.com
deviceClassName: gpu.nvidia.com
count: 1
attributes:
architecture: { string: Ada Lovelace }
capacity:
# The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
# nominal 24GB.
memory: { value: "23034Mi" }
inference-cluster-eks.yaml
# EKS InferenceCluster with one L4 node pool. No clusterSelector targets it; the
# ModelDeployment matches on device capacity alone, so it lands here or on any
# other compatible cluster in the fleet.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
name: eks-l4-single
labels:
modelplane.ai/cloud: eks
modelplane.ai/region: us-west
spec:
cluster:
source: EKS
eks:
region: us-west-2
nodePools:
- name: gpu-l4
className: eks-l4-1x-g6
nodeCount: 1
zones:
- us-west-2a
minNodeCount: 0
maxNodeCount: 4
inference-class-gke.yaml
# InferenceClass for the L4 shape on GKE, validated serving Llama-3.1-8B.
#
# One NVIDIA L4 on a g2-standard-8. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
name: gke-l4-1x-g2
spec:
description: "GKE g2-standard-8, 1x NVIDIA L4"
provisioning:
provider: GKE
gke:
machineType: g2-standard-8
diskSizeGb: 100
accelerator:
type: nvidia-l4
count: 1
devices:
- name: gpu
claim: DRA
driver: gpu.nvidia.com
deviceClassName: gpu.nvidia.com
count: 1
attributes:
architecture: { string: Ada Lovelace }
capacity:
# The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
# nominal 24GB.
memory: { value: "23034Mi" }
inference-cluster-gke.yaml
# GKE InferenceCluster with one L4 node pool. Replace the project ID before
# applying. No clusterSelector targets it; the ModelDeployment matches on device
# capacity alone, so it lands here or on any other compatible cluster.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
name: gke-l4-single
labels:
modelplane.ai/cloud: gke
modelplane.ai/region: us-central
spec:
cluster:
source: GKE
gke:
project: my-gcp-project # Replace with your GCP project ID.
region: us-central1
nodePools:
- name: gpu-l4
className: gke-l4-1x-g2
nodeCount: 1
zones:
- us-central1-a
minNodeCount: 0
maxNodeCount: 4
bash
curl -fsSL https://docs.modelplane.ai/examples/examples/llama-3.1-8b/inference-cluster-gke.yaml \
| sed 's/my-gcp-project//' \
| kubectl apply -f -Deployment
model-deployment.yaml
# Llama-3.1-8B Instruct served on a single NVIDIA L4 by vLLM, validated end to
# end on GKE (the model layer is cloud-agnostic; the same manifest serves on EKS).
#
# 8B in bf16 is ~16Gi of weights, leaving room for the KV cache on the L4's
# ~23Gi. Llama's default context is 128K, whose KV cache does not fit beside the
# weights, so --max-model-len caps it at 8192 - raise it only as far as the
# leftover VRAM allows.
#
# Weights come from the public NousResearch mirror, so no Hugging Face token is
# needed. The gated meta-llama/Llama-3.1-8B-Instruct original needs an hf-token
# Secret on the *workload* cluster (the engine pod reads it, not the control
# plane) and HF_TOKEN passed on the engine container.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
name: llama-3-1-8b
namespace: ml-team
spec:
# One replica, matched to any compatible InferenceCluster by device capacity.
replicas: 1
engines:
- name: llama
members:
# A single self-contained vLLM pod. The container named "engine" is the
# inference server; its image and args pass through verbatim.
- role: Standalone
nodeSelector:
devices:
- name: gpu
count: 1
selectors:
# An 8B model needs most of an L4. >=20Gi selects the L4 (which
# reports ~23Gi) without over-constraining. DRA evaluates this CEL
# against the InferenceClass device, then against the GPU's
# ResourceSlice when it binds the claim.
- cel: |
device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
template:
spec:
containers:
- name: engine
image: vllm/vllm-openai:v0.7.3
args:
- "--model=NousResearch/Meta-Llama-3.1-8B-Instruct"
# The id clients pass as "model" in OpenAI requests.
- "--served-model-name=llama-3.1-8b"
# Cap the context so the KV cache fits beside the weights on the L4.
- "--max-model-len=8192"
model-service.yaml
# Exposes the llama-3-1-8b deployment's endpoints as a single OpenAI-compatible
# URL. Modelplane labels each composed ModelEndpoint with the deployment name, so
# this selector reaches every replica. Read the public address from
# status.address:
# kubectl get ms llama-3-1-8b -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
name: llama-3-1-8b
namespace: ml-team
spec:
endpoints:
- selector:
matchLabels:
modelplane.ai/deployment: llama-3-1-8b