llm-d — Kubernetes-Native LLM Inference
llm-d is a Kubernetes-native, high-performance distributed LLM inference framework designed for fast time-to-value and competitive performance per dollar when serving large language models across diverse hardware accelerators.
Why llm-d for AI-Native Research?
LLM inference platforms are ideal proving grounds for the AI-native approach. They are:
- Complex and multi-component — schedulers, KV caches, model servers, routing layers
- Under constant pressure — new models, new hardware, shifting traffic patterns
- Multi-objective — latency, throughput, cost, and fairness must be traded off
- Observable — rich telemetry from every request through the pipeline
Key Capabilities
- Inference scheduling — intelligent request routing and scheduling (see the routing sketch after this list)
- KV-cache optimization — hierarchical offloading, cache-aware LoRA routing
- Prefill/Decode disaggregation — running the prefill and decode phases on separate workers for efficiency
- Wide Expert Parallelism — load balancing across the experts of Mixture-of-Experts (MoE) models
- Scale-to-zero autoscaling — resource-efficient scaling
- Resilient networking with UCCL — reliable distributed communication
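To make the routing idea concrete, here is a minimal, self-contained sketch in Python. It is not llm-d's actual scheduler API: the `Replica` type, `score_replica`, and `pick_replica` names are hypothetical, and the scoring rule simply rewards KV-cache prefix reuse while penalizing queue depth.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_depth: int           # requests currently waiting on this replica
    cached_prefix_tokens: int  # longest request prefix already in its KV cache

def score_replica(replica: Replica, prompt_tokens: int,
                  cache_weight: float = 1.0, queue_weight: float = 0.5) -> float:
    """Higher is better: reward KV-cache prefix reuse, penalize long queues."""
    cache_hit_ratio = min(replica.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    return cache_weight * cache_hit_ratio - queue_weight * replica.queue_depth

def pick_replica(replicas: list[Replica], prompt_tokens: int) -> Replica:
    """Route the request to the highest-scoring replica."""
    return max(replicas, key=lambda r: score_replica(r, prompt_tokens))

if __name__ == "__main__":
    pool = [
        Replica("pod-a", queue_depth=4, cached_prefix_tokens=2048),
        Replica("pod-b", queue_depth=1, cached_prefix_tokens=0),
    ]
    print(pick_replica(pool, prompt_tokens=4096).name)
```

A production scheduler would weigh many more signals (prefill vs. decode load, LoRA adapter placement, hardware class), but the shape of the decision is the same: score each candidate replica from live telemetry and pick the best.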
AI-Native Opportunities
In the AI-native vision, the inference platform continuously observes its own behavior — request latencies, queue depths, cache hit rates, GPU utilization — and uses this telemetry to drive improvements. These improvements span:
- Configuration tuning — batch sizes, scheduling policies, cache parameters (a feedback-rule sketch follows this list)
- Code changes — algorithmic improvements to routing, scheduling, or memory management
- Structural evolution — new components, refactored interfaces, optimized pipelines
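As an illustration of configuration tuning driven by telemetry, the following is a hedged sketch rather than anything llm-d ships: `tune_max_batch_size` is a hypothetical feedback rule that grows the batch size while p95 latency has headroom and backs off when the target is violated.

```python
def tune_max_batch_size(current: int, observed_p95_ms: float,
                        target_p95_ms: float = 500.0,
                        lower: int = 1, upper: int = 256) -> int:
    """One step of a simple feedback rule over observed request latency."""
    if observed_p95_ms > target_p95_ms:
        return max(lower, current // 2)   # back off quickly under pressure
    return min(upper, current + 8)        # probe for more throughput slowly

# Example: latency is over target, so the batch size is halved.
assert tune_max_batch_size(64, observed_p95_ms=720.0) == 32
```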
The Controlling System reasons about multi-objective tradeoffs and experiments with alternatives before deploying validated changes.
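One simple way such a pre-deployment check could look, purely as an illustrative sketch (the `Metrics` fields and the `acceptable` guardrail are assumptions, not llm-d's validation logic): compare a candidate configuration against the current baseline and accept it only if no objective regresses beyond a tolerance.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    p95_latency_ms: float           # lower is better
    tokens_per_second: float        # higher is better
    cost_per_million_tokens: float  # lower is better

def acceptable(candidate: Metrics, baseline: Metrics, tolerance: float = 0.05) -> bool:
    """Accept a candidate only if no objective regresses by more than `tolerance`."""
    return (
        candidate.p95_latency_ms <= baseline.p95_latency_ms * (1 + tolerance)
        and candidate.tokens_per_second >= baseline.tokens_per_second * (1 - tolerance)
        and candidate.cost_per_million_tokens <= baseline.cost_per_million_tokens * (1 + tolerance)
    )

baseline = Metrics(480.0, 12_000.0, 0.42)
candidate = Metrics(470.0, 13_500.0, 0.40)
assert acceptable(candidate, baseline)
```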
Resources
- Website: llm-d.ai
- GitHub: github.com/llm-d