
llm-d — Kubernetes-Native LLM Inference

llm-d is a Kubernetes-native, high-performance distributed LLM inference framework. It provides the fastest time-to-value and competitive performance per dollar for serving language models across diverse hardware accelerators.


Why llm-d for AI-native Research?

LLM inference platforms are ideal proving grounds for the AI-native approach. They are:

  • Complex and multi-component — schedulers, KV caches, model servers, routing layers
  • Under constant pressure — new models, new hardware, shifting traffic patterns
  • Multi-objective — latency, throughput, cost, and fairness must be traded off
  • Observable — rich telemetry from every request through the pipeline

Key Capabilities

  • Inference scheduling — intelligent request routing and scheduling (see the sketch after this list)
  • KV-cache optimization — hierarchical offloading, cache-aware LoRA routing
  • Prefill/Decode disaggregation — running the compute-bound prefill phase and the memory-bandwidth-bound decode phase on separate workers
  • Wide Expert Parallelism — load balancing across MoE experts
  • Scale-to-zero autoscaling — resource-efficient scaling
  • Resilient networking with UCCL — reliable distributed communication
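
To make the first two capabilities concrete, here is a minimal Python sketch of cache-aware routing: score each replica by how much of the prompt's KV cache it already holds, penalized by its current load. All names and weights here are illustrative assumptions, not llm-d's actual API; the real inference scheduler composes much richer filters and scorers.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_depth: int           # requests currently waiting on this replica
    cached_prefix_tokens: int  # tokens of this prompt already in the replica's KV cache

def score(replica: Replica, prompt_tokens: int,
          cache_weight: float = 1.0, load_weight: float = 0.5) -> float:
    """Higher is better: reward KV-cache prefix reuse, penalize queue depth."""
    cache_hit_ratio = replica.cached_prefix_tokens / max(prompt_tokens, 1)
    return cache_weight * cache_hit_ratio - load_weight * replica.queue_depth

def route(replicas: list[Replica], prompt_tokens: int) -> Replica:
    """Pick the best-scoring replica for this request."""
    return max(replicas, key=lambda r: score(r, prompt_tokens))

pool = [
    Replica("pod-a", queue_depth=4, cached_prefix_tokens=900),
    Replica("pod-b", queue_depth=1, cached_prefix_tokens=0),
]
print(route(pool, prompt_tokens=1000).name)  # -> pod-b: the cache hit doesn't outweigh the queue
```

The tradeoff lives in the weights: chasing cache hits too aggressively concentrates load on a few replicas, so the load penalty keeps routing balanced.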

AI-Native Opportunities

In the AI-native vision, the inference platform continuously observes its own behavior — request latencies, queue depths, cache hit rates, GPU utilization — and uses this telemetry to drive improvements. These improvements span:

  • Configuration tuning — batch sizes, scheduling policies, cache parameters (sketched below)
  • Code changes — algorithmic improvements to routing, scheduling, or memory management
  • Structural evolution — new components, refactored interfaces, optimized pipelines
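
As a toy illustration of the configuration-tuning loop, the sketch below backs off a max-batch-size knob when observed p95 latency breaches a target and probes upward when there is headroom. The metric source and the knob are assumptions for illustration, not llm-d's configuration surface.

```python
import random  # stands in for a real metrics client in this sketch

TARGET_P95_LATENCY_S = 2.0  # assumed service-level objective

def read_p95_latency() -> float:
    """Placeholder for querying Prometheus/OpenTelemetry for p95 request latency."""
    return random.uniform(1.0, 3.0)

def tune_batch_size(current: int, p95: float,
                    min_size: int = 1, max_size: int = 256) -> int:
    """Back off multiplicatively on an SLO breach, probe upward additively otherwise."""
    if p95 > TARGET_P95_LATENCY_S:
        return max(min_size, current // 2)
    return min(max_size, current + 8)

batch_size = 64
for _ in range(5):  # one iteration per observation window
    p95 = read_p95_latency()
    batch_size = tune_batch_size(batch_size, p95)
    print(f"p95={p95:.2f}s -> max_batch_size={batch_size}")
```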

The Controlling System reasons about multi-objective tradeoffs and experiments with alternatives before deploying validated changes.
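
One simple way such a controller might compare alternatives is to scalarize the objectives with operator-chosen weights, as in the sketch below. The candidates and numbers are invented, and a production controller might well prefer Pareto-front analysis over a single weighted score.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    p95_latency_s: float   # lower is better
    throughput_rps: float  # higher is better
    cost_per_hour: float   # lower is better

def utility(c: Candidate,
            w_latency: float = 1.0, w_throughput: float = 0.01, w_cost: float = 0.1) -> float:
    """Collapse the three objectives into one score; the weights encode operator priorities."""
    return (w_throughput * c.throughput_rps
            - w_latency * c.p95_latency_s
            - w_cost * c.cost_per_hour)

candidates = [
    Candidate("baseline",      p95_latency_s=1.8, throughput_rps=120.0, cost_per_hour=40.0),
    Candidate("disaggregated", p95_latency_s=1.2, throughput_rps=150.0, cost_per_hour=46.0),
]
print(max(candidates, key=utility).name)  # -> disaggregated under these weights
```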

