vLLM
High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
Why it is included
Production-grade open stack for serving Hugging Face–style models with strong throughput defaults.
Best for
Teams self-hosting chat/completions APIs on NVIDIA GPUs, with growing support for other accelerators.
Strengths
- PagedAttention for efficient KV-cache memory use
- OpenAI-compatible API surface
- Broad model support
Limitations
- GPU-centric; operational complexity grows at multi-node scale
Good alternatives
SGLang · TensorRT-LLM · llama.cpp
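Because vLLM speaks the OpenAI API shape, existing clients can target it by pointing at the server's base URL (port 8000 by default when launched with `vllm serve`). A minimal sketch of building a `/v1/chat/completions` request; the model id here is illustrative, swap in whatever model your server is running:

```python
import json

# Assumed local vLLM endpoint; the port is configurable at launch.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model, messages, max_tokens=128, temperature=0.7):
    """Build a request body in the OpenAI chat-completions shape."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request(
    "meta-llama/Llama-3.1-8B-Instruct",  # example model id
    [{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
)
# POST this JSON to f"{BASE_URL}/chat/completions" with any HTTP client.
print(json.dumps(payload, indent=2))
```

The official `openai` Python client also works against such a server: set `base_url` to the server address and pass a placeholder `api_key` if no auth is configured.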
Related tools
AI & Machine Learning
llama.cpp
Plain C/C++ inference for LLaMA-class models with broad community backends.
Ollama
Local LLM runner and model library with simple CLI and API for workstation inference.
Hugging Face Transformers
State-of-the-art pretrained models for PyTorch, TensorFlow, and JAX.
SGLang
Structured generation language for fast serving: RadixAttention, constrained decoding, and multi-turn batching for frontier-class workloads.
rtp-llm
Alibaba’s high-performance LLM inference engine (CUDA-focused) for production serving of diverse decoder architectures.
TensorRT-LLM
NVIDIA TensorRT–based library for optimized LLM inference on GPUs with multi-GPU and speculative decoding features.
