rtp-llm
Alibaba’s high-performance LLM inference engine (CUDA-focused) for production serving of diverse decoder architectures.
Why it is included
Listed under the #llm tag in TAAFT’s repository listings as Alibaba’s open-source, serving-oriented inference stack.
Best for
GPU inference teams evaluating alternatives to vLLM/Triton for datacenter LLM APIs.
Strengths
- Serving-oriented: built for production inference rather than research experimentation
- Active Alibaba maintenance
Limitations
- Primarily targets NVIDIA CUDA GPUs; operational guidance is less widely documented than vLLM’s
Good alternatives
vLLM · TensorRT-LLM · SGLang
Related tools
AI & Machine Learning
vLLM
High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
AI & Machine Learning
TensorRT-LLM
NVIDIA TensorRT–based library for optimized LLM inference on GPUs with multi-GPU and speculative decoding features.
AI & Machine Learning
SGLang
Structured generation language for fast serving: RadixAttention, constrained decoding, and multi-turn batching for frontier-class workloads.
AI & Machine Learning
NVIDIA Triton Inference Server
Multi-framework inference server for TensorRT, ONNX, PyTorch, Python backends—dynamic batching, ensembles, and GPU sharing.
AI & Machine Learning
MNN
Alibaba’s lightweight inference engine for mobile and edge—used for on-device LLMs and classic CV models with aggressive optimization.
AI & Machine Learning
KVPress
NVIDIA research-oriented toolkit for LLM KV-cache compression to stretch context within fixed VRAM budgets.
