vLLM: High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
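Because the server speaks the OpenAI-compatible API, any standard HTTP client can talk to it. A minimal sketch of building a chat-completions request with only the standard library; the base URL and model id are assumptions, so adjust them to your deployment (the request is constructed but not sent here):

```python
import json
import urllib.request

# Assumed default address of a locally running OpenAI-compatible server.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model id
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize PagedAttention in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.2,
}

# Build the POST request against the standard chat-completions route.
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here since no server is running.
print(req.full_url)
```

The same payload works with the official OpenAI client libraries by pointing their base URL at the server.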
SGLang: Structured generation language and fast serving runtime, featuring RadixAttention, constrained decoding, and multi-turn batching for frontier-class workloads.
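Constrained decoding restricts each sampling step to tokens that keep the output inside a target format. The toy sketch below is not this engine's implementation; it shows the core idea with a hand-built position-wise constraint (a phone-like `ddd-dddd` shape) and a fake scoring function standing in for model logits, all names being illustrative:

```python
import random

# Toy "grammar": one set of allowed characters per output position (ddd-dddd).
DIGITS = set("0123456789")
TEMPLATE = [DIGITS, DIGITS, DIGITS, {"-"}, DIGITS, DIGITS, DIGITS, DIGITS]

VOCAB = list("0123456789-abc")  # tiny character-level "vocabulary"

def fake_logits(prefix):
    # Stand-in for a language model: pseudo-random but deterministic scores
    # per vocabulary entry, conditioned on the prefix generated so far.
    rng = random.Random(hash(prefix) % (2**32))
    return {tok: rng.random() for tok in VOCAB}

def constrained_decode(template):
    out = []
    for allowed in template:
        scores = fake_logits("".join(out))
        # Mask step: keep only tokens the constraint permits at this position,
        # then pick greedily among the survivors.
        viable = {t: s for t, s in scores.items() if t in allowed}
        out.append(max(viable, key=viable.get))
    return "".join(out)

result = constrained_decode(TEMPLATE)
print(result)  # always conforms to the ddd-dddd shape
```

Production systems compile the constraint (a regex or grammar) into an automaton and apply the mask to real logits, but the mask-then-sample loop is the same.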
Ray: Distributed compute framework for Python that scales data loading, training, hyperparameter search, and online serving (via Ray Serve).
NVIDIA Triton Inference Server: Multi-framework inference server with TensorRT, ONNX, PyTorch, and Python backends; supports dynamic batching, model ensembles, and GPU sharing.
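Dynamic batching is enabled per model in its `config.pbtxt`. A minimal sketch; the model name, backend, and batch sizes below are illustrative, not prescriptive:

```
name: "resnet50_onnx"          # illustrative model name
platform: "onnxruntime_onnx"   # one of the supported backends
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

The server holds individual requests for up to the queue delay to assemble a preferred batch size, trading a small latency increase for higher GPU utilization.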
BentoML: Unified model serving and deployment toolkit for packaging models as APIs, shipping them to Kubernetes, and managing runtimes.
Alibaba’s high-performance LLM inference engine (CUDA-focused) for production serving of diverse decoder architectures.
TensorFlow Serving: Flexible, high-performance serving system for TensorFlow (and other) models, with model versioning, request batching, and gRPC/REST APIs.
