vLLM: High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
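Because the server speaks the OpenAI-compatible API, any standard HTTP client can talk to it. A minimal sketch of building a chat-completions request with only the standard library; the base URL and model id are assumptions, so adjust them to your deployment (the request is constructed but not sent here):

```python
import json
import urllib.request

# Assumed default address of a locally running OpenAI-compatible server.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model id
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize PagedAttention in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.2,
}

# Build the POST request against the standard chat-completions route.
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here since no server is running.
print(req.full_url)
```

The same payload works with the official OpenAI client libraries by pointing their base URL at the server.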
SGLang: Structured generation language and fast serving runtime, featuring RadixAttention, constrained decoding, and multi-turn batching for frontier-class workloads.
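Constrained decoding restricts each sampling step to tokens that keep the output inside a target format. The toy sketch below is not this engine's implementation; it shows the core idea with a hand-built position-wise constraint (a phone-like `ddd-dddd` shape) and a fake scoring function standing in for model logits, all names being illustrative:

```python
import random

# Toy "grammar": one set of allowed characters per output position (ddd-dddd).
DIGITS = set("0123456789")
TEMPLATE = [DIGITS, DIGITS, DIGITS, {"-"}, DIGITS, DIGITS, DIGITS, DIGITS]

VOCAB = list("0123456789-abc")  # tiny character-level "vocabulary"

def fake_logits(prefix):
    # Stand-in for a language model: pseudo-random but deterministic scores
    # per vocabulary entry, conditioned on the prefix generated so far.
    rng = random.Random(hash(prefix) % (2**32))
    return {tok: rng.random() for tok in VOCAB}

def constrained_decode(template):
    out = []
    for allowed in template:
        scores = fake_logits("".join(out))
        # Mask step: keep only tokens the constraint permits at this position,
        # then pick greedily among the survivors.
        viable = {t: s for t, s in scores.items() if t in allowed}
        out.append(max(viable, key=viable.get))
    return "".join(out)

result = constrained_decode(TEMPLATE)
print(result)  # always conforms to the ddd-dddd shape
```

Production systems compile the constraint (a regex or grammar) into an automaton and apply the mask to real logits, but the mask-then-sample loop is the same.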
Ray: Distributed compute framework for Python that scales data loading, training, hyperparameter search, and online serving (via Ray Serve).
NVIDIA Triton Inference Server: Multi-framework inference server with TensorRT, ONNX, PyTorch, and Python backends; supports dynamic batching, model ensembles, and GPU sharing.
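Dynamic batching is enabled per model in its `config.pbtxt`. A minimal sketch; the model name, backend, and batch sizes below are illustrative, not prescriptive:

```
name: "resnet50_onnx"          # illustrative model name
platform: "onnxruntime_onnx"   # one of the supported backends
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

The server holds individual requests for up to the queue delay to assemble a preferred batch size, trading a small latency increase for higher GPU utilization.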
BentoML: Unified model serving and deployment toolkit for packaging models as APIs, shipping them to Kubernetes, and managing runtimes.
Alibaba’s high-performance LLM inference engine (CUDA-focused) for production serving of diverse decoder architectures.
TensorFlow Serving: Flexible, high-performance serving system for TensorFlow (and other) models, with model versioning, request batching, and gRPC/REST APIs.
