Ollama
Local LLM runner and model library with a simple CLI and HTTP API for workstation inference.
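A minimal sketch of calling the local REST API from Python; the model name "llama3" is an example and must already be pulled:

    import requests

    # Ollama listens on localhost:11434 by default; "llama3" is an example
    # model name and must already be pulled (e.g. `ollama pull llama3`).
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    )
    print(resp.json()["response"])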
llama.cpp
Plain C/C++ inference for LLaMA-class models with a broad set of community-maintained hardware backends.
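A minimal sketch using the community llama-cpp-python bindings (not the C API itself); the GGUF path is a placeholder:

    from llama_cpp import Llama

    # The quantized GGUF path is a placeholder; any local GGUF model works.
    llm = Llama(model_path="models/llama-7b.Q4_K_M.gguf", n_ctx=2048)
    out = llm("Q: What is llama.cpp? A:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])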
vLLM
High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
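A minimal sketch of the offline (non-server) Python API; the Hub model ID is an example:

    from vllm import LLM, SamplingParams

    # Offline batch inference; the model ID is an example from the HF Hub.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["The capital of France is"], params)
    print(outputs[0].outputs[0].text)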
SGLang
Structured generation language for fast model serving: RadixAttention prefix caching, constrained decoding, and multi-turn batching for demanding workloads.
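A minimal sketch of the frontend language, assuming an SGLang server is already running locally (e.g. via `python -m sglang.launch_server`):

    import sglang as sgl

    # Point the frontend at a running SGLang server; port 30000 is the
    # usual default in the docs but depends on how the server was launched.
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

    @sgl.function
    def qa(s, question):
        s += sgl.user(question)
        s += sgl.assistant(sgl.gen("answer", max_tokens=64))

    state = qa.run(question="What is RadixAttention?")
    print(state["answer"])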
mlx-lm
Apple MLX-based LLM inference and training on Apple silicon, with efficient Metal-backed transformers and examples for local chat models.
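A minimal sketch using the mlx-lm Python package; the 4-bit community model ID is an example:

    from mlx_lm import load, generate

    # The quantized community model ID is an example; any MLX-format model works.
    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    print(generate(model, tokenizer, prompt="Hello, Apple silicon!", max_tokens=64))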
llamafile
Single-file distributable that bundles LLM weights with a llama.cpp runtime: run large models from one executable with broad OS, CPU, and GPU support.
ExLlamaV2
Memory-efficient CUDA inference kernels for quantized Llama-class models, popular in consumer GPU chat UIs.
TensorRT-LLM
NVIDIA TensorRT-based library for optimized LLM inference on GPUs, with multi-GPU support and speculative decoding.
ONNX Runtime
Cross-platform inference accelerator for ONNX models: CPU, GPU, and mobile execution providers with graph optimizations.
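A minimal sketch of the Python API; the model path and input shape are placeholders:

    import numpy as np
    import onnxruntime as ort

    # "model.onnx" and the 1x3x224x224 input shape are placeholders.
    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    name = sess.get_inputs()[0].name
    outputs = sess.run(None, {name: np.zeros((1, 3, 224, 224), dtype=np.float32)})
    print([o.shape for o in outputs])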
OpenVINO
Intel's toolkit for optimizing and deploying deep learning models on Intel CPUs, GPUs, and NPUs, with model conversion and runtime APIs.
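A minimal sketch of the Python runtime API; the IR path and input shape are placeholders:

    import numpy as np
    import openvino as ov

    # "model.xml" (OpenVINO IR) and the input shape are placeholders.
    core = ov.Core()
    compiled = core.compile_model(core.read_model("model.xml"), "CPU")
    result = compiled([np.zeros((1, 3, 224, 224), dtype=np.float32)])
    print(result[compiled.output(0)].shape)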
Triton Inference Server
Multi-framework inference server supporting TensorRT, ONNX, PyTorch, and Python backends, with dynamic batching, model ensembles, and GPU sharing.
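A minimal sketch using the HTTP client library; the model and tensor names are hypothetical and must match the deployed model's configuration:

    import numpy as np
    import tritonclient.http as httpclient

    # "resnet" and the tensor names "input__0"/"output__0" are hypothetical;
    # they must match the deployed model's config.pbtxt.
    client = httpclient.InferenceServerClient(url="localhost:8000")
    x = np.zeros((1, 3, 224, 224), dtype=np.float32)
    inp = httpclient.InferInput("input__0", list(x.shape), "FP32")
    inp.set_data_from_numpy(x)
    result = client.infer("resnet", inputs=[inp])
    print(result.as_numpy("output__0").shape)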
faster-whisper
CTranslate2 reimplementation of Whisper for faster CPU/GPU inference with lower memory use than the reference PyTorch implementation.
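A minimal sketch of the Python API; the model size and audio path are placeholders:

    from faster_whisper import WhisperModel

    # "base" and "audio.wav" are placeholders; larger models need more memory.
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, info = model.transcribe("audio.wav")
    for seg in segments:
        print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")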
MNN
Alibaba's lightweight inference engine for mobile and edge devices, used for on-device LLMs and classic CV models with aggressive optimization.
DashInfer
Alibaba's high-performance, CUDA-focused LLM inference engine for production serving of diverse decoder architectures.
kvpress
NVIDIA's research-oriented toolkit for LLM KV-cache compression, stretching context length within fixed VRAM budgets.
TensorFlow Serving
Flexible, high-performance serving system for TensorFlow (and related) models with model versioning, batching, and gRPC/REST APIs.
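A minimal sketch of the REST predict API from Python; the served model name ("half_plus_two", the stock demo model) is an example:

    import requests

    # Assumes TF Serving's default REST port (8501) and a served model named
    # "half_plus_two"; both are examples.
    url = "http://localhost:8501/v1/models/half_plus_two:predict"
    resp = requests.post(url, json={"instances": [[1.0], [2.0], [3.0]]})
    print(resp.json()["predictions"])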
huggingface.js
TypeScript/JavaScript libraries to call the Hugging Face Inference API, manage Hub assets, and build browser or Node.js AI features.
Text Embeddings Inference
Rust-based, high-throughput server for sentence-transformers-class embedding models with GPU and CPU backends.
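A minimal sketch of calling the /embed endpoint from Python, assuming a TEI server listening on localhost:8080:

    import requests

    # Assumes a TEI instance serving an embedding model on port 8080.
    resp = requests.post(
        "http://localhost:8080/embed",
        json={"inputs": "What is Deep Learning?"},
    )
    print(len(resp.json()[0]))  # dimensionality of the single returned embedding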
