OpenCatalog · curated by FLOSSK

Browse & filter

Filter by platform, license, maturity, maintenance cadence, and editorial tags such as privacy-focused or self-hosted. Search matches names, summaries, tags, and use cases.

18 tools match your filters

Local LLM runner and model library with a simple CLI and API for workstation inference.

llm · local · inference
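
If this runner follows the common local-runner pattern, a one-shot generation is a single HTTP call. A minimal sketch, assuming an Ollama-style REST endpoint on localhost:11434 and a placeholder model name (both are assumptions, not stated in the entry):

```python
import json
import urllib.request

# Assumes an Ollama-style runner listening on localhost:11434; the endpoint,
# port, and model name ("llama3") are assumptions, not taken from the catalog.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",
        "prompt": "Why is the sky blue?",
        "stream": False,  # return one JSON object instead of a token stream
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```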

High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.

llm · inference · serving · gpu · api
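
The OpenAI-compatible API means any OpenAI client can talk to the server. A minimal sketch with the openai Python package; the base URL, port, and model id are placeholder assumptions:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server; the URL, port,
# and model id below are placeholders for a real deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one line."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```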

Structured generation language for fast serving: RadixAttention, constrained decoding, and multi-turn batching for frontier-class workloads.

llm · inference · serving · gpu · structured-output
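
Constrained decoding is typically driven by attaching a schema to an otherwise ordinary chat request. A hypothetical sketch over an OpenAI-style endpoint; the URL, model name, and the response_format constraint field are illustrative assumptions (the exact field varies by server):

```python
import json
import urllib.request

# Hypothetical structured-output request; URL, port, model name, and the
# response_format field are assumptions, not taken from the catalog entry.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}
req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps({
        "model": "demo-model",
        "messages": [{"role": "user", "content": "Largest city in Kosovo, as JSON."}],
        "response_format": {"type": "json_schema",
                            "json_schema": {"name": "city", "schema": schema}},
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```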

Apple MLX-based LLM inference and training on Apple silicon: efficient Metal-backed transformers and examples for local chat models.

llm · apple-silicon · inference · metal · local
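
For the Python side of MLX-based LLM inference, the mlx-lm package is the usual entry point (an assumption, since the entry names only MLX); the model repo id below is a placeholder:

```python
from mlx_lm import load, generate

# Assumes the entry refers to the mlx-lm package; the repo id is an
# illustrative placeholder. Weights run through MLX's Metal backend.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
text = generate(model, tokenizer, prompt="Explain Metal shaders briefly.",
                max_tokens=64)
print(text)
```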

Single-file distributable LLM weights plus a llama.cpp runtime: run large models from one executable with broad OS, CPU, and GPU support.

llm · local · inference · portable
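
Because everything ships in one executable, driving it can be as simple as a subprocess call. A hedged sketch; the filename and the llama.cpp-style -p / -n flags are illustrative, so check the binary's --help:

```python
import subprocess

# Hypothetical launch of a single-file model executable; the filename and
# flags are illustrative (llama.cpp-style runtimes accept -p for a prompt
# and -n for a token count, but verify against the binary's --help).
out = subprocess.run(
    ["./model.llamafile", "-p", "Say hello in Albanian.", "-n", "32"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
```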

Memory-efficient CUDA inference kernels for quantized Llama-class models; popular in consumer GPU chat UIs.

llm · inference · cuda · quantization · local
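
The appeal of quantized kernels is easy to see with back-of-envelope weight-memory arithmetic (general math, not any library's accounting):

```python
# Weight memory is roughly parameters * bits / 8 bytes.
params = 7e9  # a 7B-parameter model
for bits in (16, 8, 4):
    gib = params * bits / 8 / 2**30
    print(f"{bits}-bit weights: {gib:.1f} GiB")
# 4-bit quantization fits a 7B model's weights in about 3.3 GiB, which is
# why these kernels target consumer GPUs (KV cache and activations add more).
```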

NVIDIA TensorRT–based library for optimized LLM inference on GPUs with multi-GPU and speculative decoding features.

llm · inference · nvidia · tensorrt · gpu
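
Recent TensorRT-LLM versions expose a high-level Python LLM API; a hedged sketch, assuming that API and a placeholder model id (the first run builds the TensorRT engine):

```python
from tensorrt_llm import LLM, SamplingParams

# Assumes TensorRT-LLM's high-level Python LLM API; the model id is a
# placeholder, and engine compilation happens on first load.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(["What does speculative decoding speed up?"],
                       SamplingParams(max_tokens=48))
print(outputs[0].outputs[0].text)
```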

Cross-platform inference accelerator for ONNX models: CPU, GPU, and mobile execution providers with graph optimizations.

inference · onnx · deployment · optimization
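
A minimal session with ONNX Runtime's Python API (assuming that is the tool meant here); the model path and input shape are placeholders, and execution providers fall back left to right:

```python
import numpy as np
import onnxruntime as ort

# Model path and input shape are placeholders; providers are tried in order,
# falling back to CPU if CUDA is unavailable.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {sess.get_inputs()[0].name: x})
print(outputs[0].shape)
```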

Intel toolkit to optimize and deploy deep learning models on Intel CPUs, GPUs, and NPUs, with model conversion and runtime APIs.

inference · intel · edge · optimization
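
A minimal sketch of the OpenVINO-style read/compile/infer flow (assuming this entry refers to OpenVINO); the IR path and input shape are placeholders:

```python
import numpy as np
import openvino as ov

# IR path and input shape are placeholders; the device string can be
# "CPU", "GPU", "NPU", or "AUTO".
core = ov.Core()
model = core.read_model("model.xml")         # reads model.xml + model.bin
compiled = core.compile_model(model, "CPU")
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled([x])                       # run one inference
print(result[compiled.output(0)].shape)
```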

CTranslate2 reimplementation of Whisper for faster CPU/GPU inference with lower memory use than the reference PyTorch implementation.

speech · asr · inference · optimization
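
A minimal transcription sketch, assuming this entry refers to the faster-whisper package; the audio file and model size are placeholders:

```python
from faster_whisper import WhisperModel

# Model size, device, and audio file are placeholders; int8 keeps
# memory use low on CPU.
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("meeting.wav", beam_size=5)
print(f"detected language: {info.language}")
for seg in segments:  # segments is a lazy generator
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```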

Alibaba’s lightweight inference engine for mobile and edge, used for on-device LLMs and classic CV models with aggressive optimization.

inference · edge · mobile · llm · taaft-repositories

Alibaba’s high-performance LLM inference engine (CUDA-focused) for production serving of diverse decoder architectures.

llm · inference · serving · gpu · taaft-repositories

NVIDIA research-oriented toolkit for LLM KV-cache compression to stretch context within fixed VRAM budgets.

llm · kv-cache · compression · inference · taaft-repositories
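
Why KV-cache compression matters falls out of simple per-token arithmetic (general math with assumed Llama-like shapes, not the toolkit's API):

```python
# Per token, a decoder stores one K and one V vector per layer.
layers, kv_heads, head_dim = 32, 8, 128   # Llama-3-8B-like shape (assumed)
bytes_per = 2                             # fp16
seq_len = 32_768
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per * seq_len
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # ~4.0 GiB at 32K context
# Compressing this cache 2-4x lets a fixed VRAM budget hold proportionally
# longer contexts or more concurrent sequences.
```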

Flexible, high-performance serving system for TensorFlow (and related) models with versioning, batching, and gRPC/REST APIs.

serving · tensorflow · inference · grpc · taaft-repositories
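
The REST side is a single predict endpoint per model. A minimal sketch against TensorFlow Serving's documented REST API; host, port, and the stock half_plus_two example model are assumptions about the deployment:

```python
import json
import urllib.request

# Host, port, and model name are deployment assumptions; half_plus_two is
# TensorFlow Serving's stock example model (y = x / 2 + 2).
req = urllib.request.Request(
    "http://localhost:8501/v1/models/half_plus_two:predict",
    data=json.dumps({"instances": [1.0, 2.0, 5.0]}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["predictions"])  # [1.5, 2.0, 4.5]
```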

TypeScript/JavaScript libraries to call the Inference API, manage Hub assets, and build browser or Node.js AI features.

huggingface · javascript · typescript · inference · taaft-repositories
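
The libraries wrap the hosted Inference API; for a one-language view, here is the raw HTTP equivalent in Python (the model id is a placeholder choice and a valid token must be set in HF_TOKEN):

```python
import json
import os
import urllib.request

# Raw HTTP call to the hosted Inference API that the JS/TS libraries wrap;
# the model id is a placeholder and HF_TOKEN must hold a valid token.
req = urllib.request.Request(
    "https://api-inference.huggingface.co/models/"
    "distilbert-base-uncased-finetuned-sst-2-english",
    data=json.dumps({"inputs": "OpenCatalog is delightful."}).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```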