Local LLM runner and model library with simple CLI and API for workstation inference.
Plain C/C++ inference for LLaMA-class models with broad community backends.
High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
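Because the server speaks the OpenAI wire format, any HTTP client can talk to it. A minimal stdlib-only sketch of the request shape, assuming a server is running locally on port 8000 (the model name and port are placeholders, not recommendations):

```python
import json
import urllib.request

# Sketch of an OpenAI-compatible /v1/chat/completions request. The model
# name below is an assumption: use whatever model the server actually loaded.
payload = {
    "model": "my-local-model",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
    "temperature": 0.0,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once a server is actually running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request shape works against any OpenAI-compatible endpoint, which is why several tools in this list interoperate.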
Structured generation language for fast serving: RadixAttention, constrained decoding, and multi-turn batching for frontier-class workloads.
Unified OpenAI-compatible proxy and SDK for 100+ model providers (local, cloud, Bedrock, Azure) with budgets, fallbacks, and logging.
Apple MLX-based LLM inference and training on Apple silicon: efficient Metal-backed transformers and examples for local chat models.
Single-file distributable combining LLM weights with a llama.cpp runtime: run large models from one executable across operating systems, on CPU or GPU.
Universal deployment stack compiling models to Vulkan, Metal, CUDA, and WebGPU via TVM/Unity for phones, browsers, and servers.
Memory-efficient CUDA inference kernels for quantized Llama-class models—popular in consumer GPU chat UIs.
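The core trick behind such kernels is group quantization: each small group of weights shares one float scale, and the weights themselves are stored as 4-bit integers. A pure-Python toy sketch of the idea (real kernels pack two 4-bit values per byte and dequantize on the fly inside the matmul):

```python
def quantize_group(weights, group_size=4):
    """Quantize a flat list of floats to (scales, int4 codes) per group."""
    scales, codes = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # One shared scale per group; 7 is the max magnitude of a signed int4.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        codes.append([max(-8, min(7, round(w / scale))) for w in group])
    return scales, codes

def dequantize(scales, codes):
    return [c * s for s, group in zip(scales, codes) for c in group]

w = [0.10, -0.32, 0.05, 0.21, 1.4, -0.7, 0.0, 0.9]
scales, codes = quantize_group(w)
w_hat = dequantize(scales, codes)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The reconstruction error is bounded by half the group scale, which is why smaller groups (at the cost of more scales) give better fidelity.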
NVIDIA TensorRT–based library for optimized LLM inference on GPUs with multi-GPU and speculative decoding features.
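Speculative decoding lets a cheap draft model propose several tokens that the expensive target model verifies in one pass. A toy greedy-variant sketch (real systems accept or reject probabilistically; both "models" here are made-up deterministic functions, not anything from TensorRT-LLM):

```python
def draft_model(prefix, k):
    # Assumption: a toy drafter that just counts upward mod 10.
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out

def target_model(prefix):
    # Assumption: a toy target that also counts upward, but "disagrees"
    # after seeing the token 5.
    last = prefix[-1]
    return 0 if last == 5 else (last + 1) % 10

def speculative_step(prefix, k=4):
    """Accept the drafted tokens up to the first target disagreement."""
    proposals = draft_model(prefix, k)
    accepted = []
    for tok in proposals:
        expected = target_model(prefix + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)  # take the target's token and stop
            break
    return accepted
```

When the draft agrees with the target, one verification pass yields several tokens, which is where the speedup comes from.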
YAML-configured fine-tuning for LLMs: LoRA, QLoRA, FSDP, and many architectures on top of Hugging Face trainers.
Optimized fine-tuning library claiming 2× faster LoRA/QLoRA with less VRAM via custom kernels and Hugging Face compatibility.
Meta’s Llama family of open **weights** (subject to the Llama license) with reference code, tooling, and downloads via the `meta-llama` org on Hugging Face.
Mistral’s open-weight checkpoints (e.g. 7B era, Mixtral MoE) and Apache-2.0–licensed **code** alongside proprietary flagship lines—verify each checkpoint.
Alibaba’s Qwen family (dense and MoE) with strong multilingual and coding variants; weights and code on Hugging Face under stated licenses per release.
DeepSeek open-weight models (e.g. V3/R1 lineage) with MIT or custom terms per release—high capability coding and reasoning checkpoints.
Google’s smaller open-**weights** Gemma line (Gemma 2/3, etc.) under the Gemma license terms, plus `gemma.cpp` for lightweight CPU inference.

Small language model family (Phi-3/4 lineage) emphasizing strong quality per parameter; weights on Hugging Face under Microsoft licenses per release.
Technology Innovation Institute’s Falcon open weights (7B–180B era), with Apache-2.0-licensed weights for many releases—landmark UAE-led open model line.
RNN-meets-transformer linear-attention LM architecture running with O(n) memory—unique open line for long-context and embedded inference.
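The reason such models run in O(n) memory is that linear attention can be computed as a recurrence: keep a running state S = Σ k_t ⊗ v_t and a normalizer z = Σ k_t, so each new token costs O(1) in sequence length. A heavily simplified pure-Python sketch of that family of ideas (not RWKV's actual formulation, which adds time decay and gating):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def outer_add(S, k, v):
    # S += k ⊗ v, accumulated in place.
    for i in range(len(k)):
        for j in range(len(v)):
            S[i][j] += k[i] * v[j]

def linear_attention(qkv):
    """qkv: list of (q, k, v) vectors per token; k entries assumed positive."""
    d = len(qkv[0][0])
    S = [[0.0] * d for _ in range(d)]  # running sum of k ⊗ v
    z = [0.0] * d                      # running sum of k
    outputs = []
    for q, k, v in qkv:
        outer_add(S, k, v)
        z = [zi + ki for zi, ki in zip(z, k)]
        num = [dot(q, [S[i][j] for i in range(d)]) for j in range(d)]
        outputs.append([n / dot(q, z) for n in num])
    return outputs
```

Each output is a convex combination of past values weighted by k·q, but the whole history is folded into the fixed-size state S, never re-scanned.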
01.AI Yi open-weight bilingual models (EN/ZH focus) with Apache-2.0 or Yi license per checkpoint on Hugging Face.
1.1B-parameter Llama-architecture model trained on ~3T tokens—Apache-2.0 weights for fast experiments and teaching.
Allen AI fully open LLM **pipeline**: weights, training code, data mixes, and evaluation—research transparency flagship.
BigScience 176B multilingual causal LM—landmark collaborative open training effort on Jean Zay (weights under BigScience Responsible AI License).
EleutherAI framework and 20B-class models for training large autoregressive LMs with 3D parallelism—Apache-2.0 training stack.
Hugging Face TB small LM family (135M–1.7B) with Apache-2.0 weights, aimed at strong quality per parameter for on-device and edge use.
OpenAI’s open-weight GPT-OSS checkpoints (e.g. 20B, 120B) hosted on Hugging Face for local inference and fine-tuning.
Historic decoder-only LM family (124M–1.5B) under `openai-community` on the Hub—still a default tutorial and pipeline test target.
Meta’s Open Pretrained Transformer suite (125M–175B) released with reproducible logbooks—canonical Hub org `facebook` / `facebook/opt-*`.
Early open chat models fine-tuned from Llama-class bases by LMSYS—widely mirrored on the Hub (e.g. Vicuna-7B v1.5).
Z.ai GLM-5–generation checkpoints (e.g. FP8 builds) distributed on the Hub for text generation and agent-style use cases.
EleutherAI’s public scaling suite: matched GPT-NeoX–architecture models from 70M–12B with public datasets for interpretability research.
Alibaba’s Qwen2.5 Coder 7B instruct checkpoint on Hugging Face—optimized for code completion, synthesis, and tooling workflows.
Apple’s OpenELM family—openly released efficient language models with layer-wise scaling and Hub-hosted instruct variants.
NVIDIA Nemotron 3 open model checkpoints (dense and MoE) on Hugging Face for reasoning, coding, and agentic workloads at scale.
BigScience instruction-tuned BLOOM derivatives (e.g. BLOOMZ-560M–176B) for multilingual zero-shot instruction following on the Hub.
Data framework for LLM applications: ingestion, indexing, retrieval, and agents over documents and APIs.
Open-source embedding database focused on developer ergonomics for LLM apps: local dev, server mode, and simple APIs.
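Under the hood an embedding database stores (id, vector, document) triples and answers nearest-neighbor queries. A toy in-memory stand-in using cosine similarity (this is a sketch of the concept, not Chroma's actual API):

```python
import math

class TinyVectorStore:
    def __init__(self):
        self.items = []  # (id, vector, document)

    def add(self, item_id, vector, document):
        self.items.append((item_id, vector, document))

    def query(self, vector, top_k=1):
        def cosine(a, b):
            return dot_product(a, b) / (math.hypot(*a) * math.hypot(*b))
        ranked = sorted(self.items, key=lambda it: cosine(vector, it[1]),
                        reverse=True)
        return ranked[:top_k]

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

store = TinyVectorStore()
store.add("a", [1.0, 0.0], "cats and dogs")
store.add("b", [0.0, 1.0], "quarterly earnings")
best = store.query([0.9, 0.1], top_k=1)[0]
```

Real stores add persistence, metadata filtering, and approximate-nearest-neighbor indexes so queries stay fast beyond a few thousand vectors.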
Parameter-efficient fine-tuning methods (LoRA, adapters, prompt tuning) integrated with Transformers models.
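The LoRA idea in these methods is to keep the frozen weight W and add a trainable low-rank update (α/r)·BA, so only A and B are updated. A tiny pure-Python sketch of the forward pass (real implementations apply this inside attention and MLP projections):

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x), with A of shape r x d_in."""
    r = len(A)                       # rank = number of rows of A
    base = matvec(W, x)              # frozen path: W x
    delta = matvec(B, matvec(A, x))  # low-rank path: B (A x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (toy identity)
A = [[1.0, 1.0]]               # r = 1, so A is 1 x 2
B = [[0.5], [0.0]]             # B is 2 x 1
y = lora_forward(W, A, B, [2.0, 3.0])
```

With rank r much smaller than the weight dimensions, the trainable parameter count drops by orders of magnitude, which is the whole point of the method.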
Transformer Reinforcement Learning: train LLMs with RLHF, DPO, ORPO, and related preference optimization recipes.
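Of these recipes, DPO has the simplest core: a logistic loss on the log-probability margin between chosen and rejected responses, measured relative to a frozen reference model. A sketch on a single preference pair using the standard formula (the log-prob numbers are made up for illustration):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more than the reference does: low loss.
low = dpo_loss(-10.0, -14.0, -11.0, -12.0)   # margin = +3
# Policy prefers the rejected response: higher loss.
high = dpo_loss(-14.0, -10.0, -12.0, -11.0)  # margin = -3
```

Because the reference model anchors the margin, the policy is rewarded for moving toward the preferred response without drifting arbitrarily far from its starting point.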
Hugging Face library for large shared datasets: memory mapping, streaming, Arrow-backed columns, and Hub integration.
Alibaba’s lightweight inference engine for mobile and edge—used for on-device LLMs and classic CV models with aggressive optimization.
Alibaba’s high-performance LLM inference engine (CUDA-focused) for production serving of diverse decoder architectures.
NVIDIA research-oriented toolkit for LLM KV-cache compression to stretch context within fixed VRAM budgets.
Open-source Svelte/TypeScript app that powers HuggingChat—multi-model chat, tools, and self-hostable UI patterns.
Curated recipes and code for aligning language models (preference optimization, DPO-style flows) on open stacks.
Rust LSP server that plugs LLM-backed completions into editors—designed to pair with local or API models.
Google library to extract structured fields from unstructured text with LLMs, source grounding, and visualization helpers.
ByteDance open agent harness for long-horizon research, coding, and creation with tools, memory, and subagents.
OpenAI’s MIT-licensed Python kit for multi-agent workflows, handoffs, guardrails, and tracing with the Responses API.
DeepSeek Janus series: unified multimodal understanding and generation models with MIT-licensed research code.
Framework for building LLM applications with chains, tools, and agents.
Open toolkit for browser automation driven by LLM agents.
LLM red-teaming framework for jailbreak and prompt-injection testing.
