llama.cpp
Plain C/C++ inference for LLaMA-class models with broad community backends.
Why it is included
Reference-quality local inference stack powering countless GUIs and servers.
Best for
Embedding LLMs into apps, edge devices, and research sandboxes.
If you use Windows, Mac, or paid tools
A local alternative to the OpenAI, Anthropic, and Google cloud APIs: run models on your own hardware instead of calling hosted endpoints.
Strengths
- Performance-focused C/C++ core with SIMD and GPU offload
- Rich GGUF quantization ecosystem for fitting models into limited memory
- Broad hardware support: CPUs plus CUDA, Metal, Vulkan, and other backends
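A typical build-quantize-run workflow looks roughly like the sketch below. This is a hedged example: the model paths and filenames are placeholders, and binary names or flags may differ across llama.cpp releases.

```shell
# Build from source (CMake is the project's build system)
cmake -B build
cmake --build build --config Release

# Quantize an f16 GGUF to 4-bit to cut memory use (Q4_K_M is a common choice;
# ./models/model-f16.gguf is a placeholder path)
./build/bin/llama-quantize ./models/model-f16.gguf ./models/model-q4_k_m.gguf Q4_K_M

# Run a one-shot prompt locally with the quantized model
./build/bin/llama-cli -m ./models/model-q4_k_m.gguf -p "Explain quantization briefly." -n 128
```

Quantization is what makes the "edge devices" use case practical: a 4-bit quantized model needs roughly a quarter of the memory of its f16 original, at a modest quality cost.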
Limitations
- Ships no model weights; you must comply with each model's license
Good alternatives
Ollama · MLC
Related tools
AI & Machine Learning
PyTorch
Deep learning framework with strong research-to-production paths.
Ollama
Local LLM runner and model library with simple CLI and API for workstation inference.
MLX LM
Apple MLX-based LLM inference and training on Apple silicon: efficient Metal-backed transformers and examples for local chat models.
llamafile
Single-file distributable LLM weights + llama.cpp runtime: run large models from one executable with broad OS CPU/GPU support.
ExLlamaV2
Memory-efficient CUDA inference kernels for quantized Llama-class models—popular in consumer GPU chat UIs.
vLLM
High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
