KVPress
NVIDIA's research-oriented toolkit for LLM KV-cache compression, letting you stretch context length within a fixed VRAM budget.
Why it is included
Surfaced on TAAFT’s #llm repository tag as an Apache-2.0 KV-cache compression project.
Best for
Experimenters reducing memory footprint of long-context Transformer inference.
Strengths
- Focused problem
- Composable with HF-style stacks
Limitations
- Research-grade; validate quality loss per method and model
Good alternatives
PagedAttention tuning · Quantized KV · Sliding-window models
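KVPress ships multiple pruning strategies ("presses") that drop low-importance entries from the KV cache. As a rough illustration of the underlying idea only, here is a minimal sketch of attention-score-based KV pruning in NumPy; the function name, shapes, and scoring rule are hypothetical and are not KVPress's actual API:

```python
import numpy as np

def prune_kv_cache(keys, values, attn_weights, compression_ratio=0.5):
    """Sketch: keep the cached KV pairs that received the most attention.

    keys, values: (seq_len, head_dim) cached tensors for a single head.
    attn_weights: (num_queries, seq_len) attention probabilities.
    compression_ratio: fraction of cache entries to discard.
    """
    seq_len = keys.shape[0]
    keep = max(1, int(round(seq_len * (1 - compression_ratio))))
    # Score each cached position by the total attention mass it received.
    scores = attn_weights.sum(axis=0)
    # Top-`keep` positions, restored to their original temporal order.
    idx = np.sort(np.argsort(scores)[-keep:])
    return keys[idx], values[idx]
```

With a compression ratio of 0.5, an 8-entry cache shrinks to 4 entries, halving per-head KV memory at the cost of whatever signal the dropped positions carried; this is why the limitation above (validate quality loss per method and model) matters.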
Related tools
AI & Machine Learning
vLLM
High-throughput LLM serving with PagedAttention, continuous batching, and OpenAI-compatible APIs for GPU clusters.
Hugging Face Transformers
State-of-the-art pretrained models for PyTorch, TensorFlow, and JAX.
MNN
Alibaba’s lightweight inference engine for mobile and edge devices, used for on-device LLMs and classic CV models with aggressive optimization.
rtp-llm
Alibaba’s high-performance LLM inference engine (CUDA-focused) for production serving of diverse decoder architectures.
Ollama
Local LLM runner and model library with simple CLI and API for workstation inference.
llama.cpp
Plain C/C++ inference for LLaMA-class models with broad community backends.
