Transformer Engine
NVIDIA library for low-precision (FP8 on Hopper/Ada-class GPUs, FP4 on newer hardware) and fused kernels to accelerate Transformer training and inference.
Why it is included
Listed on TAAFT under NVIDIA repositories tagged machine-learning / LLM acceleration.
Best for
Training and serving frontier Transformers where FP8/FP4 kernels unlock throughput.
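To make the FP8 claim concrete: the throughput and memory wins come from storing activations and weights in 8-bit floating-point formats such as E4M3 (4 exponent bits, 3 mantissa bits). The sketch below is a plain-Python illustration of E4M3 rounding, not Transformer Engine's actual implementation; the function names are my own, and the value enumeration follows the OCP FP8 E4M3 layout (bias 7, largest finite value 448, one NaN encoding).

```python
def e4m3_values():
    # Enumerate all finite non-negative values of the OCP FP8 E4M3 format:
    # 4 exponent bits (bias 7), 3 mantissa bits. The all-ones pattern
    # (exp=1111, mantissa=111) encodes NaN, so the max finite value is 448.
    vals = [0.0]
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # NaN encoding, not a finite value
            if exp == 0:
                vals.append((man / 8) * 2 ** -6)             # subnormal
            else:
                vals.append((1 + man / 8) * 2 ** (exp - 7))  # normal
    return sorted(set(vals))

def quantize_e4m3(x):
    # Round x to the nearest representable E4M3 value, saturating at +/-448
    # (saturation-on-overflow is the common convention for FP8 training).
    vals = e4m3_values()
    mag = min(abs(x), 448.0)
    nearest = min(vals, key=lambda v: abs(v - mag))
    return -nearest if x < 0 else nearest

print(quantize_e4m3(0.1))     # 0.1015625, the nearest E4M3 neighbour
print(quantize_e4m3(1000.0))  # 448.0, saturated to the max finite value
```

Halving each value to 8 bits (versus FP16) is what doubles effective memory bandwidth and tensor-core throughput; the library's real contribution is doing this rounding, plus per-tensor scaling, inside fused GPU kernels.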
Strengths
- FP8/FP4 paths
- Tight PyTorch/JAX integration options
- NVIDIA-optimized
Limitations
- Hardware-specific wins; not portable to all accelerators
Good alternatives
FlashAttention · DeepSpeed · PyTorch AMP alone
Related tools
AI & Machine Learning
DeepSpeed
Microsoft library for extreme-scale model training: ZeRO optimizer states, pipeline parallelism, and inference kernels.
PyTorch
Deep learning framework with strong research-to-production paths.
TensorRT-LLM
NVIDIA TensorRT–based library for optimized LLM inference on GPUs with multi-GPU and speculative decoding features.
Axolotl
YAML-configured fine-tuning for LLMs: LoRA, QLoRA, FSDP, and many architectures on top of Hugging Face trainers.
Unsloth
Optimized fine-tuning library claiming 2× faster LoRA/QLoRA with less VRAM via custom kernels and Hugging Face compatibility.
OLMo
Allen Institute for AI's fully open LLM pipeline: weights, training code, data mixes, and evaluation; a research-transparency flagship.
