Flux diffusion model implementation using quantized FP8 matmul; the remaining layers use faster half-precision accumulation, making it ~2x faster on consumer devices.
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.
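A minimal sketch of the emulation idea (not this extension's own API): round an FP32 tensor through PyTorch's built-in e4m3 dtype and cast it back, so the subsequent compute still runs in FP32 on any hardware.

```python
# Generic FP8 emulation sketch using stock PyTorch dtypes (assumes PyTorch >= 2.1);
# illustrates the concept only and is not this extension's API.
import torch

x = torch.randn(4, 4, dtype=torch.float32)
x_fp8 = x.to(torch.float8_e4m3fn)    # round to 8-bit float (4 exponent, 3 mantissa bits)
x_deq = x_fp8.to(torch.float32)      # cast back so later ops run in FP32
print("max rounding error:", (x - x_deq).abs().max().item())
```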
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including FP8) and easy-to-configure FSDP and DeepSpeed support
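A minimal training-loop sketch with the Accelerate API; the model, optimizer, and data below are placeholders, and `mixed_precision="fp8"` additionally requires an FP8 backend such as Transformer Engine or MS-AMP, so the sketch defaults to bf16.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# "fp8" needs an FP8-capable backend (e.g. Transformer Engine); "bf16"/"fp16" work broadly.
accelerator = Accelerator(mixed_precision="bf16")

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=8)

# prepare() places everything on the right device(s) and wraps for DDP/FSDP/DeepSpeed.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)  # handles gradient scaling/unscaling under mixed precision
    optimizer.step()
    optimizer.zero_grad()
```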
Accelerate LLMs with low-bit (FP4 / INT4 / FP8 / INT8) optimizations using ipex-llm
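A hedged sketch of how low-bit loading typically looks with ipex-llm; the checkpoint id and the exact `load_in_low_bit` values are assumptions, so check the project docs for the current API.

```python
# Assumed ipex-llm usage: its drop-in AutoModelForCausalLM loads weights in a low-bit
# format such as FP8 or INT4; the model id and option value here are illustrative.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_low_bit="fp8")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("FP8 quantization lets us", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```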
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference
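A minimal sketch of FP8 execution with Transformer Engine's PyTorch API; the layer sizes and recipe settings are illustrative, and an FP8-capable GPU (Hopper or Ada) is required.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling recipe: E4M3 for the forward pass, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(512, 1024, device="cuda", requires_grad=True)

# FP8 GEMMs only run inside the autocast context; outside it the layer behaves normally.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```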
An implementation of FP8 flash attention on the Ada architecture using the CUTLASS repository
End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training).
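A hedged sketch of the kind of inference recipe involved: weight-only FP8 quantization of a diffusers pipeline's transformer with torchao. The config name `float8_weight_only`, the checkpoint id, and the prompt are assumptions; the repo's actual recipes cover more, including FP8 training.

```python
# Assumed torchao + diffusers inference recipe; config names may differ across versions.
import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, float8_weight_only

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
quantize_(pipe.transformer, float8_weight_only())  # swap Linear weights to FP8
pipe.to("cuda")

image = pipe("a watercolor fox in a snowy forest", num_inference_steps=28).images[0]
image.save("fox.png")
```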