DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
#Large Language Models# Tutel MoE: Optimized Mixture-of-Experts Library, supporting DeepSeek FP8/FP4
PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
Accelerate LLM with low-bit (FP4 / INT4 / FP8 / INT8) optimizations using ipex-llm
#Computer Science# A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization
An FP8 flash attention implementation for the Ada architecture, built with the cutlass repository
End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training).
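Several of the projects above (e.g. DeepGEMM's fine-grained scaling, the FP8 emulation extension) share the same core idea: quantize tensors block by block, with each block carrying its own scale so that its largest magnitude maps onto the FP8 dynamic range. The sketch below is not taken from any of these repositories; it is a minimal pure-Python illustration of per-block E4M3 quantization, assuming round-to-nearest and the standard E4M3 maximum of 448 (the function names are hypothetical).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest representable FP8 E4M3 value
    (saturating at the max; NaN/Inf handling omitted for brevity)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # E4M3 has 3 mantissa bits; normals extend down to 2**-6.
    exp = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exp - 3)  # spacing of representable values in this binade
    return sign * round(mag / step) * step

def quantize_block(block):
    """Fine-grained (per-block) scaling: the block's amax is mapped to the
    E4M3 maximum, then each element is rounded to the FP8 grid."""
    amax = max(abs(v) for v in block) or 1.0
    scale = amax / FP8_E4M3_MAX
    return [round_to_e4m3(v / scale) for v in block], scale

def dequantize_block(quantized, scale):
    """Recover approximate FP32 values by re-applying the block scale."""
    return [v * scale for v in quantized]
```

With 3 mantissa bits the relative rounding error per element stays below about 1/16, which is why per-block (rather than per-tensor) scales matter: a single outlier only degrades its own block.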