📚 200+ Tensor/CUDA Core kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
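As a flavor of what such Tensor Core kernels involve, below is a minimal CUDA sketch of a WMMA-based HGEMM tile: one warp computes a 16x16 tile of C = A * B with half inputs and float accumulation. It is illustrative only (not code from the repository) and assumes row-major matrices, M/N/K multiples of 16, and an sm_70+ GPU.

```cuda
// Minimal HGEMM tile kernel using the WMMA (Tensor Core) API.
// Each warp computes one 16x16 tile of C = A * B (half in, float accumulate).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_hgemm_16x16x16(const half* A, const half* B, float* C,
                                    int M, int N, int K) {
    // One warp (32 threads) per block; the grid maps 16x16 tiles over C.
    int tile_m = blockIdx.y;  // tile row index into C
    int tile_n = blockIdx.x;  // tile column index into C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March along K in 16-wide steps, issuing one Tensor Core MMA per step.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_n * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag += a_frag * b_frag
    }
    wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, c_frag, N,
                            wmma::mem_row_major);
}
// Launch: wmma_hgemm_16x16x16<<<dim3(N / 16, M / 16), 32>>>(dA, dB, dC, M, N, K);
```

Production kernels of this kind add shared-memory staging, double buffering, swizzled layouts, and larger warp tiles on top of this basic pattern to approach cuBLAS-level throughput.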
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
#Large Language Models# 🔥🔥🔥 A collection of awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, and High-Performance Computing (HPC) projects.
#Large Language Models# Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
#Computer Science# GEMM and Winograd-based convolutions using CUTLASS.
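For orientation, the sketch below shows the basic CUTLASS 2.x device-level GEMM API that such kernels build on; the wrapper name run_gemm and the single-precision, row-major, defaulted (SIMT) configuration are assumptions for illustration, not the repository's actual code.

```cuda
// Minimal single-precision GEMM (D = alpha * A * B + beta * C) via the
// CUTLASS 2.x device-level API. Illustrative sketch only.
#include "cutlass/gemm/device/gemm.h"

cutlass::Status run_gemm(int M, int N, int K,
                         const float* A, const float* B, float* C,
                         float alpha = 1.0f, float beta = 0.0f) {
    using Gemm = cutlass::gemm::device::Gemm<
        float, cutlass::layout::RowMajor,   // A: M x K
        float, cutlass::layout::RowMajor,   // B: K x N
        float, cutlass::layout::RowMajor>;  // C/D: M x N

    // Problem size, tensor refs {pointer, leading dimension}, epilogue scalars.
    Gemm::Arguments args({M, N, K},
                         {A, K}, {B, N}, {C, N}, {C, N},
                         {alpha, beta});
    Gemm gemm_op;
    return gemm_op(args);  // launches the GEMM kernel on the default stream
}
```

Winograd convolution is typically layered on top of this: input and filter tiles are transformed, the element-wise products in the transformed domain are carried out as batched GEMMs of this kind, and an inverse transform produces the output tiles.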
A CUTLASS CuTe implementation of a head-dim-64 FlashAttention-2 TensorRT plugin for LightGlue. Runs on Jetson Orin NX 8GB with TensorRT 8.5.2.
A PyTorch implementation of block-sparse operations.
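To make "block sparse" concrete, the sketch below is a plain CUDA kernel for a block-sparse (BSR format) times dense matmul, the kind of operation a PyTorch block-sparse extension typically wraps; the kernel name and layout are illustrative assumptions, not the repository's implementation.

```cuda
// Block-sparse (BSR) x dense matmul: C[M,N] = A[M,K] * B[K,N], where A is stored
// as BS x BS blocks: row_ptr over block rows, col_idx of block columns, and vals
// holding each nonzero block row-major. Illustrative sketch only.
__global__ void bsr_spmm(const int* row_ptr, const int* col_idx, const float* vals,
                         const float* B, float* C, int M, int N, int BS) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    if (row >= M || col >= N) return;

    int brow = row / BS;   // block row containing this output row
    int r = row % BS;      // row offset inside the block
    float acc = 0.0f;

    // Accumulate contributions from every nonzero block in this block row.
    for (int p = row_ptr[brow]; p < row_ptr[brow + 1]; ++p) {
        const float* blk = vals + (size_t)p * BS * BS;
        int kbase = col_idx[p] * BS;  // first dense-K index covered by this block
        for (int k = 0; k < BS; ++k)
            acc += blk[r * BS + k] * B[(kbase + k) * N + col];
    }
    C[row * N + col] = acc;
}
// Launch: dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
// bsr_spmm<<<grid, block>>>(d_row_ptr, d_col_idx, d_vals, dB, dC, M, N, BS);
```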