📚 Modern CUDA Learning Notes: 200+ Tensor Core/CUDA Core kernels 🎉, including HGEMM and FlashAttention-2 (FA2) via MMA and CuTe, reaching 98–100% of cuBLAS/FA2 TFLOPS.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
🚀🚀🚀 This repository lists awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX, and High-Performance Computing (HPC) projects.
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
GEMM- and Winograd-based convolutions using CUTLASS.
A CUTLASS CuTe implementation of a head-dim-64 FlashAttention-2 TensorRT plugin for LightGlue. Runs on a Jetson Orin NX 8GB with TensorRT 8.5.2.
A PyTorch implementation of block-sparse matrix operations (see the minimal sketch below).
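Block-sparse matrix multiplication skips tiles of the weight matrix that are entirely zero, so compute scales with the number of nonzero blocks rather than the full matrix size. Below is a minimal pure-PyTorch sketch of that idea, not the API of the repo above: the function name `block_sparse_matmul`, the dense block storage, and the boolean `block_mask` layout are illustrative assumptions. Production block-sparse libraries store only the nonzero blocks and dispatch custom CUDA kernels instead of Python loops.

```python
# A minimal block-sparse matmul sketch in pure PyTorch (illustrative only).
import torch

def block_sparse_matmul(x, w, block_mask, block=32):
    """Compute x @ w, skipping weight tiles whose mask entry is False.

    x:          (M, K) dense input
    w:          (K, N) dense weight, logically partitioned into block x block tiles
    block_mask: (K // block, N // block) bool tensor marking nonzero tiles
    """
    M, K = x.shape
    K2, N = w.shape
    assert K == K2 and K % block == 0 and N % block == 0
    y = torch.zeros(M, N, dtype=x.dtype, device=x.device)
    for bi in range(K // block):
        for bj in range(N // block):
            if block_mask[bi, bj]:
                # Accumulate only the tiles present in the sparsity pattern.
                y[:, bj * block:(bj + 1) * block] += (
                    x[:, bi * block:(bi + 1) * block]
                    @ w[bi * block:(bi + 1) * block,
                        bj * block:(bj + 1) * block]
                )
    return y

# Usage: a roughly 50%-dense block pattern.
x = torch.randn(64, 128)
w = torch.randn(128, 256)
mask = torch.rand(128 // 32, 256 // 32) > 0.5
y = block_sparse_matmul(x, w, mask)

# Reference check: dense matmul with the masked-out tiles zeroed.
w_masked = w.clone()
for bi in range(mask.shape[0]):
    for bj in range(mask.shape[1]):
        if not mask[bi, bj]:
            w_masked[bi * 32:(bi + 1) * 32, bj * 32:(bj + 1) * 32] = 0
assert torch.allclose(y, x @ w_masked, atol=1e-4)
```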