Fast inference engine for Transformer models
📚 200+ Tensor/CUDA Core kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
Multi-Threaded FP32 Matrix Multiplication on x86 CPUs
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats a...
Stretching GPU performance for GEMMs and tensor contractions.
🔥🔥🔥 A collection of awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, and High-Performance Computing (HPC) projects.
DBCSR: Distributed Block Compressed Sparse Row matrix library
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library.
PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU.
A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks
Serial and parallel implementations of matrix multiplication
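For context on what the projects above optimize, here is a minimal sketch (illustrative only, not taken from any listed repository) of the baseline they all improve upon: a naive serial FP32 GEMM and an OpenMP-parallel variant, assuming row-major matrices. The listed libraries replace this loop nest with blocked, vectorized, multi-threaded, or GPU tensor-core implementations.

```cpp
#include <cstddef>

// C = A * B, with A (m x k), B (k x n), C (m x n), all row-major FP32.
void sgemm_serial(const float* A, const float* B, float* C,
                  std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

// Same loop nest, with the outer row loop distributed across threads via OpenMP
// (compile with -fopenmp; the pragma is ignored otherwise and the code stays serial).
void sgemm_parallel(const float* A, const float* B, float* C,
                    std::size_t m, std::size_t n, std::size_t k) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(m); ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[static_cast<std::size_t>(i) * k + p] * B[p * n + j];
            C[static_cast<std::size_t>(i) * n + j] = acc;
        }
}
```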