#Computer Science# Fast inference engine for Transformer models
📚 Modern CUDA learning notes: 200+ Tensor Core/CUDA Core kernels 🎉, HGEMM and FlashAttention-2 via MMA and CuTe, reaching 98–100% of cuBLAS/FA2 TFLOPS.
Multi-Threaded FP32 Matrix Multiplication on x86 CPUs
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
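The core idea behind the SGEMM-optimization projects above is loop tiling: process the matrices in small blocks so the working set stays in fast memory (cache on CPUs, shared memory on GPUs). A minimal CPU-side sketch of that principle, with a hypothetical helper name `sgemm_blocked` and block size chosen for illustration (real cuBLAS-class kernels add register tiling, vectorization, and shared-memory staging on top):

```c
#include <string.h>

/* Cache-blocked SGEMM sketch: C = A * B for row-major M x K and K x N
 * float matrices. The three outer loops walk BLOCK-sized tiles; the
 * inner loops multiply one pair of tiles while it is cache-resident. */
#define BLOCK 32

static void sgemm_blocked(int M, int N, int K,
                          const float *A, const float *B, float *C) {
    memset(C, 0, sizeof(float) * (size_t)M * N);
    for (int i0 = 0; i0 < M; i0 += BLOCK)
        for (int k0 = 0; k0 < K; k0 += BLOCK)
            for (int j0 = 0; j0 < N; j0 += BLOCK)
                /* multiply one BLOCK x BLOCK tile pair */
                for (int i = i0; i < M && i < i0 + BLOCK; ++i)
                    for (int k = k0; k < K && k < k0 + BLOCK; ++k) {
                        float a = A[i * K + k];
                        for (int j = j0; j < N && j < j0 + BLOCK; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

The `i-k-j` inner ordering keeps the innermost loop streaming contiguously through rows of `B` and `C`, which is what makes the tiling pay off.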
#Computer Science# The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats a...
#Large Language Model# 🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
#Computer Science# Stretching GPU performance for GEMMs and tensor contractions.
DBCSR: Distributed Block Compressed Sparse Row matrix library
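DBCSR builds on the compressed sparse row (CSR) layout, which stores only the nonzero entries of a matrix plus per-row offsets. A minimal scalar sketch of that layout and its matrix-vector product (the struct and function names are illustrative, not DBCSR's API; DBCSR generalizes each stored entry to a dense block and distributes blocks across MPI ranks):

```c
/* Minimal CSR sparse matrix-vector product y = A * x. */
typedef struct {
    int nrows;
    const int *row_ptr;   /* nrows+1 offsets into col_idx/val */
    const int *col_idx;   /* column index of each stored nonzero */
    const double *val;    /* value of each stored nonzero */
} csr_t;

static void csr_spmv(const csr_t *A, const double *x, double *y) {
    for (int i = 0; i < A->nrows; ++i) {
        double acc = 0.0;
        /* row i's nonzeros live in positions row_ptr[i] .. row_ptr[i+1]-1 */
        for (int p = A->row_ptr[i]; p < A->row_ptr[i + 1]; ++p)
            acc += A->val[p] * x[A->col_idx[p]];
        y[i] = acc;
    }
}
```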
#Computer Science# hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library
A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks
PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU
Serial and parallel implementations of matrix multiplication
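The serial-versus-parallel matmul entries above reduce to one idiomatic pattern on CPUs: parallelize the outer row loop. A hedged sketch using OpenMP (the function name `matmul` is illustrative; with `-fopenmp` the rows of `C` are split across threads, and without it the pragma is ignored and the same code runs serially, so it covers both variants):

```c
#include <stddef.h>
#include <string.h>

/* FP32 matmul C = A * B for row-major n x n matrices. Each iteration
 * of the i loop writes a disjoint row of C, so the loop is safely
 * parallelizable with no synchronization. */
void matmul(int n, const float *A, const float *B, float *C) {
    memset(C, 0, sizeof(float) * (size_t)n * n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Row-wise partitioning works because `C`'s rows are independent; a k-loop parallelization would instead race on the same `C[i*n+j]` accumulators.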