#ComputerScience# Fast implementation of BERT inference directly on NVIDIA GPUs (CUDA, cuBLAS) and Intel MKL
Wheels for llama-cpp-python compiled with cuBLAS support
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
Julia interface to cuBLAS
Code for testing native float16 matrix-multiplication performance on Tesla P100 and V100 GPUs using cublasHgemm
Code for benchmarking GPU performance with cublasSgemm and cublasHgemm
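The cublasSgemm/cublasHgemm benchmarking repos above all follow the same basic pattern; a minimal sketch of it, assuming the standard cuBLAS v2 API (the matrix size and iteration count here are illustrative choices, not taken from any listed repo):

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Minimal cublasSgemm timing sketch: C = alpha*A*B + beta*C, column-major.
// n and iters are illustrative assumptions, not values from any repo above.
int main() {
    const int n = 4096, iters = 10;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * n * n);
    cudaMalloc(&B, sizeof(float) * n * n);
    cudaMalloc(&C, sizeof(float) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so the timed loop excludes one-time setup costs.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each square GEMM performs 2*n^3 flops; report achieved TFLOP/s.
    double tflops = 2.0 * n * n * n * iters / (ms * 1e-3) / 1e12;
    printf("cublasSgemm %dx%d: %.2f TFLOP/s\n", n, n, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Swapping cublasSgemm for cublasHgemm (with __half buffers and __half alpha/beta) gives the FP16 variant; on P100/V100 the half-precision path is where the native-float16 speedups measured by the repos above show up.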
#LargeLanguageModel# 🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.