Fast BERT inference implemented directly on NVIDIA GPUs (CUDA, cuBLAS) and Intel MKL
Wheels for llama-cpp-python compiled with cuBLAS support
📚Tensor/CUDA Cores, 📖150+ CUDA kernels, ⚡️⚡️toy-hgemm library with WMMA, MMA and CuTe (reaching 98%–100% of cuBLAS TFLOPS 🎉🎉).
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
Julia interface to cuBLAS
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs using cublasHgemm
🔥🔥🔥 A collection of awesome public CUDA, cuBLAS, TensorRT and High-Performance Computing (HPC) projects.
Code for benchmarking GPU performance using cublasSgemm and cublasHgemm
Deep learning library using the GPU (CUDA/cuBLAS)