📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
CUDA Kernel Benchmarking Library
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
Fast CUDA Kernels for ResNet Inference.
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
PyTorch custom CUDA kernel for searchsorted
Torch7 bindings for cuda-convnet2 kernels!
Using custom CUDA kernels with OpenCV Mat objects.
Get down and dirty with FlashAttention 2.0 in PyTorch: plug and play, no complex CUDA kernels
CudaPAD is a PTX/SASS viewer for NVIDIA CUDA kernels that provides an on-the-fly view of the assembly.
FlashMLA: Efficient MLA decoding kernels
Embree ray tracing kernels repository.
Efficient Triton Kernels for LLM Training
Tile primitives for speedy kernels
Collections of Apollo Kernels