Taskflow helps you quickly write parallel and heterogeneous task programs in modern C++
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Sample codes for my CUDA programming book
Thin, unified, C++-flavored wrappers for the CUDA APIs
#Computer Science# TinyChatEngine: On-Device LLM Inference Library
#Computer Science# Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
Safe Rust wrapper around the CUDA toolkit
#Large Language Models# LLM notes, including model inference, transformer model structure, and LLM framework code analysis notes.
#Computer Science# A self-learning tutorial for CUDA High Performance Programming.
A simple GPU hash table implemented in CUDA using lock free techniques
This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010
#Computer Science# From zero to hero CUDA for accelerating maths and machine learning on GPU.
μ-Cuda, COVER THE LAST MILE OF CUDA. Features: IntelliSense-friendly, structured launch, automatic CUDA graph generation and updating.
An implementation of HIP that works on CPUs, across OSes.
CudaPAD is a PTX/SASS viewer for NVIDIA CUDA kernels that provides an on-the-fly view of the assembly.
#Algorithm Practice# CUDA kernel author's tools
Accelerated General (FP32) Matrix Multiplication from scratch in CUDA
#Computer Science# Install CUDA on Windows 11 using WSL2