LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
📚 200+ Tensor Core/CUDA Core kernels, including ⚡️ flash-attn-mma and ⚡️ hgemm implementations with WMMA, MMA, and CuTe, reaching 98%–100% of cuBLAS/FlashAttention-2 TFLOPS 🎉.
Deep learning in Rust, with shape-checked tensors and neural networks.
Safe Rust wrapper around the CUDA toolkit.
Simple utilities to enable code reuse and portability between CUDA C/C++ and standard C/C++.
An archive of materials produced for an introductory class on CUDA programming taught at Stanford University in 2010.
From-zero-to-hero CUDA for accelerating math and machine learning on the GPU.
Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPUs/GPUs and NVIDIA and AMD hardware without writing any additional C kernel code. Write your funct...
Some CUDA design patterns and a bit of template magic.
Spiking Neural Networks in C++ with strong GPU acceleration through CUDA.
Tools for CUDA kernel authors.
An open-source, cross-platform compiler from Microsoft Research for the compute-intensive loops used in AI algorithms.
A Triton implementation of FlashAttention-2 that adds support for custom masks.
High-speed GEMV kernels, achieving up to a 2.7x speedup over the PyTorch baseline.
A tool for examining GPU scheduling behavior.
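
To illustrate the kind of kernel the GEMV and GEMM entries above optimize, here is a minimal, hypothetical CUDA sketch (not taken from any listed repo): a naive single-precision GEMV with one thread per output row. Production kernels like those in the repos above instead use warp-level reductions, vectorized loads, and Tensor Core (WMMA/MMA) paths.

```cuda
// Naive GEMV sketch: y = A * x, with A stored row-major (rows x cols).
// One thread computes one output row; purely illustrative, not optimized.
__global__ void sgemv_naive(const float *A, const float *x, float *y,
                            int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
        acc += A[row * cols + c] * x[c];  // dot product of row with x
    y[row] = acc;
}

// Launch example: one 256-thread block per 256 rows.
// int threads = 256;
// int blocks  = (rows + threads - 1) / threads;
// sgemv_naive<<<blocks, threads>>>(dA, dx, dy, rows, cols);
```

The speedups quoted above come largely from replacing this per-thread inner loop with coalesced, vectorized reads and warp-shuffle reductions, which is why hand-written kernels can beat generic framework baselines.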