Fast inference engine for Transformer models
📚 200+ Tensor/CUDA Core kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
Multi-Threaded FP32 Matrix Multiplication on x86 CPUs
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats a...
Stretching GPU performance for GEMMs and tensor contractions.
🔥🔥🔥 A collection of awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, and High-Performance Computing (HPC) projects.
DBCSR: Distributed Block Compressed Sparse Row matrix library
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library.
PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU.
A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks
Serial and parallel implementations of matrix multiplication
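For context on what the projects above optimize, here is a minimal sketch (illustrative only, not taken from any listed repository) of the baseline they all improve upon: a naive serial FP32 GEMM and an OpenMP-parallel variant, assuming row-major matrices. The listed libraries replace this loop nest with blocked, vectorized, multi-threaded, or GPU tensor-core implementations.

```cpp
#include <cstddef>

// C = A * B, with A (m x k), B (k x n), C (m x n), all row-major FP32.
void sgemm_serial(const float* A, const float* B, float* C,
                  std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

// Same loop nest, with the outer row loop distributed across threads via OpenMP
// (compile with -fopenmp; the pragma is ignored otherwise and the code stays serial).
void sgemm_parallel(const float* A, const float* B, float* C,
                    std::size_t m, std::size_t n, std::size_t k) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(m); ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[static_cast<std::size_t>(i) * k + p] * B[p * n + j];
            C[static_cast<std::size_t>(i) * n + j] = acc;
        }
}
```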