#Natural Language Processing# Tongyi Qianwen-7B (Qwen-7B) is the 7-billion-parameter model in the Tongyi Qianwen large model series developed by Alibaba Cloud.
#Natural Language Processing# Phase 2 of the Chinese LLaMA-2 & Alpaca-2 large model project, plus 64K ultra-long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
#Large Language Models# Official release of the InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
📖 A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, Flash-Attention, Paged-Attention, parallelism, etc. 🎉🎉
📚 200+ Tensor Core/CUDA Core kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
FlashInfer: Kernel Library for LLM Serving
#Large Language Models# MoBA: Mixture of Block Attention for Long-Context LLMs
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
[CVPR 2025] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
📚 FFPA (Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑ 🎉 vs. SDPA EA.
#Computer Science# A Triton implementation of FlashAttention-2 that adds custom masks.
#Natural Language Processing# Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline parallelism; faster than ZeRO/ZeRO++/FSDP.
#Large Language Models# Benchmarks the performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
#Large Language Models# Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference (a plain-PyTorch reference sketch of GQA decoding follows after this list).
#Natural Language Processing# A fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
A Python package for rematerialization-aware gradient checkpointing (see the checkpointing sketch after this list).
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).
Utilities for efficient fine-tuning, inference and evaluation of code generation models
A simple PyTorch implementation of flash multi-head attention (see the SDPA-based sketch after this list).
#Computer Science# 🚀 An automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations.
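
A minimal reference sketch of what a decode-stage grouped-query attention (GQA) kernel computes, in plain PyTorch rather than CUDA; shapes and the `gqa_decode_step` helper are assumptions for illustration, not the linked repo's API. One new query token attends over a cached K/V whose head count divides the query head count (MQA is `kv_heads=1`, MHA is `kv_heads=q_heads`).

```python
# Reference sketch only: single decode step of grouped-query attention (GQA)
# against a KV cache. Assumed shapes; not the repo's CUDA implementation.
import torch
import torch.nn.functional as F

def gqa_decode_step(q, k_cache, v_cache):
    # q:       (batch, q_heads, 1, head_dim)        -- the new token's queries
    # k_cache: (batch, kv_heads, seq_len, head_dim)
    # v_cache: (batch, kv_heads, seq_len, head_dim)
    b, q_heads, _, d = q.shape
    kv_heads = k_cache.shape[1]
    group = q_heads // kv_heads
    # Expand each KV head so it serves its group of query heads.
    k = k_cache.repeat_interleave(group, dim=1)
    v = v_cache.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5   # (b, q_heads, 1, seq_len)
    return F.softmax(scores, dim=-1) @ v            # (b, q_heads, 1, head_dim)

q = torch.randn(1, 32, 1, 128)
k_cache = torch.randn(1, 8, 1024, 128)
v_cache = torch.randn(1, 8, 1024, 128)
print(gqa_decode_step(q, k_cache, v_cache).shape)   # torch.Size([1, 32, 1, 128])
```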
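
A minimal sketch of plain activation checkpointing with `torch.utils.checkpoint`, the basic mechanism that rematerialization-aware packages build a smarter recomputation plan on top of; the `Block` module and sizes here are illustrative assumptions.

```python
# Sketch: standard activation checkpointing. Activations inside each wrapped
# block are freed after the forward pass and recomputed during backward.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):  # illustrative residual MLP block
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.net(x)

blocks = nn.ModuleList(Block(512) for _ in range(12))
x = torch.randn(8, 512, requires_grad=True)

h = x
for block in blocks:
    # Recompute this block's activations in backward instead of storing them.
    h = checkpoint(block, h, use_reentrant=False)
h.sum().backward()
```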
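
A minimal sketch of flash multi-head attention in PyTorch, built on `torch.nn.functional.scaled_dot_product_attention` (SDPA), which dispatches to a FlashAttention kernel on supported GPUs and dtypes; the `FlashMHA` module name and dimensions are assumptions, not the linked repo's code.

```python
# Sketch: multi-head attention whose core is SDPA; on CUDA with fp16/bf16 and
# supported head dims, PyTorch selects the FlashAttention backend.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlashMHA(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, causal: bool = True) -> torch.Tensor:
        b, s, d = x.shape
        # Project to Q, K, V and split heads: each (b, heads, seq, head_dim).
        qkv = self.qkv(x).view(b, s, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Fused, memory-efficient attention; no (seq x seq) matrix is stored.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
        return self.proj(out.transpose(1, 2).reshape(b, s, d))

mha = FlashMHA(dim=256, num_heads=8)
x = torch.randn(2, 128, 256)
if torch.cuda.is_available():
    # Half precision on GPU is needed for the FlashAttention backend.
    mha, x = mha.cuda().half(), x.cuda().half()
print(mha(x).shape)  # torch.Size([2, 128, 256])
```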