Safe Rust wrapper around the CUDA toolkit
#Natural Language Processing# An open collection of methodologies to help with successful training of large language models.
#Natural Language Processing# An open collection of implementation tips, tricks and resources for training large language models.
Best practices & guides on how to write distributed PyTorch training code
#Computer Science# Distributed and decentralized training framework for PyTorch over a communication graph
#Computer Science# Federated Learning Utilities and Tools for Experimentation
#Computer Science# NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.
Sample code showing how to call collective operation functions in multi-GPU environments, with simple examples of the broadcast, reduce, allGather, reduceScatter and sendRecv operations.
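As a rough illustration of what these example repositories cover, the sketch below broadcasts a buffer from GPU 0 to a second GPU within one process using ncclCommInitAll and ncclBroadcast. It is a minimal sketch, not code from any listed repository; the device count, buffer size, and CHECK_* macros are assumptions.

```c
/*
 * Minimal single-process, multi-GPU broadcast with NCCL.
 * The device count, buffer size, and CHECK_* macros are illustrative
 * assumptions, not code taken from any of the repositories listed here.
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
  fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
  fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); exit(1); } } while (0)

int main(void) {
  const int nDev = 2;               /* assumed number of visible GPUs */
  const size_t count = 1 << 20;     /* elements per buffer */
  int devs[2] = {0, 1};
  ncclComm_t comms[2];
  float *buf[2];
  cudaStream_t streams[2];

  /* One buffer and one stream per device; in a real program rank 0's
   * buffer would be filled with the data to broadcast. */
  for (int i = 0; i < nDev; ++i) {
    CHECK_CUDA(cudaSetDevice(devs[i]));
    CHECK_CUDA(cudaMalloc((void **)&buf[i], count * sizeof(float)));
    CHECK_CUDA(cudaMemset(buf[i], 0, count * sizeof(float)));
    CHECK_CUDA(cudaStreamCreate(&streams[i]));
  }

  /* Create one communicator per device within this process. */
  CHECK_NCCL(ncclCommInitAll(comms, nDev, devs));

  /* Broadcast rank 0's buffer to every device. The group calls let NCCL
   * treat the per-device launches as a single collective. */
  CHECK_NCCL(ncclGroupStart());
  for (int i = 0; i < nDev; ++i)
    CHECK_NCCL(ncclBroadcast(buf[i], buf[i], count, ncclFloat, 0, comms[i], streams[i]));
  CHECK_NCCL(ncclGroupEnd());

  /* Wait for completion, then release resources. */
  for (int i = 0; i < nDev; ++i) {
    CHECK_CUDA(cudaSetDevice(devs[i]));
    CHECK_CUDA(cudaStreamSynchronize(streams[i]));
    CHECK_CUDA(cudaStreamDestroy(streams[i]));
    CHECK_CUDA(cudaFree(buf[i]));
  }
  for (int i = 0; i < nDev; ++i)
    CHECK_NCCL(ncclCommDestroy(comms[i]));
  return 0;
}
```

The same structure applies to the other collectives; only the ncclBroadcast call changes (for example to ncclAllReduce or ncclReduceScatter).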
A Julia wrapper for the NVIDIA Collective Communications Library.
#Computer Science# Python distributed non-negative matrix factorization with custom clustering
#Computer Science# NCCL examples from the official NVIDIA NCCL Developer Guide.
Uses ncclSend and ncclRecv to implement ncclSendrecv, ncclGather, ncclScatter and ncclAlltoall.
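Since NCCL does not ship gather, scatter, or all-to-all collectives, repositories like the one above compose them from ncclSend and ncclRecv issued inside a group call. The sketch below shows that pattern for an all-to-all exchange; the helper name allToAllFloat, the float element type, and the assumption that the caller has already created the communicator and stream are illustrative, not taken from the repository.

```c
/*
 * Sketch: composing an all-to-all exchange from ncclSend/ncclRecv.
 * Helper name, element type, and caller-provided communicator/stream
 * are assumptions for illustration.
 */
#include <cuda_runtime.h>
#include <nccl.h>

/* sendbuff and recvbuff each hold `nranks` contiguous chunks of `count`
 * floats: chunk `peer` of sendbuff is sent to rank `peer`, and chunk
 * `peer` of recvbuff is filled with data received from rank `peer`. */
ncclResult_t allToAllFloat(const float *sendbuff, float *recvbuff,
                           size_t count, int nranks,
                           ncclComm_t comm, cudaStream_t stream) {
  ncclResult_t res;
  if ((res = ncclGroupStart()) != ncclSuccess) return res;
  for (int peer = 0; peer < nranks; ++peer) {
    /* Inside a group, sends and receives are scheduled together, so the
     * loop does not deadlock even though each rank posts its sends first. */
    if ((res = ncclSend(sendbuff + (size_t)peer * count, count, ncclFloat,
                        peer, comm, stream)) != ncclSuccess) return res;
    if ((res = ncclRecv(recvbuff + (size_t)peer * count, count, ncclFloat,
                        peer, comm, stream)) != ncclSuccess) return res;
  }
  return ncclGroupEnd();
}
```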
Summary of call graphs and data structures of NVIDIA Collective Communication Library (NCCL)
Blink+: Increase GPU group bandwidth by utilizing cross-tenant NVLink.
Experiments with low-level communication patterns that are useful for distributed training.