sentencepiece · GitHub Topics

#Android# Open source real-time translation app for Android that runs locally

translator bluetooth-le realtime-translator Android onnx onnxruntime sentencepiece transformers translation nllb Whisper mobile-app offline

C++ 7.63k

21 days ago

OpenNMT / Tokenizer

#Natural Language Processing# Fast and customizable text tokenization library with BPE and SentencePiece support

Parsing sentencepiece natural-language-processing machine-translation bpe unicode tokenization icu Python C++

C++ 305

7 months ago

himkt / konoha

#Natural Language Processing# 🌿 An easy-to-use Japanese text processing tool that lets you switch tokenizers with only small code changes.

natural-language-processing text-processing sentencepiece japanese

Python 241

1 year ago

taishan1994 / sentencepiece_chinese_bpe

Train a Chinese vocabulary with the BPE model in sentencepiece, then use it in transformers.

sentencepiece tokenization

Python 117

2 years ago

lingvanex-mt / models

#Natural Language Processing# Free and open source pre-trained translation models, including Kurdish, Samoan, Xhosa, Lao, Corsican, Cebuano, Galician, Yiddish, Swahili, Russian, Belarusian and Yoruba.

ctranslate2 machine-translation multilingual neural-networks natural-language-processing sentencepiece swahili yoruba translate translation translator

1 month ago

dhpollack / huggingface_libtorch

#Natural Language Processing# Minimal example of using a traced huggingface transformers model with libtorch

PyTorch libtorch natural-language-processing C++ sentencepiece albert

C++ 35

5 years ago

nguyenvulebinh / vietnamese-roberta

#Natural Language Processing# A Robustly Optimized BERT Pretraining Approach for Vietnamese

vietnamese pretrained-models natural-language-processing roberta bert PyTorch fairseq sentencepiece vietnamese-nlp transformer

Python 32

9 months ago

eliben / go-sentencepiece

#Large Language Models# Go implementation of the SentencePiece tokenizer

encoding Go language-model large-language-models tokenization sentencepiece

Go 27

7 months ago

bnosac / sentencepiece

#Natural Language Processing# R package for Byte Pair Encoding / Unigram modelling based on SentencePiece

sentencepiece byte word-segmentation natural-language-processing

C++ 25

2 years ago

Andras7 / gpt2-pytorch

Extremely simple and understandable GPT2 implementation with minor tweaks

PyTorch gpt2 sentencepiece transformers

Python 21

5 years ago

danieldk / sentencepiece

Rust binding for the sentencepiece library

sentencepiece Rust

Rust 20

2 years ago

stephantul / piecelearn

Learning BPE embeddings by first learning a segmentation model and then training word2vec

bpe sentencepiece embeddings word2vec

Python 19

2 years ago

Systemcluster / kitoken

#Natural Language Processing# Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.

bpe natural-language-processing sentencepiece Parsing unigram word-segmentation Node.js Python Rust Web

Rust 19

24 days ago