corpus-data · GitHub Topics

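#Natural language processing# MNBVC (Massive Never-ending BT Vast Chinese corpus): a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.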

chinese chinese-language chinese-nlp chinese-simplified corpus-data natural-language-processing

3.81 k

9 days ago

PlexPt / chatgpt-corpus

ChatGPT Chinese corpus: dialogue, fiction, and customer-service corpora for training large models

corpus Awesome Lists corpus-data question-answering

890

1 year ago

shijiebei2009 / CEC-Corpus

📚 Chinese Emergency Corpus (CEC), Semantic Intelligence Laboratory, Shanghai University

corpus-data

704

6 years ago

sheepzh / poetry

#Natural language processing# The most complete corpus of modern Chinese-language poetry on Earth: 3k+ poets, 80K+ poems, 15M+ characters

poetry literature natural-language-processing corpus-data chinese-corpus

Python 684

3 months ago

gkiril / oie-resources

#Natural language processing# A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.

information-extraction natural-language-processing papers natural-language-understanding nlu extract-information relation-extraction dataset data-science datascience artificial-intelligence big-data corpus-data

496

2 years ago

guhhhhaa / 4675-scifi

#Natural language processing# Chinese NLP corpus of Chinese science fiction: about 4675 Chinese science fiction novels. A Chinese science fiction NLP corpus, text corpus, and text database.

scifi corpus corpus-data natural-language-processing science-fiction chinese-nlp dataset

394

2 years ago

grammarly / ua-gec

#Natural language processing# UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

dataset corpus corpus-data corpus-tools natural-language-processing

Macaulay2 259

1 year ago

guhhhhaa / wula-scifi

#Natural language processing# Chinese NLP corpus of Chinese science fiction: Archive of the Ark Plan of Ula Science Fiction Website. A Chinese science fiction NLP corpus, text corpus, and text database.

corpus corpus-data natural-language-processing science-fiction scifi chinese-nlp dataset

109

2 years ago