corpus-tools · GitHub Topics

#Web crawler# Python & command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

web-scraping text-extraction natural-language-processing text-mining crawler text-preprocessing article-extractor readability scraping html-to-markdown corpus-tools rss-feed news-aggregator rag large-language-models

Python 4.12k

1 month ago

BLKSerene / Wordless

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

corpus corpus-linguistics corpus-tools corpus-processing literature translation Parsing tagger lemmatizer dependency-parser

Python 715

17 days ago

flairNLP / fundus

#Web crawler# A very simple news crawler with a funny name

corpus crawler natural-language-processing Python RSS scraper sitemap text-extraction web-scraping corpus-tools dataset image-classification

Python 367

3 days ago

bitextor / bitextor

#Web crawler# Bitextor generates translation memories from multilingual websites

dictionaries crawler wget Parsing warc corpus-tools corpus-processing machine-translation neural-machine-translation statistical-machine-translation

Python 292

5 months ago

grammarly / ua-gec

#Natural language processing# UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

dataset corpus corpus-data corpus-tools natural-language-processing

Macaulay2 259

1 year ago

adbar / simplemma

#Natural language processing# Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

natural-language-processing lemmatizer tokenization wordlist morphological-analysis corpus-tools Parsing language-detection language-identification

Python 154

5 months ago

ynop / audiomate

Python library for handling audio datasets.

audio speech-recognition corpus-tools data-loader speech music noise

Python 137

2 years ago

Helsinki-NLP / OpusFilter

#Natural language processing# OpusFilter - Parallel corpus processing toolkit

corpus-tools corpus-processing natural-language-processing machine-translation

Python 104

18 days ago

NathanDuran / Switchboard-Corpus

Utilities for Processing the Switchboard Dialogue Act Corpus

corpus corpus-processing corpus-data corpus-tools dialogue

Python 68

4 years ago

czcorpus / kontext

An advanced, extensible web front-end for the Manatee-open corpus search engine

corpus-tools corpus-linguistics ui

TypeScript 64

3 days ago

koskenni / beta

An open source reimplementation of Benny Brodda's BETA in Python

beta string-manipulation Open Source hyphenation corpus-tools

Python 63

5 years ago

lennes / spect

SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/

speech analysis annotation corpus-linguistics corpus-tools speech-analysis transcription

HTML 57

2 years ago

LanguageMachines / PICCL

#Natural language processing# A set of workflows for corpus building through OCR, post-correction and normalisation

natural-language-processing workflow OCR corpus-tools corpus-linguistics computational-linguistics

Python 48

3 years ago

johentsch / ms3

A parser for annotated MuseScore 3 files.

corpus corpus-data corpus-processing corpus-tools musescore Parser sheet-music tsv xml-parser xml-parser-library xml-parsing

Python 47

19 days ago

silenterus / deepspeech-cleaner

#Computer science# Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework

deepspeech machine-learning Mozilla speech-recognition corpus-tools multilanguage

Python 47

2 years ago

nickduran / align-linguistic-alignment

Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.

Python notebooks nltk word2vec corpus-tools text-analysis

Python 45

5 months ago

M4t1ss / parallel-corpora-tools

#Natural language processing# Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.

nmt neural machine translation machine-translation neural-machine-translation corpus-tools natural-language-processing language language-processing natural-language filtering cleaning data-science data-processing

PHP 41

1 year ago

uma-pi1 / OPIEC

#Natural language processing# Reading the data from OPIEC - an Open Information Extraction corpus

information-extraction corpus corpus-data corpus-tools natural-language-processing natural-language-understanding wikipedia Wiki corpus-processing dataset

Java 37

6 years ago

johnwdubois / rezonator

Rezonator: Dynamics of human engagement

corpus-tools text-analysis game-development dialogue corpus-linguistics conversational-ai

Yacc 35

6 months ago

NathanDuran / MRDA-Corpus

Utilities for Processing the Meeting Recorder Dialogue Act Corpus

corpus corpus-data corpus-processing corpus-tools dialogue

Python 32

4 years ago