text-extraction · GitHub Topics

#Web crawler# Python & command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

web-scraping text-extraction natural-language-processing text-mining crawler text-preprocessing article-extractor readability scraping html-to-markdown corpus-tools rss-feed news-aggregator rag large-language-models

Python 4.12 k

1 month ago

miso-belica / sumy

#Natural language processing# Module for automatic summarization of text documents and HTML pages.

Python lsa textteaser html-page summarizer pagerank-algorithm reduction text-extraction html-extraction html-extractor summarization summary natural-language-processing

Python 3.57 k

1 year ago

unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)

Go pdf pdf-library pdf-generation pdf-document-processor text-extraction pdf-manipulation signing pdf-sign pdf-generator

Go 2.75 k

20 days ago

Goldziher / kreuzberg

A text extraction library supporting PDFs, images, office documents and more

asyncio docx OCR pdf text-extraction

Python 1.75 k

3 days ago

chrismattmann / tika-python

#Natural language processing# Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Python Parsing text-extraction mime buffer memex text-recognition detection recognition natural-language-processing nlp-library COVID-19 extraction

Python 1.57 k

6 days ago

whitelok / image-text-localization-recognition

#Computer science# A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.

text-recognition text-detection convolutional-neural-networks deep-learning OCR text-extraction machine-learning Awesome Lists

952

2 years ago

miso-belica / jusText

Heuristic based boilerplate removal tool

Python text-extraction html-parser html-parsing

Python 765

2 months ago

unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

Go pdf pdf-library pdf-files text-extraction pdf-invoice

Go 709

6 years ago

ICIJ / datashare

A self-hosted search engine for documents.

named-entity-recognition text-extraction extract investigative-journalism elasticsearch Docker web-gui

Java 626

2 days ago

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

text-extraction R rstats pdf-files r-package

C++ 533

1 month ago

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

srt subtitle subtitles text-extraction Python mit-license tools cli command-line-tool Library

Python 498

1 year ago

shixzie / nlp

#Natural language processing# [UNMAINTAINED] Extract values from strings and fill your structs with nlp.

natural-language-processing Parsing Go text-extraction text

Go 387

8 years ago

flairNLP / fundus

#Web crawler# A very simple news crawler with a funny name

corpus crawler natural-language-processing Python RSS scraper sitemap text-extraction web-scraping corpus-tools dataset image-classification

Python 367

3 days ago

iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

document-parser pdf-parser pdf-to-markdown text-extraction

Python 339

2 months ago

pd3f / pd3f

#Computer science# 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

pdf text-extraction pdf-to-text pipeline machine-learning OCR language-model extract-text parsr Python

HTML 314

1 year ago

py-pdf / benchmarks

Benchmarking PDF libraries

benchmark data-extraction mupdf pdf pypdf2 text-extraction

Python 269

1 year ago

bookieio / breadability

Reworked https://www.readability.com/ parsing library (now https://mercury.postlight.com/ is living alternative)

Python text-mining text-extraction html-extraction html-extractor html-parsing

HTML 204

1 year ago

weareprestatech / hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

pdf Python text-extraction text-search

Python 186

4 months ago

SapienzaNLP / extend

#Natural language processing# Entity Disambiguation as text extraction (ACL 2022)

natural-language-processing Entity resolution text-extraction PyTorch acl

Python 181

3 years ago

skylander86 / lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

text-extraction aws-lambda OCR lambda-functions pdf tesseract

Python 176

7 years ago