#Web crawler#Python and command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
#Natural language processing#Module for automatic summarization of text documents and HTML pages.
Golang PDF library for creating and processing PDF files (pure Go)
#Natural language processing#Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
A text extraction library supporting PDFs, images, office documents and more
#Computer science#A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
Heuristic-based boilerplate removal tool
This repository has moved! https://github.com/unidoc/unipdf
A self-hosted search engine for documents.
Text Extraction, Rendering and Converting of PDF Documents
A simple library and set of tools for parsing, modifying, and composing SRT files.
#Natural language processing#[UNMAINTAINED] Extract values from strings and fill your structs with NLP.
#Web crawler#A very simple news crawler with a funny name
#Computer science#🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Parse PDFs into markdown using Vision LLMs
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
hotpdf is a fast PDF parsing library for extracting and finding text within PDF documents, built on top of pdfminer.six
#Natural language processing#Entity Disambiguation as text extraction (ACL 2022)
AWS Lambda functions to extract text from various binary formats.