#Natural Language Processing# RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine built on deep document understanding
Get your documents ready for gen AI
#Natural Language Processing# Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Knowledge Agents and Management in the Cloud
#Natural Language Processing# ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Table structure recognition dataset of the paper: Complicated Table Structure Recognition
#Computer Science# 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
#Large Language Models# A Python package for converting PDFs to Markdown while extracting images and tables, and generating text descriptions of the extracted tables/images using several LLM clients. And many more functio...
PDF text data extraction web app with OCR for scanned documents
Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library
OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.
C# and VB.NET samples for Docotic.Pdf library
CLI for extracting text from PDF files (and possibly tables)
#Natural Language Processing# A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, and PDF with GROBID and LangChain, or listen to them as a podcast. Customize your own pipelines...
Standalone .NET converter library for converting PDF, DOCX, XLSX, HTML, image, CSV, RTF, and TXT files in .NET Framework; requires neither Adobe Acrobat components nor Microsoft Office Interop Assemblies
#Natural Language Processing# The code base of the front-end of nocodefunctions.com
A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Batch-convert PDF to text and extract data from PDFs in Python
Simple PDF-to-text conversion in Python using PDFtk and PyPDF2