pdf-to-text · GitHub Topics

infiniflow / ragflow

#Natural Language Processing# RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine built on deep document understanding

TypeScript 48.68 k

2 days ago

docling-project / docling

Get your documents ready for gen AI

Artificial Intelligence convert documents pdf tables document-parser document-parsing docx HTML Markdown pdf-converter pdf-to-json pdf-to-text pptx xlsx

Python 26.95 k

1 day ago

Unstructured-IO / unstructured

#Natural Language Processing# Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Deep Learning document-parsing Machine Learning Natural Language Processing OCR information-retrieval data-pipelines preprocessing pdf-to-text pdf pdf-to-json document-image-analysis donut document-image-processing document-parser docx langchain Large Language Models

HTML 10.85 k

5 days ago

run-llama / llama_cloud_services

Knowledge Agents and Management in the Cloud

document Parsing pdf pdf-document-processor pptx structured-data document-parser document-parsing docx-to-markdown pdf-to-excel pdf-to-json pdf-to-text ppt-to-json tables ppt-to-markdown pdf-to-markdown

Python 3.87 k

1 day ago

enoch3712 / ExtractThinker

#Natural Language Processing# ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Artificial Intelligence Large Language Models Natural Language Processing OCR openai Python document-image-analysis document-intelligence document-parsing document-processing langchain Machine Learning pdf pdf-to-text

Python 1.19 k

5 days ago

Academic-Hammer / SciTSR

Table structure recognition dataset of the paper: Complicated Table Structure Recognition

table-structure-recognition pdf-to-text

Python 360

5 years ago

pd3f / pd3f

#Computer Science# 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

pdf text-extraction pdf-to-text pipeline Machine Learning OCR language-model extract-text parsr Python

HTML 314

1 year ago

shoryasethia / markdrop

#Large Language Models# A Python package for converting PDFs to markdown while extracting images and tables, generate descriptive text descriptions for extracted tables/images using several LLM clients. And many more functio...

Open Source pypi-package image-to-text Large Language Models pdf-to-markdown pdf-to-text table-to-text agents

Python 89

17 days ago

nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

pdf-to-text Streamlit streamlit-webapp text-extraction Python OCR ocr-python pdf

Python 87

10 months ago

datalogics / adobe-pdf-library-samples

Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library

OCR pdf pdf-conversion pdf-converter pdf-document pdf-generation pdf-lib pdf-manipulation pdf-merger pdf-parser pdf-to-text pdf-tools pdfa

C# 82

2 years ago

NanoNets / ocr-python

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

OCR tesseract pdf Python pdf-to-json pdf-to-text image-to-text

Jupyter Notebook 80

2 years ago

BitMiracle / Docotic.Pdf.Samples

C# and VB.NET samples for Docotic.Pdf library

pdf-library pdf-to-text pdf-signature pdf-generation extract-text net-core pdf-manipulation pdf-parser html-to-pdf

Visual Basic .NET 78

2 days ago

galkahana / pdf-text-extraction

cli for extracting text from PDF files (and maybe possibly tables)

pdf pdf-to-text

C++ 77

23 days ago

papercast-dev / papercast

#Natural Language Processing# A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines...

arxiv Python dag Natural Language Processing pdf-converter pdf-document-processor pipeline document-parser document-parsing pdf-to-text podcast tts

Python 49

1 month ago

iditectweb / converter

Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop Assemblies to convert PDF, DOCX, XLSX, HTML, Image, CSV, RTF, and TXT in .NET Framework

pdf-to-text html-to-pdf

C# 40

6 years ago