document-processing · GitHub Topics

#Natural Language Processing# ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Artificial Intelligence Large Language Models Natural Language Processing OCR openai Python document-image-analysis document-intelligence document-parsing document-processing langchain Machine Learning pdf pdf-to-text

Python 1.19 k

5 days ago

dhlab-epfl / dhSegment

Generic framework for historical document processing

Tensorflow segmentation historical-data Python document-processing

Python 375

4 years ago

awslabs / project-lakechain

#Natural Language Processing# ⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

Amazon Web Services Computer Vision document-processing generative-ai Machine Learning Natural Language Processing retrieval-augmented-generation Serverless Hacktoberfest aws-cdk

TypeScript 175

25 days ago

formkiq / formkiq-core

A full-featured Document Management Platform / Document Layer for your application, providing storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. Pleas...

amazon-web-services Amazon Web Services cloud-storage dms document-database document-management document-management-system document-processing headless Serverless OCR optical-character-recognition

Java 124

1 day ago

awslabs / rhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

amazon-bedrock document-processing generative-ai multi-modal

Python 81

3 days ago

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced info...

document-processing information-retrieval pdf-parsing pdf-to-markdown Python rag retrieval-augmented-generation text-extraction pdf-converter

Python 69

5 months ago

parsee-ai / parsee-core

#Large Language Models# Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.

document-processing Large Language Models structured-data multimodal

Python 68

22 days ago

steindani / pandoc-include

An include filter for Pandoc

pandoc pandoc-filter Markdown document-processing

Haskell 62

4 years ago

aws-solutions / enhanced-document-understanding-on-aws

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates...

document-analysis document-processing

JavaScript 37

9 days ago

cburschka / lyx

Unofficial mirror of git://git.lyx.org/lyx.git (updates daily; not affiliated with lyx.org).

mirror document-processing LaTeX

C++ 36

2 years ago

kili-technology / awesome-datasets

#Natural Language Processing# A comprehensive list of annotated training datasets classified by use case.

awesome-public-datasets Datasets Open Data dataset data open-datasets annotation Natural Language Processing entity-extraction ner entity-recognition document-processing OCR

3 years ago

afrozas / proceedings

Semantic extraction from conference proceedings.

conferences semantic spaCy document-processing

Python 31

5 years ago

jmanhype / DSPy-Multi-Document-Agents

#Natural Language Processing# An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

Artificial Intelligence distributed-systems document-processing knowledge-management Natural Language Processing query-optimization vector-search

Python 27

8 months ago

MBAigner / PDFSegmenter

This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.

pdf document-processing Python layout-analysis annotations CSV table

Python 22

5 years ago

greed2411 / tokyo

tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension 🦔. Alas Extracts text 💪.

document-processing Clojure ring mime-types plugin extract-text filetype text-extraction

Clojure 18

5 years ago

eklem / stopword-trainer

#Natural Language Processing# A module for creating stopword lists for any language, based on a set of documents.

Natural Language Processing document-processing information-retrieval

JavaScript 14

7 months ago

abdullahshafiq-20 / ResumeConvertorLatex

ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaT...

Automation developer-tools document-processing Express LaTeX Node.js Open Source pdf-parsing React resume Tailwind CSS TeX

JavaScript 14

1 day ago