A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
#Computer Science# A Unified Toolkit for Deep Learning-Based Document Image Analysis
Translation - A Python library for document layout understanding
An Open-Source Python3 tool with SMALL models for recognizing layouts, tables, math formulas (LaTeX), and text in images, converting them into Markdown format. A free alternative to Mathpix, empowerin...
Read and extract text and other content from PDFs in C# (port of PDFBox)
OCR engine for all the languages
Document layout analysis resources and repositories for development with PdfPig.
#Computer Science# Yomitoku is an AI-powered document image analysis package designed specifically for the Japanese language.
#Computer Science# A toolbox of OCR models and algorithms based on MindSpore
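📝 Extracts content from document images and outputs it one-to-one to Word or Txt for further use or processing. Planned: support for PDF/image input with output in JSON, Txt, Word, and Markdown formats.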
#Natural Language Processing# Doc2Graph transforms documents into graphs and exploits a GNN to solve several tasks.
YOLO models trained on DocLayNet - power your document intelligence with layout analysis
An official implementation of the paper "Paragraph2Graph: A Language-independent GNN-based framework for layout analysis"
#Computer Science# Trained Detectron2 object detection models for document layout analysis based on the PubLayNet dataset
[ICDAR 2023] SelfDocSeg: A self-supervised vision-based approach towards Document Segmentation (Oral)
A Large Dataset of Historical Japanese Documents with Complex Layouts
A Unified Toolkit for Deep Learning-Based Table Extraction
#Computer Science# Trained Detectron2 object detection models for document layout analysis based on the PubLayNet dataset
#Computer Science# Proof of concept of training a simple Region Classifier using PdfPig and ML.NET (LightGBM). The objective is to classify each text block in a PDF document page as either title, text, list, table and ...
Layout detection implemented with java-yolov8 (Chinese layout detection); java-yolov8 is used to detect the layout of Chinese document images