Read and extract text and other content from PDFs in C# (port of PDFBox)
Translation: Read and extract text and other content from PDFs in C# (port of PdfBox)
OCR engine for all the languages
Document Layout Analysis resources for development with PdfPig.
Page to PAGE Layout Analysis Tool
Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
A powerful CLI tool for visualization and encoding of PAGE-XML files
Dataset and models for catalogs' Layout analysis and HTR
Automatically re-order lines, words and glyphs to become textually consistent with their parents.
The repo gt_structure_1_4 is part of the OCR-D Ground Truth Structure corpus. Only the structure of the printed page is annotated. The corpus was created as a result of the DFG project OCR-D.
The repo gt_structure_1_3 is part of the OCR-D Ground Truth Structure corpus. Only the structure of the printed page is annotated. The corpus was created as a result of the DFG project OCR-D.
OCR-D wrapper for page-xml-draw
The repo gt_structure_1_2 is part of the OCR-D Ground Truth Structure corpus. Only the structure of the printed page is annotated. The corpus was created as a result of the DFG project OCR-D.
The GBN Dataset consists of German-Brazilian historical newspapers, along with their digital and binarized images and ground truth files.
The repo gt_structure_1_1 is part of the OCR-D Ground Truth Structure corpus. Only the structure of the printed page is annotated. The corpus was created as a result of the DFG project OCR-D.