tika · GitHub Topics

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Java tika metadata extraction content

Java 3.09 k

1 day ago

dadoonet / fscrawler

#Web crawler# Elasticsearch File System Crawler (FS Crawler)

Java elasticsearch crawler tika

Java 1.4 k

8 days ago

yobix-ai / extractous

#Natural language processing# Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

extraction pdf tika unstructured unstructured-data data-pipelines docx etl etl-pipelines large-language-models machine-learning natural-language-processing OCR pdf-parser rag Rust

Rust 1.18 k

7 months ago

USCDataScience / sparkler

#Search# Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

solr web-crawler Apache Spark nutch tika big-data information-retrieval search-engine search distributed-systems

Java 416

2 years ago

ICIJ / extract

A cross-platform command line tool for parallelised content extraction and analysis.

tika etl index solr

Java 245

18 hours ago

KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform

tika extract-text

Rich Text Format 206

1 year ago

apache / tika-docker

Convenience Docker images for Apache Tika Server

Docker Image tika

Shell 192

24 days ago

shebinleo / pdf2html

pdf2html is a module that converts PDF files to HTML pages using Apache Tika. It can also generate thumbnail images for PDF files using Apache PDFBox.

Node.js pdf-converter tika pdfbox thumbnail

JavaScript 184

7 days ago

chrismattmann / MLwithTensorFlow2ed

#Computer science# Code for Machine Learning with TensorFlow: 2nd Edition, published by Manning Publications

Tensorflow machine-learning manning-publications tika Python Docker deep-learning regression classification clustering autoencoder

Jupyter Notebook 140

3 years ago

nasa-jpl-memex / memex-explorer

#Web crawler# Viewers for statistics and dashboarding of Domain Search Engine data

anaconda crawler dashboard nutch apache tika

Python 124

9 years ago

vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

apache tika text-extraction text-recognition OCR php-library

PHP 116

4 months ago

chrismattmann / tika-similarity

#Computer science# Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on metadata features.

machine-learning clustering information-retrieval cosine-similarity Python tika

Python 108

3 months ago

nasa-jpl-memex / image_space

#Computer science# Interactive image similarity and visual search and retrieval application

image-recognition image-viewer image-analysis Python deep-learning machine-vision kitware machine-learning alexnet tika

JavaScript 96

1 year ago

chrismattmann / imagecat

ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest tens of millions of files (images, but could be extended to other files) in place, and to extrac...

memex solr tika apache

Java 95

7 years ago