content-extraction · GitHub Topics

Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.

batch-processing claude content-extraction data-collection firecrawl firecrawl-ai llm-tools mcp-server model-context-protocol search-api web-crawler web-scraping javascript-rendering

JavaScript 2.43 k

9 days ago

currentslab / extractnet

#Computer Science# A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

content-extraction webscraping web-scraping text-mining news Machine Learning Python

HTML 274

1 year ago

graphlit / graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

claude content-extraction data-collection llm-tools mcp-server model-context-protocol search-api unstructured-data web-crawler web-scraping

TypeScript 187

1 day ago

mvasilkov / readability2

Readability2 converts HTML to plain text.

JavaScript readability HTML plaintext content-extraction

TypeScript 109

6 years ago

tuffstuff9 / nextjs-pdf-parser

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

content-extraction filepond Next pdf-parser pdf-parsing

TypeScript 59

1 year ago

gregors / boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

content-extraction webscraping news

Ruby 43

4 years ago

nikitautiu / learnhtml

#Computer Science# Web content extraction using machine learning

Deep Learning HTML content-extraction

HTML 33

4 years ago

oiwn / dom-content-extraction

#Web Crawler# DOM Based Content Extraction via Text Density

scraping content-extraction

Rust 26

25 days ago

spences10 / mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

content-extraction documentation-tool llm-tools mcp model-context-protocol text-extraction web-scraping

JavaScript 25

8 days ago

pdfix / pdfix_sdk_example_cpp

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

pdfua digital-signature pdf-converter pdf-manipulation extract-data watermark HTML metadata conversion converter tagging wcag sign pdf content-extraction Web Accessibility (a11y)

C++ 20

1 month ago

gdamdam / sumo

#Natural Language Processing# Tool that extracts the text from web article URLs and produces word frequencies, entity recognition, an automatic summary, and more

Natural Language Processing content-extraction nltk entity-recognition semantic-analysis

Python 20

6 years ago

timoteostewart / benson

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

content-extraction web-scraping productivity

Python 14

5 months ago

bencmc / youtube_video_summarizer

#Natural Language Processing# This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

content-extraction gpt-35-turbo natural Natural Language Processing openai Python text-processing video-processing youtube-api langchain-python Streamlit

Python 13

2 years ago

peremenov / seize

Seize is a lightweight Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader

content-extraction Document Object Model (DOM) readability extract reader

HTML 12

8 years ago

zeoagency / mobile-first-indexing-tool

Mobile First Indexing Tool

Search Engine Optimization (SEO) content-extraction aws-lambda lighthouse

Python 12

3 years ago

LandWhale2 / TD-Spider

#Web Crawler# Simple web crawler in Go based on text density

Go web-crawler content-extraction data-mining Document Object Model (DOM) Open Source scraping

Go 12

2 years ago

leroyanders / acrticle-scrapper

This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the m...

article-parser content-extraction Python web-scraping

Python 5

1 year ago