#Web Crawler# Firecrawl is an API service that crawls URLs and converts them into clean markdown or structured data.
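Services like Firecrawl turn raw HTML into LLM-ready markdown. As an illustration of the underlying idea only (not Firecrawl's actual implementation or API), a minimal sketch using just the Python standard library might map headings and paragraphs to markdown:

```python
from html.parser import HTMLParser

class MarkdownExtractor(HTMLParser):
    """Toy HTML-to-markdown converter: handles only <h1>-<h6> and <p> blocks."""
    def __init__(self):
        super().__init__()
        self.out = []          # collected markdown blocks
        self.prefix = ""       # markdown prefix for the current block ("# ", "## ", ...)
        self.capture = False   # whether we are inside a block we keep

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.prefix = "#" * int(tag[1]) + " "
            self.capture = True
        elif tag == "p":
            self.prefix = ""
            self.capture = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6", "p"):
            self.capture = False

    def handle_data(self, data):
        text = data.strip()
        if self.capture and text:
            self.out.append(self.prefix + text)
            self.prefix = ""   # only the first chunk of a heading gets the prefix

def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    return "\n\n".join(parser.out)

print(html_to_markdown("<h1>Title</h1><p>Hello world.</p>"))
# → # Title
#
#   Hello world.
```

Real converters additionally handle lists, tables, links, boilerplate removal, and JavaScript-rendered pages, which is where services like Firecrawl earn their keep.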
#Web Crawler# Crawlee - a web scraping and browser automation library for Node.js.
#Web Crawler# Distributed web crawler admin platform for managing spiders, regardless of language or framework.
#Web Crawler# A next-generation crawler platform that defines crawling workflows graphically, so crawlers can be built without writing any code.
#Web Crawler# A collection of awesome web crawlers and spiders in different languages.
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
#Web Crawler# Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
#Web Crawler# Apache Nutch is an extensible and scalable web crawler.
#Web Crawler# Cross-platform C# web crawler framework built for speed and flexibility.
The Ultimate Information Gathering Toolkit
#Web Crawler# An easy-to-use Python crawler framework. QQ group: 597510560.
#Search# Internet search engine for text-oriented websites, indexing the small, old, and weird web.
#Web Crawler# A scalable, mature, and versatile web crawler based on Apache Storm.
#Web Crawler# A versatile Ruby web spidering library that can spider a site, multiple domains, certain links, or infinitely. Spidr is designed to be fast and easy to use.
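Whatever the language, spidering libraries like Spidr share one core loop: breadth-first traversal of the link graph with a visited set to avoid re-fetching pages and looping on cycles. A minimal sketch, with a hypothetical in-memory link graph standing in for real HTTP fetches:

```python
from collections import deque

# Hypothetical in-memory "site": URL -> outgoing links (stands in for HTTP fetching).
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/blog", "/about"],
}

def crawl(start, fetch_links):
    """Breadth-first crawl: returns URLs in the order they were visited."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:     # visited set prevents cycles and duplicate fetches
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/", lambda url: SITE.get(url, [])))
# → ['/', '/about', '/blog', '/blog/post-1', '/blog/post-2']
```

Production crawlers add to this skeleton: robots.txt checks, per-host rate limiting, retries, and (in the distributed platforms above) a shared frontier queue instead of a local deque.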
#Web Crawler# Automate webpages at scale and scrape web data completely and accurately, with high-performance distributed AI-RPA.
Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.
#Web Crawler# CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile).
#Web Crawler# Run a high-fidelity browser-based web archiving crawler in a single Docker container.
#Web Crawler# Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining".
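Craw4LLM's central idea is to crawl best-first by an estimated pretraining value of each page rather than breadth-first. A toy sketch of that frontier scheduling, where the graph and the scoring function are stand-ins (the paper uses a learned pretraining-influence scorer, not URL length):

```python
import heapq

def best_first_crawl(start, fetch_links, score, budget):
    """Visit up to `budget` pages, always expanding the highest-scored frontier URL."""
    frontier = [(-score(start), start)]   # max-heap emulated by negating scores
    seen = {start}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited

# Hypothetical graph and scorer: here a longer URL means a higher "pretraining value".
GRAPH = {"a": ["bb", "c"], "bb": ["dddd"], "c": [], "dddd": []}
pages = best_first_crawl("a", lambda u: GRAPH.get(u, []), len, budget=3)
print(pages)  # → ['a', 'bb', 'dddd']
```

Under a fixed crawl budget, the scheduler skips the low-value page "c" entirely, which is exactly the efficiency argument the paper makes for LLM-pretraining crawls.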