"Your own personal internet archive" (website archiving / crawler), a self-hosted website time machine
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!
Serverless replay of web archives directly in the browser
Run a high-fidelity browser-based web archiving crawler in a single Docker container
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Webrecorder Player for Desktop (macOS/Windows/Linux). (Built with Electron + Webrecorder)
WarcDB: Web crawl data as SQLite databases.
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
News crawling with StormCrawler, storing content as WARC
Bitextor generates translation memories from multilingual websites
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
Chrome extension to create WARC files from any webpage
CoCrawler is a versatile web crawler built with modern tooling and concurrency.
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
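Many of the tools above produce or consume the WARC format, a simple record-oriented container: each record is a `WARC/…` header block, a blank line, then `Content-Length` bytes of payload. As a rough illustration of the WarcDB idea (crawl data queryable as SQLite) — not WarcDB's actual schema or parser — here is a minimal sketch that parses records from an uncompressed WARC string and loads them into an in-memory SQLite table. The sample record and the table layout are made up for the example.

```python
import sqlite3

# Hypothetical sample: one uncompressed WARC response record.
SAMPLE_WARC = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "WARC-Date: 2024-01-01T00:00:00Z\r\n"
    "Content-Length: 16\r\n"
    "\r\n"
    "<html>hi</html>\n"
    "\r\n\r\n"
)

def parse_records(warc_text):
    """Return (headers, payload) tuples from an uncompressed WARC string."""
    records = []
    pos = 0
    while True:
        head_end = warc_text.find("\r\n\r\n", pos)
        if head_end == -1:
            break
        lines = warc_text[pos:head_end].split("\r\n")
        if not lines[0].startswith("WARC/"):
            break
        headers = dict(line.split(": ", 1) for line in lines[1:])
        length = int(headers["Content-Length"])
        payload = warc_text[head_end + 4 : head_end + 4 + length]
        records.append((headers, payload))
        pos = head_end + 4 + length + 4  # skip payload and trailing \r\n\r\n
    return records

# Load one row per record into SQLite; the schema here is illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE response (type TEXT, uri TEXT, date TEXT, body TEXT)")
for headers, payload in parse_records(SAMPLE_WARC):
    db.execute(
        "INSERT INTO response VALUES (?, ?, ?, ?)",
        (headers["WARC-Type"], headers["WARC-Target-URI"],
         headers["WARC-Date"], payload),
    )
row = db.execute("SELECT uri, type FROM response").fetchone()
print(row)  # ('http://example.com/', 'response')
```

Real WARC files are usually gzip-compressed per record and carry more header fields, so production code would use a proper library (e.g. Webrecorder's warcio) rather than hand-parsing like this.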
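The CDX indices mentioned above (Common Crawl, the Wayback Machine) are space-delimited lines, one capture per line, with fields such as the canonicalized URL key, a 14-digit timestamp, the original URL, MIME type, and HTTP status. The sketch below parses such lines using the seven-field order the Internet Archive's CDX server emits by default; the sample lines and digests are invented for illustration.

```python
from collections import namedtuple

# Default seven-field CDX server output order (urlkey ... length).
CdxEntry = namedtuple(
    "CdxEntry",
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
)

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a named tuple."""
    return CdxEntry(*line.split())

# Invented sample captures: one successful HTML page, one missing image.
lines = [
    "com,example)/ 20240101000000 http://example.com/ text/html 200 XYZDIGEST 1234",
    "com,example)/a.png 20240101000001 http://example.com/a.png image/png 404 DIG2 99",
]
entries = [parse_cdx_line(line) for line in lines]

# Typical index query: keep only successfully archived HTML captures.
ok_html = [e for e in entries if e.statuscode == "200" and e.mimetype == "text/html"]
print(len(ok_html), ok_html[0].timestamp)  # 1 20240101000000
```

In practice a CDX toolkit also handles pagination, the newer CDXJ (JSON-suffixed) variant, and URL canonicalization, which this sketch leaves out.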