"Your own personal internet archive" (website archiving / crawler), a self-hosted website time machine
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!
Serverless replay of web archives directly in the browser
Run a high-fidelity browser-based web archiving crawler in a single Docker container
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Webrecorder Player for Desktop (macOS/Windows/Linux). (Built with Electron + Webrecorder)
WarcDB: Web crawl data as SQLite databases.
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
News crawling with StormCrawler, storing content as WARC
Bitextor generates translation memories from multilingual websites
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
Chrome extension to create WARC files from any webpage
CoCrawler is a versatile web crawler built with modern tooling and concurrency.
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
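Many of the tools above produce or consume the WARC format, a simple record-oriented container: each record is a `WARC/…` header block, a blank line, then `Content-Length` bytes of payload. As a rough illustration of the WarcDB idea (crawl data queryable as SQLite) — not WarcDB's actual schema or parser — here is a minimal sketch that parses records from an uncompressed WARC string and loads them into an in-memory SQLite table. The sample record and the table layout are made up for the example.

```python
import sqlite3

# Hypothetical sample: one uncompressed WARC response record.
SAMPLE_WARC = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "WARC-Date: 2024-01-01T00:00:00Z\r\n"
    "Content-Length: 16\r\n"
    "\r\n"
    "<html>hi</html>\n"
    "\r\n\r\n"
)

def parse_records(warc_text):
    """Return (headers, payload) tuples from an uncompressed WARC string."""
    records = []
    pos = 0
    while True:
        head_end = warc_text.find("\r\n\r\n", pos)
        if head_end == -1:
            break
        lines = warc_text[pos:head_end].split("\r\n")
        if not lines[0].startswith("WARC/"):
            break
        headers = dict(line.split(": ", 1) for line in lines[1:])
        length = int(headers["Content-Length"])
        payload = warc_text[head_end + 4 : head_end + 4 + length]
        records.append((headers, payload))
        pos = head_end + 4 + length + 4  # skip payload and trailing \r\n\r\n
    return records

# Load one row per record into SQLite; the schema here is illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE response (type TEXT, uri TEXT, date TEXT, body TEXT)")
for headers, payload in parse_records(SAMPLE_WARC):
    db.execute(
        "INSERT INTO response VALUES (?, ?, ?, ?)",
        (headers["WARC-Type"], headers["WARC-Target-URI"],
         headers["WARC-Date"], payload),
    )
row = db.execute("SELECT uri, type FROM response").fetchone()
print(row)  # ('http://example.com/', 'response')
```

Real WARC files are usually gzip-compressed per record and carry more header fields, so production code would use a proper library (e.g. Webrecorder's warcio) rather than hand-parsing like this.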
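The CDX indices mentioned above (Common Crawl, the Wayback Machine) are space-delimited lines, one capture per line, with fields such as the canonicalized URL key, a 14-digit timestamp, the original URL, MIME type, and HTTP status. The sketch below parses such lines using the seven-field order the Internet Archive's CDX server emits by default; the sample lines and digests are invented for illustration.

```python
from collections import namedtuple

# Default seven-field CDX server output order (urlkey ... length).
CdxEntry = namedtuple(
    "CdxEntry",
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
)

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a named tuple."""
    return CdxEntry(*line.split())

# Invented sample captures: one successful HTML page, one missing image.
lines = [
    "com,example)/ 20240101000000 http://example.com/ text/html 200 XYZDIGEST 1234",
    "com,example)/a.png 20240101000001 http://example.com/a.png image/png 404 DIG2 99",
]
entries = [parse_cdx_line(line) for line in lines]

# Typical index query: keep only successfully archived HTML captures.
ok_html = [e for e in entries if e.statuscode == "200" and e.mimetype == "text/html"]
print(len(ok_html), ok_html[0].timestamp)  # 1 20240101000000
```

In practice a CDX toolkit also handles pagination, the newer CDXJ (JSON-suffixed) variant, and URL canonicalization, which this sketch leaves out.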