#Web Crawler# Polite, slim, and concurrent web crawler.
advertools - online marketing productivity and analysis tools
#Web Crawler# A simple and flexible web crawler that follows robots.txt policies and crawl delays.
Tame the robots crawling and indexing your Nuxt site.
A robots.txt exclusion protocol implementation for the Go language.
#Web Crawler# A simple but powerful web crawler library for .NET.
A set of reusable Java components that implement functionality common to any web crawler
#Web Crawler# Determine whether a page may be crawled, based on robots.txt, robots meta tags, and robots headers.
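As a sketch of what the robots.txt part of such a check involves, here is a minimal example using Python's standard-library `urllib.robotparser` (not this library's own API; the user agent and URLs are illustrative):

```python
# Minimal robots.txt permission check using Python's stdlib
# urllib.robotparser. "MyBot" and example.com are illustrative.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyBot", "https://example.com/index.html"))    # True: not disallowed
print(parser.can_fetch("MyBot", "https://example.com/private/data"))  # False: matches Disallow
```

Note that the stdlib parser only covers robots.txt itself; honoring robots meta tags and `X-Robots-Tag` response headers (which the library above also handles) requires inspecting each fetched page.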
Ultimate Website Sitemap Parser
Opt-out tool to check copyright reservations in a way that even machines can understand.
#Web Crawler# Open-source Python-based SEO web crawler.
NodeJS robots.txt parser with support for wildcard (*) matching.
Known tags and settings suggested to opt out of having your content used for AI training.
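One widely published example of such a setting (illustrative, not a complete opt-out list) is blocking OpenAI's documented GPTBot crawler via robots.txt:

```
# Disallow OpenAI's GPTBot from crawling the whole site.
User-agent: GPTBot
Disallow: /
```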
Makes it easy to add a robots.txt, sitemap, and web app manifest to your Astro app at build time.
grobotstxt is a native Go port of Google's robots.txt parser and matcher library.
Gatsby plugin that automatically creates robots.txt for your site
#Search# Simple robots.txt template. Keeps unwanted robots out (disallow) and whitelists (allows) legitimate user agents. Useful for all websites.
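A hypothetical illustration of the disallow/whitelist pattern such a template follows (not the repo's actual file): block everything by default, then allow known-good crawlers by name.

```
# Default: keep all robots out.
User-agent: *
Disallow: /

# Whitelist a legitimate crawler.
User-agent: Googlebot
Allow: /
```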
#Web Crawler# 🤖 A curated list of websites that restrict access to AI agents, AI crawlers, and GPTs.
#Web Crawler# ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot uses Retrieval-Augmented Generation and web scraping to re...