How to Write a robots.txt File (Robots Exclusion Protocol)

联启网络 | Tools | 2026-06-16

Contents:

What is robots.txt
Basic syntax rules
Common examples
Common crawler names (User-agent)
Advanced usage
Validation tools
Best practices
Complete example
Notes

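This article walks through how to write a robots.txt file under the Robots Exclusion Protocol.

What is robots.txt

robots.txt is a plain-text file placed in a website's root directory. It tells search engine crawlers which URLs they may crawl and which they should stay away from. Compliance is voluntary: well-behaved crawlers honor the file, but nothing enforces it.

Basic syntax rules

File location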

https://yourdomain.com/robots.txt
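
Because the file always lives at the root of the host, its URL can be derived from any page URL on the same site. The following is a minimal sketch of that derivation; the helper name `robots_url` is made up for illustration and is not part of any standard tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Keeps the scheme and host (including any port) and replaces the
    path, query, and fragment with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yourdomain.com/blog/post?id=1"))
```

Note that each host is treated separately, so a subdomain needs its own robots.txt.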

Core directives

User-agent: [crawler name]
Disallow: [path that must not be crawled]
Allow: [path that may be crawled]
Sitemap: [sitemap URL]

Common examples

Example 1: Allow everything

User-agent: *
Disallow:

Example 2: Block everything

User-agent: *
Disallow: /

Example 3: Block specific directories

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
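
One way to check rules like these before deploying them is Python's standard `urllib.robotparser` module. The sketch below feeds the Example 3 rules to the parser and asks whether a few paths may be fetched. The crawler name `ExampleBot` and the helper `allowed` are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from Example 3, as they would appear in robots.txt.
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` on example.com."""
    return parser.can_fetch(agent, "https://example.com" + path)


for path in ("/admin/users", "/blog/", "/administrator"):
    print(path, allowed(path))
```

Rules are prefix matches, so `Disallow: /admin/` with its trailing slash blocks `/admin/users` but not `/administrator`.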

Example 4: Allow one crawler, block all others

# Allow Googlebot to crawl everything
User-agent: Googlebot
Disallow:
# Block every other crawler
User-agent: *
Disallow: /
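
The key point in this example is group selection: a crawler follows the group whose User-agent line names it and falls back to the `*` group only when no named group matches. A small sketch with Python's standard `urllib.robotparser` shows the effect; `SomeOtherBot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# The rules from Example 4: one named group plus a catch-all group.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot matches its own group, where the empty Disallow allows everything.
googlebot_ok = parser.can_fetch("Googlebot", "https://example.com/page")

# Any crawler without a named group falls back to the "*" group.
other_ok = parser.can_fetch("SomeOtherBot", "https://example.com/page")

print("Googlebot:", googlebot_ok)
print("SomeOtherBot:", other_ok)
```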

Example 5: Mixed rules

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Allow: /images/logo.png
Sitemap: https://example.com/sitemap.xml
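
In Example 5 the Allow line carves an exception out of a broader Disallow. Under the longest-match rule described in RFC 9309 and followed by Google, the most specific (longest) matching rule wins, and Allow wins a tie. The sketch below implements that rule for plain prefixes only (no wildcards). The function `is_allowed` and the tuple format are assumptions made for illustration. Some parsers instead apply the first matching rule in file order, so results can differ between tools.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Evaluate `path` against (kind, prefix) rules using longest match.

    kind is "allow" or "disallow". The longest matching prefix decides;
    on equal length, "allow" wins. No match at all means allowed.
    """
    best_len = -1
    allowed = True
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules do not take part
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = kind == "allow"
    return allowed


# The path rules from Example 5.
RULES = [
    ("disallow", "/cgi-bin/"),
    ("disallow", "/images/"),
    ("allow", "/images/logo.png"),
]

for path in ("/images/logo.png", "/images/banner.png", "/about"):
    print(path, is_allowed(path, RULES))
```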

Common crawler names (User-agent)

* - all crawlers
Googlebot - Google
Bingbot - Bing
Slurp - Yahoo
DuckDuckBot - DuckDuckGo
Baiduspider - Baidu
YandexBot - Yandex

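Advanced usage

Crawl delay

# Ask crawlers to wait 10 seconds between requests.
# Crawl-delay is a non-standard directive: Bing honors it, Googlebot ignores it.
User-agent: *
Crawl-delay: 10
Disallow: /private/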
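
A crawler written in Python can read this value through the standard `urllib.robotparser` module and pause between requests accordingly. The following is a minimal sketch; `ExampleBot` is a placeholder crawler name.

```python
from urllib.robotparser import RobotFileParser

# The crawl-delay rules shown above.
RULES = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns the delay in seconds for the matching group,
# or None when that group does not set one.
delay = parser.crawl_delay("ExampleBot")
print("Seconds to wait between requests:", delay)
```

A polite crawler would sleep for `delay` seconds between requests whenever the value is not `None`.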

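Pattern matching with wildcards

# Block all JPG, PNG, and GIF images.
# "*" matches any sequence of characters; "$" anchors the end of the URL.
User-agent: *
Disallow: /*.jpg$
Disallow: /*.png$
Disallow: /*.gif$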
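
To see how such patterns behave, it can help to translate them into regular expressions. The sketch below is one possible translation, not the implementation any search engine uses. The helper names `pattern_to_regex` and `matches` are invented for this example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    "*" becomes ".*"; a trailing "$" anchors the end of the path.
    Every other character is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if the rule pattern applies to path (matched from the start)."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.jpg$", "/photos/cat.jpg"))
print(matches("/*.jpg$", "/photos/cat.jpg?size=large"))
print(matches("/*.png$", "/img/logo.PNG"))
```

Because of the `$` anchor, a URL that carries a query string after `.jpg` is not covered by the rule, and matching is case-sensitive, so `.PNG` is not covered by a `.png` rule either.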

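Controlling dynamic parameters

# Block search pages that carry query parameters.
# Rules are prefix matches, so a trailing "*" is redundant and omitted here.
User-agent: *
Disallow: /search?
Disallow: /*?page=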

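Validation tools

Google Search Console - provides a robots.txt report, the successor to the standalone robots.txt Tester
Bing Webmaster Tools - supports robots.txt validation
Online validators:
- Google robots.txt Tester (legacy tool, since retired in favor of the Search Console report)
- Robots.txt Checker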

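Best practices

✅ Do

Place the file in the site root
Use UTF-8 encoding
Put one directive on each line
Write clear comments
Review and update the file regularly

❌ Avoid

Do not rely on robots.txt to protect sensitive information (crawlers can ignore it, and the file itself is publicly readable)
Do not over-restrict crawling, which hurts SEO
Do not leave syntax errors in the file
Do not overlook case sensitivity: paths are case-sensitive, while directive names are not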
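
Several of the points above (one directive per line, no syntax errors) can be checked mechanically. The following is a rough lint sketch and not a full validator. The function `lint`, the `KNOWN_FIELDS` set, and the warning wording are all assumptions made for illustration.

```python
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint(text: str) -> list[str]:
    """Return a list of warnings for the given robots.txt content."""
    warnings = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append(f"line {number}: missing ':' separator")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        name = field.lower()  # directive names are case-insensitive
        if name not in KNOWN_FIELDS:
            warnings.append(f"line {number}: unknown directive '{field}'")
        elif name == "user-agent":
            seen_agent = True
        elif name != "sitemap" and not seen_agent:
            warnings.append(f"line {number}: '{field}' appears before any User-agent")
        elif name in {"allow", "disallow"} and value and not value.startswith(("/", "*")):
            warnings.append(f"line {number}: path should start with '/'")
    return warnings


SAMPLE = """\
Disallow: /tmp/
User-agent: *
Disalow: /admin/
Disallow: private/
Allow /public/
"""

for warning in lint(SAMPLE):
    print(warning)
```

The sample deliberately contains four mistakes: a rule before any User-agent line, a misspelled directive, a path without a leading slash, and a line without a colon.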

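Complete example

# robots.txt for example.com
# Last updated: 2024-01-01
# Default rules: all crawlers may access public content
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Disallow: /admin/
Disallow: /includes/
# Allow Googlebot to access everything
User-agent: Googlebot
Disallow:
# Keep Baiduspider out of /images/ and slow it down.
# A crawler that matches a named group follows only that group, not the "*" rules.
User-agent: Baiduspider
Disallow: /images/
Crawl-delay: 5
# Sitemap locations
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml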
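
As a sanity check, the complete example can be loaded into Python's standard `urllib.robotparser` and queried per crawler. This is a sketch under the assumption that the file is served for example.com; the helper `check` is invented for this example, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The complete example above, without comments.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Disallow: /admin/
Disallow: /includes/

User-agent: Googlebot
Disallow:

User-agent: Baiduspider
Disallow: /images/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def check(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` on example.com."""
    return parser.can_fetch(agent, "https://example.com" + path)


for agent in ("Bingbot", "Googlebot", "Baiduspider"):
    print(agent, {p: check(agent, p) for p in ("/admin/", "/images/a.png", "/blog/")})

sitemaps = parser.site_maps()
print("Sitemaps:", sitemaps)
print("Baiduspider delay:", parser.crawl_delay("Baiduspider"))
```

One consequence worth noticing: Baiduspider is allowed into `/admin/` here, because its named group lists only `/images/` and the `*` rules no longer apply to it. If that is not intended, the shared Disallow lines need to be repeated inside the Baiduspider group.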