robots.txt and sitemap configuration - 阿杰の博客

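robots.txt tells crawlers where they may and may not go. A sitemap tells crawlers which pages you have, how often they change, and how important they are. When the two files work together, search engines can index a site much more efficiently.

robots.txt

The file goes in the site's root directory, for example `https://example.com/robots.txt`. nginx and Apache need no extra configuration; as long as the file exists in the web root, it can be fetched. The basic structure: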

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
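
A quick way to sanity-check rules like the ones above before deploying them is Python's standard-library `urllib.robotparser`. The sketch below is only an illustration: the test paths are made up, and `load_rules` is a hypothetical helper name. Note that Python's parser applies rules in file order (first match wins), while Google uses the most specific matching rule, so results can differ when `Allow` and `Disallow` overlap. `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the example above
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
"""


def load_rules(robots_txt):
    """Parse robots.txt text (no network access) and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise each rule
for path in ("/", "/admin/login", "/private/notes.html", "/public/index.html"):
    url = "https://example.com" + path
    verdict = "allowed" if rules.can_fetch("*", url) else "blocked"
    print(f"{path}: {verdict}")

# Sitemap URLs declared in the file
print(rules.site_maps())
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()`, which download the file before parsing.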

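`User-agent: *` applies the rules to all crawlers. `Disallow` lists paths that must not be crawled. `Allow` whitelists a path and overrides a matching `Disallow`. `Sitemap` tells crawlers where your sitemap is.

Common scenarios:

– Keep crawlers out of the admin area: `Disallow: /wp-admin/`
– Allow all crawlers: `Disallow:` (left empty)
– Block everything: `Disallow: /`
– Target one crawler: `User-agent: Googlebot` applies only to Google

Note: robots.txt is only advisory. Crawlers that do not honor it (malicious scrapers, for example) will simply ignore it. Do not rely on robots.txt alone to protect sensitive data; enforce access control in code.

Testing: Google Search Console has a robots.txt report that shows the robots.txt files Google found for the site and any parsing problems. The older robots.txt tester, where you pasted rules to see which URLs were blocked, has been retired.

sitemap

sitemap.xml also goes in the root directory. The format is XML, but do not write it by hand; generate it with a tool. The simplest sitemap structure: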

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

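– `loc`: the page URL; must be absolute, including the protocol
– `lastmod`: last modification time, in `YYYY-MM-DD` format
– `changefreq`: update frequency; one of `always` `hourly` `daily` `weekly` `monthly` `yearly` `never`
– `priority`: priority from 0.0 to 1.0, a hint about which pages matter more

`changefreq` and `priority` are hints only. Google's documentation states that it ignores both, so an accurate `lastmod` is the field worth maintaining.

In real projects the sitemap is generated by a program. Taking Python as an example: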

import xml.etree.ElementTree as ET

def generate_sitemap(urls):
    """Write sitemap.xml from a list of dicts; only 'loc' is required."""
    urlset = ET.Element('urlset')
    urlset.set('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9')

    for url_info in urls:
        url = ET.SubElement(urlset, 'url')
        ET.SubElement(url, 'loc').text = url_info['loc']

        # Optional fields are emitted only when present
        for field in ('lastmod', 'changefreq', 'priority'):
            if field in url_info:
                ET.SubElement(url, field).text = str(url_info[field])

    tree = ET.ElementTree(urlset)
    tree.write('sitemap.xml', encoding='UTF-8', xml_declaration=True)

# Usage
pages = [
    {'loc': 'https://example.com/', 'lastmod': '2024-01-15', 'changefreq': 'daily', 'priority': 1.0},
    {'loc': 'https://example.com/about', 'lastmod': '2024-01-10', 'changefreq': 'monthly', 'priority': 0.8},
]
generate_sitemap(pages)

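When a site has more than 50,000 URLs, or the sitemap file exceeds 50 MB uncompressed, it must be split into several sitemaps that are referenced from a sitemap index file: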

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap1.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap2.xml</loc>
    <lastmod>2024-01-14</lastmod>
  </sitemap>
</sitemapindex>
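
The split can be automated. The following is a minimal sketch, assuming the URL list already fits in memory; `write_split_sitemaps`, the `sitemapN.xml` naming, and the `sitemap_index.xml` file name are illustrative choices, not part of any standard. It enforces only the per-file URL count, not the 50 MB size limit.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file URL limit mentioned above


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_split_sitemaps(urls, base_url, out_dir=".", lastmod=None,
                         max_urls=MAX_URLS_PER_FILE):
    """Write sitemap1.xml, sitemap2.xml, ... plus sitemap_index.xml.

    Returns the file names written, with the index file last.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    written = []

    for n, chunk in enumerate(chunked(urls, max_urls), start=1):
        name = f"sitemap{n}.xml"

        # One <urlset> file per chunk of URLs
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for loc in chunk:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = loc
        ET.ElementTree(urlset).write(str(out / name), encoding="UTF-8",
                                     xml_declaration=True)
        written.append(name)

        # Matching <sitemap> entry in the index
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url.rstrip('/')}/{name}"
        if lastmod:
            ET.SubElement(entry, "lastmod").text = lastmod

    ET.ElementTree(index).write(str(out / "sitemap_index.xml"),
                                encoding="UTF-8", xml_declaration=True)
    written.append("sitemap_index.xml")
    return written
```

Calling `write_split_sitemaps(all_urls, "https://example.com", out_dir="public")` would write the chunk files and the index into `public/`; the `Sitemap:` line in robots.txt should then point at the index file.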

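Submitting to search engines

Writing the files without submitting them is wasted effort. Submit them manually in Google Search Console and on the Baidu Search Resource Platform (百度搜索资源平台).

Google: Search Console -> Sitemaps -> enter the path to sitemap.xml -> Submit. The index status usually appears after a few minutes, including how many URLs were submitted and how many were indexed.

Baidu: Baidu Search Resource Platform -> Site Management (站点管理) -> Fetch Diagnostics (抓取诊断) -> enter the sitemap URL and submit. Menu names may differ between versions of the platform. Baidu's review is slower; expect half a day to two days.

If the site is hosted on a Rainyun (雨云) server, you can copy robots.txt and sitemap.xml into the web root through the BT Panel (宝塔面板) or over SSH. In my experience the servers respond quickly, so crawl latency after submission is low and indexing goes noticeably better than on cheap shared hosting.

Common pitfalls

1. The Sitemap path in robots.txt is wrong. Check that it is reachable by opening it directly in a browser and confirming the file loads.
2. The sitemap includes noindex pages or pages that 301-redirect. Crawlers ignore these, so they waste crawl budget.
3. The sitemap is generated dynamically on every request with no caching, which causes high load. Generate a static file on a schedule with cron instead.
4. The Sitemap path in robots.txt was never updated. After changing the domain or the directory structure, update it as well.

Verifying the configuration

A quick check with curl: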

curl -I https://example.com/robots.txt
# Expect 200 and Content-Type: text/plain

curl -I https://example.com/sitemap.xml
# Expect 200 and Content-Type: application/xml (text/xml is also common)

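You can also open both URLs in a browser and read the contents. With robots.txt and the sitemap in place, newly published posts or pages can be picked up by search engines within an hour in the best case. Without them, crawlers have to rely on external links or chance discovery, which is much less efficient.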
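
Beyond HTTP status checks, the sitemap's contents can be validated offline before submission. The sketch below is a rough illustration under a few assumptions: `find_sitemap_problems` is a hypothetical helper, and it checks only three things (the URL count, that each `loc` is an absolute http(s) URL, and that `lastmod` starts with a `YYYY-MM-DD` date). It does not detect noindex pages or redirects, which would require fetching each URL.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50000
# Accepts YYYY-MM-DD, optionally followed by a time part (not validated further)
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}(T.+)?$")


def find_sitemap_problems(xml_text):
    """Return a list of human-readable problems found in a <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    entries = root.findall("sm:url", NS)
    problems = []

    if len(entries) > MAX_URLS:
        problems.append(f"too many URLs: {len(entries)} > {MAX_URLS}")

    for entry in entries:
        loc = (entry.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        parsed = urlparse(loc)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"loc is not an absolute URL: {loc!r}")

        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if lastmod is not None and not DATE_RE.match(lastmod.strip()):
            problems.append(f"lastmod is not a YYYY-MM-DD date: {lastmod!r} ({loc})")

    return problems
```

A typical use would be `find_sitemap_problems(open("sitemap.xml", "rb").read())` in the same cron job that regenerates the file, aborting the deployment if the returned list is not empty.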

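Rainyun (雨云) is a long-established cloud provider in China offering cost-effective cloud servers and virtual hosting. I run several projects on it, and both speed and stability have been good. Registering through https://www.rainyun.com/SAJA_ gets you a 50% discount coupon, in case it is useful.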

Copyright belongs to the author. Do not repost without permission.
