I've searched online for the correct answer but can't find it. There is a lot of conflicting advice, and I don't know which approach is the right one.
Questions:
(1) Can each user agent have its own crawl-delay? (I assume so.)
(2) Where do you put the crawl-delay line for each user agent: before or after the Allow/Disallow lines?
(3) Does there have to be a blank line between each user-agent group?
References:
http://www.seopt.com/2013/01/robots-text-file/
http://help.yandex.com/webmaster/?id=1113851#1113858
Essentially, I'm looking for how the final robots.txt file should look, using the values in the examples below.
Thanks in advance.
# Allow only major search spiders
User-agent: Mediapartners-Google
Disallow:
Crawl-delay: 11

User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: Adsbot-Google
Disallow:
Crawl-delay: 13

User-agent: Googlebot-Image
Disallow:
Crawl-delay: 14

User-agent: Googlebot-Mobile
Disallow:
Crawl-delay: 15

User-agent: MSNBot
Disallow:
Crawl-delay: 16

User-agent: bingbot
Disallow:
Crawl-delay: 17

User-agent: Slurp
Disallow:
Crawl-delay: 18

User-agent: Yahoo! Slurp
Disallow:
Crawl-delay: 19

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/
(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?
# Allow only major search spiders
User-agent: *
Crawl-delay: 10

User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: Googlebot-Mobile
Disallow:

User-agent: MSNBot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Yahoo! Slurp
Disallow:

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/
Solution
(1) Can each user agent have its own crawl-delay?
Yes. Each record, begun by one or more User-agent lines, can have a Crawl-delay line. Note that Crawl-delay is not part of the original robots.txt specification. But it's no problem to include it for parsers that don't understand it, as the spec defines:
Unrecognised headers are ignored.
So older robots.txt parsers will simply ignore your Crawl-delay lines.
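For example, a minimal sketch (the delay values are illustrative), where each record carries its own Crawl-delay:

User-agent: Googlebot
Crawl-delay: 5
Disallow: /cgi-bin/

User-agent: bingbot
Crawl-delay: 10
Disallow: /cgi-bin/

Each record begins with its User-agent line(s), and the Crawl-delay applies only to the bots named in that record.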
(2) Where do you put the crawl-delay line for each user agent, before or after the Allow/Disallow lines?
It doesn't matter.
(3) Does there have to be a blank line between each user-agent group?
Yes. Records have to be separated by one or more blank lines. See the original spec:
The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL).
(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?
No. Bots look for the record that matches their user agent. Only if they don't find a matching record will they use the User-agent: * record. So in your example, all of the listed bots (like Googlebot, MSNBot, Yahoo! Slurp, etc.) would have no crawl delay at all.
Also note that you can't have several records with User-agent: *:
If the value is ‘*’, the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the “/robots.txt” file.
So parsers would probably look at the first record for User-agent: * (if no other record matched) and ignore the following ones. For your first example, that means URLs beginning with /ads/, /cgi-bin/, and /scripts/ would not be blocked.
And even if you had only one record with User-agent: *, those Disallow lines would only apply to bots that have no other matching record! As your comment # Block Directories for all spiders suggests, you want these URL paths blocked for all spiders, so you'd have to repeat the Disallow lines for every record.
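Putting this together, a sketch of what your final file could look like, assuming you want the 10-second delay from example (4), the directory blocks applied to every listed bot, and all other bots blocked entirely (repeat the same pattern for each of the other bots from your list):

# Allow only major search spiders, one record per bot
User-agent: Googlebot
Crawl-delay: 10
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: bingbot
Crawl-delay: 10
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

# ... one such record each for Mediapartners-Google, MSNBot, Slurp, etc. ...

# Block all other spiders
User-agent: *
Disallow: /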