web-crawler – Typical politeness factor for a web crawler?

What is a typical politeness factor for a web crawler?

Apart from always obeying the "Disallow:" directive and the non-standard "Crawl-delay:" directive in robots.txt.

But if a site does not specify an explicit crawl delay, what should the default value be set to?
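(For reference, the non-standard "Crawl-delay:" value has to be extracted from the fetched robots.txt by the crawler itself. A minimal C++ sketch of that extraction is below; the helper name and the simplified line-by-line parsing are assumptions for illustration only, and a real parser would also match the directive against the appropriate User-agent section.)

    #include <cctype>
    #include <cstdlib>
    #include <sstream>
    #include <string>

    // Illustrative helper (not part of any particular crawler): extract a
    // "Crawl-delay:" value, in seconds, from the text of a robots.txt file.
    // Returns 0.0 when no such directive is present.
    double parseCrawlDelay(const std::string& robotsTxt)
    {
        std::istringstream in(robotsTxt);
        std::string line;
        const std::string key = "crawl-delay:";
        while (std::getline(in, line))
        {
            // Case-insensitive search for the "crawl-delay:" prefix.
            std::string lower;
            for (char c : line)
                lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            std::string::size_type pos = lower.find(key);
            if (pos != std::string::npos)
                return std::atof(line.c_str() + pos + key.size());
        }
        return 0.0;
    }

For example, parseCrawlDelay("User-agent: *\nCrawl-delay: 10\n") would return 10.0.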

Solution

The algorithm we use is:
    #include <algorithm>   // std::min, std::max
    #include <cmath>       // std::exp, std::log, std::pow

    // If we are blocked by robots.txt
    // Make sure it is obeyed.
    // Our bot's user-agent string contains a link to an HTML page explaining this,
    // and an email address that site owners can contact to be added to an opt-out
    // list, so that we never even consider their domain in the future.

    // If we receive more than 5 consecutive responses with an HTTP response code
    // of 500+ (or timeouts), then we assume the domain is either under heavy load
    // and does not need us adding to it, or the URLs we are crawling are completely
    // wrong and causing problems. Either way we suspend crawling of this domain
    // for 4 hours.

    // There is a non-standard parameter in robots.txt that defines a minimum
    // crawl delay. If it exists then obey it.
    //
    // see: http://www.searchtools.com/robots/robots-txt-elements.html
    double PolitenessFromRobotsTxt = getRobotPolitness();


    // Work-size politeness
    // Large popular domains are designed to handle load, so we can use a smaller
    // delay on these sites than for smaller domains (thus smaller domains hosted
    // by mom and pop on the family PC under the desk in the office are crawled
    // slowly).
    //
    // But the max delay here is 5 seconds:
    //
    // domainSize => Range 0 -> 10
    //
    double workSizeTime = std::min(std::exp(2.52166863221 - 0.530185027289 * std::log(domainSize)), 5.0);
    //
    // You can find out how important we think your site is here:
    // http://www.opensiteexplorer.org
    // Look at the Domain Authority and divide by 10.
    // Note: This is not exactly the number we use, but the two numbers are highly
    // correlated, so it will usually give you a fair indication.


    // Take into account the response time of the last request.
    // If the server is under heavy load and taking a long time to respond,
    // then we slow down the requests. Note: time-outs are handled above.
    double responseTime = std::pow(0.203137637588 + 0.724386103344 * lastResponseTime, 2);

    // Use the slower of the calculated times.
    double result = std::max(workSizeTime, responseTime);

    // Never faster than the crawl-delay directive.
    result = std::max(result, PolitenessFromRobotsTxt);

    // Set a minimum delay,
    // so we never hit a site more often than every 10th of a second.
    result = std::max(result, 0.1);

    // The maximum delay we have is every 2 minutes.
    result = std::min(result, 120.0);
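To get a feel for what these formulas produce, here is a small self-contained sketch that wraps the calculation above in a single function and feeds it a couple of sample values; the function name, parameter names, and sample inputs are assumptions made for illustration, not part of the original crawler:

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    // Sketch of the delay calculation above as one function.
    // domainSize      : importance score in the range 0 -> 10 (larger = more popular)
    // lastResponseSec : response time of the previous request, in seconds
    // robotsDelaySec  : Crawl-delay from robots.txt, or 0 when none was given
    double computeDelaySeconds(double domainSize, double lastResponseSec, double robotsDelaySec)
    {
        // Popular domains get a shorter delay, capped at 5 seconds.
        double workSizeTime =
            std::min(std::exp(2.52166863221 - 0.530185027289 * std::log(domainSize)), 5.0);

        // Slow responses push the delay up.
        double responseTime =
            std::pow(0.203137637588 + 0.724386103344 * lastResponseSec, 2);

        double result = std::max(workSizeTime, responseTime);
        result = std::max(result, robotsDelaySec);   // never faster than Crawl-delay
        result = std::max(result, 0.1);              // floor: one request per 100 ms
        result = std::min(result, 120.0);            // ceiling: one request per 2 minutes
        return result;
    }

    int main()
    {
        // Large, fast site: domainSize 10, 0.2 s responses, no Crawl-delay -> about 3.7 s
        std::printf("%.2f\n", computeDelaySeconds(10.0, 0.2, 0.0));
        // Small, slow site: domainSize 2, 3 s responses, no Crawl-delay -> roughly 5.6 s
        std::printf("%.2f\n", computeDelaySeconds(2.0, 3.0, 0.0));
        return 0;
    }

In other words, a healthy, popular domain ends up being hit every few seconds at most, while a small or struggling one is backed off toward the 5-second work-size cap or beyond, up to the 2-minute ceiling.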
