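I'm currently working on a project where I need to track every crawler visit.
I know you're supposed to use HTTP_USER_AGENT, but I'm not sure how to structure the code for this. I also know the user agent can easily be changed, so I'd like to know whether additional parameters can be checked to guard against spoofing.
Sample code of what I'm trying to do: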
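<?php
// The header may be absent, so fall back to an empty string
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// Substring match on the User-Agent header; trivially spoofable on its own
if (strpos($user_agent, 'Google') !== false) {
    echo "Googlebot is here";
}
?>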
Thanks
You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you’re concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.
For example:
host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
Google doesn’t post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).
You can perform a reverse DNS lookup:
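<?php
function validateGoogleBotIP($ip) {
    // Reverse lookup, e.g. "crawl-66-249-66-1.googlebot.com"
    $hostname = gethostbyaddr($ip);
    if ($hostname === false || !preg_match('/\.googlebot\.com$/i', $hostname)) {
        return false;
    }
    // Forward lookup: the name must resolve back to the same IP.
    // Note: gethostbyname() only returns IPv4 addresses.
    return gethostbyname($hostname) === $ip;
}

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (strpos($userAgent, 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}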
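Since the question asks about tracking every crawler rather than only Googlebot, the same reverse-then-forward DNS idea can be sketched in a table-driven form. This is only a rough sketch under stated assumptions: the CRAWLER_DOMAINS table, the checkCrawler name and the injectable $reverse / $forward parameters are all made up for illustration, and the hostname suffixes listed should be checked against each crawler's own documentation before relying on them.

```php
<?php
// Hypothetical table: User-Agent token => hostname suffixes accepted for it.
// The suffixes are assumptions; confirm them in each crawler's documentation.
const CRAWLER_DOMAINS = [
    'Googlebot' => ['.googlebot.com', '.google.com'],
    'bingbot'   => ['.search.msn.com'],
];

/**
 * Reports which crawler the User-Agent claims to be and whether the IP
 * passes a reverse DNS lookup followed by a confirming forward lookup.
 * $reverse and $forward default to PHP's DNS functions and can be
 * replaced with stubs so the logic can be exercised without network access.
 */
function checkCrawler(string $userAgent, string $ip,
                      ?callable $reverse = null, ?callable $forward = null): array
{
    $reverse = $reverse ?? 'gethostbyaddr';   // IP -> hostname
    $forward = $forward ?? 'gethostbynamel';  // hostname -> list of IPv4 addresses

    foreach (CRAWLER_DOMAINS as $token => $suffixes) {
        if (stripos($userAgent, $token) === false) {
            continue; // this crawler is not being claimed
        }

        $host = $reverse($ip);
        // gethostbyaddr() returns false on malformed input and the
        // unmodified IP when no PTR record exists.
        if (!is_string($host) || $host === $ip) {
            return ['claimed' => $token, 'verified' => false];
        }

        // Step 1: the reverse name must end in one of the expected domains.
        $host = strtolower(rtrim($host, '.'));
        $suffixOk = false;
        foreach ($suffixes as $suffix) {
            if (substr($host, -strlen($suffix)) === $suffix) {
                $suffixOk = true;
                break;
            }
        }

        // Step 2: the name must resolve back to the connecting IP, because
        // whoever controls an IP block can set its PTR record to anything.
        $ips = $suffixOk ? $forward($host) : false;
        $verified = is_array($ips) && in_array($ip, $ips, true);

        return ['claimed' => $token, 'verified' => $verified];
    }

    return ['claimed' => null, 'verified' => false];
}
```

In a request handler this could be called with $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REMOTE_ADDR'], and the returned 'claimed' and 'verified' values written to whatever visit log you keep. DNS lookups are slow, so caching the result per IP is probably worth considering on a busy site.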