min*_*nur 223
I use the following code, which seems to work fine:
function _bot_detected() {
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}
Update 16-06-2017 https://support.google.com/webmasters/answer/1061943?hl=en
Added mediapartners.
Óla*_*age 75
You can then use $_SERVER['HTTP_USER_AGENT'] to check whether the agent is a spider.
if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
// what to do
}
Juk*_*bom 19
Check $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/All/
Or, more specifically for crawlers:
http://www.useragentstring.com/pages/Crawlerlist/
If you want to, say, log the number of visits by the most common search engine crawlers, you could use
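$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) .')/';
$matches = array();
// The user agent is lowercased first, so no case-insensitive modifier is needed.
// (The fourth argument of preg_match is an integer flag, not a modifier string.)
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
$numMatches = preg_match($pattern, $userAgent, $matches);
if($numMatches > 0) // Found a match
{
    // $matches[1] contains the matched text, either 'google' or 'yahoo'
}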
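To make the "log the number of visits" idea a bit more concrete, here is a minimal sketch that tallies hits per crawler from a list of user agent strings. The function names `match_crawler` and `count_crawler_visits` are made up for this illustration, the crawler names are assumed to be lowercase, and the in-memory array stands in for whatever storage (log file, database) you would really use.

```php
<?php
// Hypothetical helper: returns the lowercase name of the first known crawler
// found in the user agent, or null if none matches.
// Assumption: the names in $crawlers are given in lowercase.
function match_crawler(string $userAgent, array $crawlers): ?string
{
    if (count($crawlers) === 0) {
        return null; // an empty alternation would match every string
    }
    // preg_quote keeps names with special characters from breaking the pattern
    $quoted = array_map(function ($name) {
        return preg_quote($name, '/');
    }, $crawlers);
    $pattern = '/(' . implode('|', $quoted) . ')/i';

    if (preg_match($pattern, $userAgent, $matches) === 1) {
        return strtolower($matches[1]);
    }
    return null;
}

// Hypothetical in-memory tally; a real site would persist the counts somewhere.
function count_crawler_visits(array $userAgents, array $crawlers): array
{
    $counts = array_fill_keys($crawlers, 0);
    foreach ($userAgents as $ua) {
        $name = match_crawler($ua, $crawlers);
        if ($name !== null) {
            $counts[$name]++;
        }
    }
    return $counts;
}
```

On a live page you would call `match_crawler($_SERVER['HTTP_USER_AGENT'] ?? '', $interestingCrawlers)` once per request and increment a stored counter instead of looping over a list.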
小智 17
You can check whether it is a search engine with this function:
<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        'Facebook' => 'facebookexternalhit',
    );
    // Build one alternation pattern from the crawler substrings.
    // It is better to cache this string than to rebuild it on every call.
    $crawlers_agents = implode('|', array_map('preg_quote', $crawlers));
    // Case-insensitive search for any crawler substring inside the user agent.
    // (The original strpos() call had its arguments swapped and never matched
    // a real user agent.)
    return preg_match('/' . $crawlers_agents . '/i', $USER_AGENT) === 1;
}
?>
Then you can use it like this:
<?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?>
mgu*_*utt 11
I use this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
// is bot
}
In addition, I use a whitelist so that only unwanted bots get blocked:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
// allowed bot
}
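An unwanted bot (= a false-positive user) can then solve a captcha to unblock itself for 24 hours. Since nobody ever solves this captcha, I know the detection produces no false positives, so the bot detection seems to work perfectly.
Note: my whitelist is based on Facebook's robots.txt.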
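For illustration, the detect / whitelist / captcha flow described above could be wired together roughly like this. This is only a sketch: the function names, the shortened patterns, and the idea of keeping the unlock expiry as a Unix timestamp (for example in a session) are assumptions, not the author's actual implementation.

```php
<?php
// Decide what to do with a request.
// $unlockedUntil is a hypothetical Unix timestamp stored when the visitor
// last solved the captcha (0 if never); $now is the current time.
function classify_request(string $userAgent, int $unlockedUntil, int $now): string
{
    // Shortened versions of the two patterns, for readability only
    $botPattern = '/bot|crawl|curl|spider|search|libwww-perl/i';
    $allowedPattern = '/bingbot|googlebot|msnbot|slurp|yandex/i';

    if (!preg_match($botPattern, $userAgent)) {
        return 'allow';   // looks like a normal visitor
    }
    if (preg_match($allowedPattern, $userAgent)) {
        return 'allow';   // whitelisted bot
    }
    if ($unlockedUntil > $now) {
        return 'allow';   // solved the captcha within the last 24 hours
    }
    return 'captcha';     // unwanted bot, or a false positive who can unlock
}

// Timestamp until which a visitor stays unlocked after solving the captcha
function unlock_expiry(int $solvedAt): int
{
    return $solvedAt + 24 * 60 * 60;
}
```

A page would call `classify_request($_SERVER['HTTP_USER_AGENT'] ?? '', $storedExpiry, time())` and show the captcha only when the result is `'captcha'`.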
I use this function... part of the regex comes from PrestaShop, but I added some more bots to it.
public function isBot()
{
$bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|H?m?h?kki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
$userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
$isBot = !$userAgent || preg_match($bot_regex, $userAgent);
return $isBot;
}
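Either way, be aware that some bots fake their identity by sending a browser-like user agent
(I see this behavior from many Russian IPs on my site).
One distinctive feature of most bots is that they don't carry any cookies, so no session is attached to them.
(I am not sure how to do it, but this is surely the best way to track them.)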
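One possible way to act on the cookie observation, sketched under heavy assumptions: set a marker cookie on every response and treat a client that keeps sending requests without ever returning it as suspicious. The cookie name, the threshold, and the per-client request counter (which you would have to track yourself, for example per IP) are all hypothetical, and real browsers with cookies disabled will show up as false positives.

```php
<?php
// Hypothetical marker cookie name; pick any name not used elsewhere on the site.
const MARKER_COOKIE = 'seen_before';

// True when the request brought back the marker cookie set on an earlier response.
function has_marker_cookie(array $cookies): bool
{
    return isset($cookies[MARKER_COOKIE]);
}

// Heuristic only: a client that has made several requests without ever
// returning the cookie is probably not a normal browser.
// $requestCount is assumed to be tracked per client elsewhere.
function looks_cookieless(array $cookies, int $requestCount, int $threshold = 3): bool
{
    return !has_marker_cookie($cookies) && $requestCount >= $threshold;
}
```

In a page you would call `setcookie(MARKER_COOKIE, '1')` before any output and pass `$_COOKIE` as the first argument.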
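Because any client can set the user agent to whatever it wants, looking for 'Googlebot', 'bingbot' and so on is only half the job.
The second part is verifying the client's IP. In the old days this required maintaining IP lists, and all the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google https://support.google.com/webmasters/answer/80553 and Bing http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
First perform a reverse DNS lookup of the client's IP. For Google this yields a host name under googlebot.com, for Bing one under search.msn.com. Then, because anyone can set up such a reverse DNS entry on their own IP, you need to verify it with a forward DNS lookup on that hostname. If the resulting IP is the same as the site visitor's, you can be sure it is a crawler from that search engine.
I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier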
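As a rough PHP sketch of the two-step check described above (this is not a port of the Java library, and the function names are made up for illustration): the resolvers are passed in as optional callables so the logic can be tried without real DNS, defaulting to PHP's `gethostbyaddr` and `gethostbynamel`. Note that `gethostbynamel` only returns IPv4 addresses, so IPv6 visitors would need a different forward lookup.

```php
<?php
// True when $host equals $domain or is a subdomain of it.
// The leading dot keeps look-alikes such as "evilgooglebot.com" from matching.
function host_in_domain(string $host, string $domain): bool
{
    $host = strtolower(rtrim($host, '.'));
    $suffix = '.' . strtolower($domain);
    return $host === strtolower($domain)
        || substr($host, -strlen($suffix)) === $suffix;
}

// Forward-confirmed reverse DNS check.
// $domains: e.g. array('googlebot.com', 'google.com', 'search.msn.com')
// $reverse / $forward: hypothetical injection points for testing.
function verify_crawler_ip(string $ip, array $domains, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup. gethostbyaddr returns the IP unchanged on
    // failure and false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // The hostname must belong to one of the expected domains
    $inDomain = false;
    foreach ($domains as $domain) {
        if (host_in_domain($host, $domain)) {
            $inDomain = true;
            break;
        }
    }
    if (!$inDomain) {
        return false;
    }

    // Step 2: forward lookup must lead back to the visitor's IP
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

A typical call would be `verify_crawler_ip($_SERVER['REMOTE_ADDR'], array('googlebot.com', 'google.com'))`, run only for requests whose user agent already claims to be a crawler, since DNS lookups are slow.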
小智 8
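If you really need to detect the GOOGLE engine bots, you should never rely on the "user_agent" or "IP" address alone, because the "user_agent" can be changed. Instead, follow what Google says in "Verifying Googlebot".
To verify Googlebot as the caller:
1. Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
2. Verify that the domain name is in either googlebot.com or google.com.
3. Run a forward DNS lookup on the domain name retrieved in step 1, using the host command. Verify that it is the same as the original accessing IP address from your logs.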
Here is the code I tested:
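<?php
$remote_add = $_SERVER['REMOTE_ADDR'];
// Step 1: reverse DNS lookup of the visitor's IP
$hostname = gethostbyaddr($remote_add);
// The leading dot stops look-alike hosts such as "evilgooglebot.com" from matching
$googlebot = '.googlebot.com';
$google = '.google.com';
// Step 2: the hostname must end with one of the two domains
if (is_string($hostname) && (stripos(strrev($hostname), strrev($googlebot)) === 0 || stripos(strrev($hostname), strrev($google)) === 0))
{
    // Step 3: the forward lookup must lead back to the original IP
    if (gethostbyname($hostname) === $remote_add)
    {
        //add your code
    }
}
?>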
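In this code we check the hostname, which must end with "googlebot.com" or "google.com". Matching at the end matters, because a hostname such as googlebot.com.example.net merely contains the domain without belonging to it. I hope you enjoy ;)
You can analyze the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client's IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.
I made a good and fast function for this: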
function is_bot(){
    if(isset($_SERVER['HTTP_USER_AGENT']))
    {
        // Dots in the Java version are escaped so they match literally;
        // "=== 1" makes the function return a real boolean
        return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1\.7\.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']) === 1;
    }
    return false;
}
By my estimate this covers about 99% of all bots, search engines and so on that identify themselves in the user agent.