Jon*_*Jon 43 javascript bots web-crawler
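How can I detect search crawlers? I ask because I want to suppress certain JavaScript calls when the user agent is a bot.
I found an example of how to detect a particular browser, but I can't find one for detecting search crawlers: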
/MSIE (\d+\.\d+);/.test(navigator.userAgent); //test for MSIE x.x
Examples of the search crawlers I want to block:
Google
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
Googlebot/2.1 (+http://www.google.com/bot.html)
Baidu
Baiduspider+(+http://www.baidu.com/search/spider_jp.html)
Baiduspider+(+http://www.baidu.com/search/spider.htm)
BaiDuSpider
meg*_*wac 41
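This is the regex that the Ruby user-agent library agent_orange uses to test whether a userAgent looks like a bot. You can narrow it down to specific bots by consulting a list of bot userAgent strings: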
/bot|crawler|spider|crawling/i
For example, you might have an object, util.browser, in which you store the type of device the user is on:
util.browser = {
bot: /bot|googlebot|crawler|spider|robot|crawling/i.test(navigator.userAgent),
mobile: ...,
desktop: ...
}
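As a minimal sketch of how such a flag might serve the original goal (skipping certain JavaScript calls for crawlers), the check can be wrapped in a small guard. The names isBot and runUnlessBot are hypothetical, and the user agent is passed in explicitly so the snippet also runs outside a browser.

```javascript
// Generic pattern from the answer above; it matches any user agent containing these terms.
const BOT_PATTERN = /bot|crawler|spider|crawling/i;

// Hypothetical helper: returns true when the user agent looks like a crawler.
function isBot(userAgent) {
  return BOT_PATTERN.test(String(userAgent || ''));
}

// Hypothetical guard: runs `fn` only for user agents that do not look like crawlers.
// Returns true if `fn` was executed, false if it was skipped.
function runUnlessBot(userAgent, fn) {
  if (isBot(userAgent)) {
    return false;
  }
  fn();
  return true;
}

// In a browser you would pass navigator.userAgent instead of a sample string.
const sampleAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
const executed = runUnlessBot(sampleAgent, () => console.log('tracking call executed'));
console.log(executed); // the callback is skipped for this agent
```

Returning a boolean from the guard is only a convenience for logging or testing; the callback itself is the part that gets suppressed.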
Edo*_*Edo 15
According to this post, the following regex matches the largest search engines.
/bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex/i
.test(navigator.userAgent)
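The search engines matched are the ones named in the pattern: Google, Baidu, Bing, MSN, DuckDuckGo (duckduckbot), Teoma, Yahoo (slurp), and Yandex.
In addition, I've added bot as a catch-all for smaller crawlers/bots.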
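One thing to keep in mind with broad terms such as bot or google: they match anywhere in the string, so an ordinary browser whose user agent happens to contain those letters is classified as a crawler too. The sketch below uses illustrative strings. The second one is modelled on an Android phone whose model name contains "CUBOT" and is an assumption, not a captured value.

```javascript
const searchEnginePattern = /bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex/i;

const samples = {
  googlebot: 'Googlebot/2.1 (+http://www.google.com/bot.html)',
  // Illustrative only: a phone whose model name happens to contain "bot".
  cubotPhone: 'Mozilla/5.0 (Linux; Android 8.1.0; CUBOT_X18_Plus) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.80 Mobile Safari/537.36',
  desktopFirefox: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
};

// Test every sample and collect the results by name.
const results = {};
for (const [name, ua] of Object.entries(samples)) {
  results[name] = searchEnginePattern.test(ua);
}

console.log(results);
// googlebot and cubotPhone both match; desktopFirefox does not.
```

If false positives like this matter for your use case, a more specific list of crawler names may be a better fit than the catch-all term.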
Ser*_*ure 12
Try this. It is based on the list of crawlers available at https://github.com/monperrus/crawler-user-agents
var botPattern = "(googlebot\/|Googlebot-Mobile|Googlebot-Image|Google favicon|Mediapartners-Google|bingbot|slurp|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|phpcrawl|msnbot|jyxobot|FAST-WebCrawler|FAST Enterprise Crawler|biglotron|teoma|convera|seekbot|gigablast|exabot|ngbot|ia_archiver|GingerCrawler|webmon |httrack|webcrawler|grub.org|UsineNouvelleCrawler|antibot|netresearchserver|speedy|fluffy|bibnum.bnf|findlink|msrbot|panscient|yacybot|AISearchBot|IOI|ips-agent|tagoobot|MJ12bot|dotbot|woriobot|yanga|buzzbot|mlbot|yandexbot|purebot|Linguee Bot|Voyager|CyberPatrol|voilabot|baiduspider|citeseerxbot|spbot|twengabot|postrank|turnitinbot|scribdbot|page2rss|sitebot|linkdex|Adidxbot|blekkobot|ezooms|dotbot|Mail.RU_Bot|discobot|heritrix|findthatfile|europarchive.org|NerdByNature.Bot|sistrix crawler|ahrefsbot|Aboundex|domaincrawler|wbsearchbot|summify|ccbot|edisterbot|seznambot|ec2linkfinder|gslfbot|aihitbot|intelium_bot|facebookexternalhit|yeti|RetrevoPageAnalyzer|lb-spider|sogou|lssbot|careerbot|wotbox|wocbot|ichiro|DuckDuckBot|lssrocketcrawler|drupact|webcompanycrawler|acoonbot|openindexspider|gnam gnam spider|web-archive-net.com.bot|backlinkcrawler|coccoc|integromedb|content crawler spider|toplistbot|seokicks-robot|it2media-domain-crawler|ip-web-crawler.com|siteexplorer.info|elisabot|proximic|changedetection|blexbot|arabot|WeSEE:Search|niki-bot|CrystalSemanticsBot|rogerbot|360Spider|psbot|InterfaxScanBot|Lipperhey SEO Service|CC Metadata Scaper|g00g1e.net|GrapeshotCrawler|urlappendbot|brainobot|fr-crawler|binlar|SimpleCrawler|Livelapbot|Twitterbot|cXensebot|smtbot|bnf.fr_bot|A6-Indexer|ADmantX|Facebot|Twitterbot|OrangeBot|memorybot|AdvBot|MegaIndex|SemanticScholarBot|ltx71|nerdybot|xovibot|BUbiNG|Qwantify|archive.org_bot|Applebot|TweetmemeBot|crawler4j|findxbot|SemrushBot|yoozBot|lipperhey|y!j-asr|Domain Re-Animator Bot|AddThis)";
var re = new RegExp(botPattern, 'i');
var userAgent = 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)';
if (re.test(userAgent)) {
console.log('the user agent is a crawler!');
}
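Note that entries such as grub.org or Mail.RU_Bot contain a dot, which in a regular expression matches any character. That rarely matters in practice, but if you prefer literal matching, the pattern can be built from an array with each entry escaped first. This is a sketch only: escapeRegExp and buildBotRegExp are hypothetical helper names, and the array is a short excerpt of the list above.

```javascript
// Escapes regex metacharacters so each entry is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds one case-insensitive RegExp from a list of literal names.
function buildBotRegExp(names) {
  return new RegExp('(' + names.map(escapeRegExp).join('|') + ')', 'i');
}

// A short excerpt of the list above, for illustration only.
const crawlerNames = ['googlebot/', 'bingbot', 'baiduspider', 'grub.org', 'Mail.RU_Bot', 'y!j-asr'];
const crawlerRe = buildBotRegExp(crawlerNames);

// Matches, because "Googlebot/" is in the list and the test is case-insensitive.
console.log(crawlerRe.test('Googlebot/2.1 (+http://www.googlebot.com/bot.html)'));
// Does not match, because the dot in "grub.org" is now literal.
console.log(crawlerRe.test('grubXorg'));
```

Keeping the names in an array also makes it easier to add or remove entries than editing one long string.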
This may help with detecting robot user agents while also keeping things more organized:
JavaScript
const detectRobot = (userAgent) => {
const robots = new RegExp([
/bot/,/spider/,/crawl/, // GENERAL TERMS
/APIs-Google/,/AdsBot/,/Googlebot/, // GOOGLE ROBOTS
/mediapartners/,/Google Favicon/,
/FeedFetcher/,/Google-Read-Aloud/,
/DuplexWeb-Google/,/googleweblight/,
/bing/,/yandex/,/baidu/,/duckduck/,/yahoo/, // OTHER ENGINES
/ecosia/,/ia_archiver/,
/facebook/,/instagram/,/pinterest/,/reddit/, // SOCIAL MEDIA
/slack/,/twitter/,/whatsapp/,/youtube/,
/semrush/, // OTHER
].map((r) => r.source).join("|"),"i"); // BUILD REGEXP + "i" FLAG
return robots.test(userAgent);
};
TypeScript
const detectRobot = (userAgent: string): boolean => {
const robots = new RegExp(([
/bot/,/spider/,/crawl/, // GENERAL TERMS
/APIs-Google/,/AdsBot/,/Googlebot/, // GOOGLE ROBOTS
/mediapartners/,/Google Favicon/,
/FeedFetcher/,/Google-Read-Aloud/,
/DuplexWeb-Google/,/googleweblight/,
/bing/,/yandex/,/baidu/,/duckduck/,/yahoo/, // OTHER ENGINES
/ecosia/,/ia_archiver/,
/facebook/,/instagram/,/pinterest/,/reddit/, // SOCIAL MEDIA
/slack/,/twitter/,/whatsapp/,/youtube/,
/semrush/, // OTHER
] as RegExp[]).map((r) => r.source).join("|"),"i"); // BUILD REGEXP + "i" FLAG
return robots.test(userAgent);
};
Usage on the server:
const userAgent = req.get('user-agent');
const isRobot = detectRobot(userAgent);
Usage on the "client" / some virtual browser that a bot might be using:
const userAgent = navigator.userAgent;
const isRobot = detectRobot(userAgent);
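On the server the User-Agent header can be absent altogether, so it may be worth deciding explicitly how to treat that case. The sketch below assumes a Node-style headers object with lowercase keys. classifyRequest is a hypothetical helper, and the shortened pattern stands in for detectRobot from above.

```javascript
// Shortened pattern for illustration; in practice reuse detectRobot from above.
const robotPattern = /bot|spider|crawl|bing|yandex|baidu|duckduck|yahoo/i;

// Hypothetical helper: accepts a Node-style headers object (lowercase keys).
// A missing or empty User-Agent is reported separately so the caller can decide what to do.
function classifyRequest(headers) {
  const userAgent = headers['user-agent'];
  if (typeof userAgent !== 'string' || userAgent.trim() === '') {
    return 'unknown';
  }
  return robotPattern.test(userAgent) ? 'robot' : 'browser';
}

console.log(classifyRequest({ 'user-agent': 'Baiduspider+(+http://www.baidu.com/search/spider.htm)' })); // robot
console.log(classifyRequest({})); // unknown
```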
Overview of Google crawlers:
https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers