I am trying to separate robot access logs from human access logs, so I am using the following configuration:
http {
    ....
    map $http_user_agent $ifbot {
        default 0;
        "~*rogerbot" 3;
        "~*ChinasoSpider" 3;
        "~*Yahoo" 1;
        "~*Bot" 1;
        "~*Spider" 1;
        "~*archive" 1;
        "~*search" 1;
        "~Mediapartners-Google" 1;
        "~*bingbot" 1;
        "~*YandexBot" 1;
        "~*Feedly" 2;
        "~*Superfeedr" 2;
        "~*QuiteRSS" 2;
        "~*g2reader" 2;
        "~*Digg" 2;
        "~*trendiction" 3;
        "~*AhrefsBot" 3;
        "~*curl" 3;
        "~*Ruby" 3;
        "~*Player" 3;
        "~*Go\ http\ package" 3;
        "~*Lynx" 3;
        "~*Sleuth" 3;
        "~*Python" 3;
        "~*Wget" 3;
        "~*perl" 3;
        "~*httrack" 3;
        "~*JikeSpider" 3;
        "~*PHP" 3;
        "~*WebIndex" 3;
        "~*magpie-crawler" 3;
        "~*JUC" 3;
        "~*Scrapy" 3;
        "~*libfetch" 3;
        "~*WinHTTrack" 3;
        "~*htmlparser" 3;
        "~*urllib" 3;
        "~*Zeus" 3;
        "~*scan" 3;
        "~*Indy\ Library" 3;
        "~*libwww-perl" 3;
        "~*GetRight" 3;
        "~*GetWeb!" 3;
        "~*Go!Zilla" 3;
        "~*Go-Ahead-Got-It" 3;
        "~*Download\ Demon" 3;
        "~*TurnitinBot" 3;
        "~*WebscanSpider" 3;
        "~*WebBench" 3;
        "~*YisouSpider" 3;
        "~*check_http" 3;
        "~*webmeup-crawler" 3;
        "~*omgili" 3;
        "~*blah" 3;
        "~*fountainfo" 3;
        "~*MicroMessenger" 3;
        "~*QQDownload" 3;
        "~*shoulu.jike.com" 3;
        "~*omgilibot" 3;
        "~*pyspider" 3;
    }
    ....
}
In the server section I am using:
if ($ifbot = "1") {
set $spiderbot 1;
}
if ($ifbot = "2") {
set $rssbot 1;
}
if ($ifbot = "3") {
return 403;
access_log /web/log/badbot.log main;
}
access_log /web/log/location_access.log main;
access_log /web/log/spider_access.log main if=$spiderbot;
access_log /web/log/rssbot_access.log main if=$rssbot;
However, it looks like nginx writes some of the robot entries into both location_access.log and spider_access.log.
How can I keep the robot logs separate?
A related problem is that some robot requests are never written to spider_access.log and only show up in location_access.log, so my map does not seem to work. Is there something wrong with how I defined the map?
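To be clear, what I am ultimately after is something like the sketch below. The chained map producing $humanlog is hypothetical, my guess at how to keep bot lines out of the main log, and I have not verified it:

map $ifbot $humanlog {
    # $ifbot is 0 only for user agents that matched nothing above,
    # so $humanlog is 1 for humans and 0 for every bot category
    default 0;
    "0"     1;
}

# the unconditional access_log had no if= condition, so every
# request (bots included) landed in location_access.log; giving
# each log its own flag should keep the three files disjoint
access_log /web/log/location_access.log main if=$humanlog;
access_log /web/log/spider_access.log main if=$spiderbot;
access_log /web/log/rssbot_access.log main if=$rssbot;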
You are pushing the limits of nginx's conditional if, which is meant to be used as little as possible.
Consider using rsyslog to handle your nginx access logs instead. Rsyslog has powerful options for matching on the contents of a log line and routing matches to different outputs, which would give you the three separate logs you are looking for.
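A minimal sketch of that setup, assuming nginx 1.7.1 or later (for syslog output) and an rsyslog version with RainerScript support; the tag, file paths, and match strings are illustrative, not tested against your traffic:

# nginx: ship access log lines to the local syslog socket
access_log syslog:server=unix:/dev/log,tag=nginxaccess main;

# /etc/rsyslog.d/30-nginx.conf (illustrative path)
# route lines tagged by nginx into three files based on
# substrings of the logged user agent; note that 'contains'
# is case-sensitive, so add variants as needed
if $syslogtag startswith 'nginxaccess' then {
    if $msg contains 'Feedly' or $msg contains 'QuiteRSS' then {
        action(type="omfile" file="/web/log/rssbot_access.log")
    } else {
        if $msg contains 'Bot' or $msg contains 'Spider' then {
            action(type="omfile" file="/web/log/spider_access.log")
        } else {
            action(type="omfile" file="/web/log/location_access.log")
        }
    }
    stop
}

This moves the classification out of nginx entirely, so you no longer need the if blocks or the per-log if= conditions in the server section.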