Ban robots from website

Question

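My website frequently goes down because spiders are accessing too many resources. That is what my hosting provider told me. They told me to ban these IP addresses: 46.229.164.98 46.229.164.100 46.229.164.101

But I have no idea how to do that.

I googled a bit and have now added these lines to the .htaccess file in the root directory: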

# allow all except those indicated here
<Files *>
order allow,deny
allow from all
deny from 46.229.164.98
deny from 46.229.164.100
deny from 46.229.164.101
</Files>

Is this 100% correct? What else can I do? Please help me. I really don't know what to do.

Answer 1

Sha*_*rky 25

Based on these:

https://www.projecthoneypot.org/ip_46.229.164.98 https://www.projecthoneypot.org/ip_46.229.164.100 https://www.projecthoneypot.org/ip_46.229.164.101

it looks like the bot is http://www.semrush.com/bot.html

If that is indeed the bot, their page says:

To remove our bot from crawling your site simply insert the following lines to your
"robots.txt" file:

User-agent: SemrushBot
Disallow: /

Of course, this does not guarantee that the bot will obey the rules. You can block it in several ways, and .htaccess is one of them, just as you did.
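
For reference, the `Order`/`Allow`/`Deny` directives used in the question belong to Apache 2.2; on Apache 2.4 they are deprecated and only available through mod_access_compat. A minimal sketch of the same IP block in 2.4 syntax follows. It assumes the server runs Apache 2.4 with mod_authz_core and that the host allows authorization directives in .htaccess (`AllowOverride AuthConfig`), neither of which the question confirms.

```apache
# Sketch only: assumes Apache 2.4+ with mod_authz_core
# and AllowOverride AuthConfig for this directory.
<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except the three addresses named by the hosting provider
    Require not ip 46.229.164.98
    Require not ip 46.229.164.100
    Require not ip 46.229.164.101
</RequireAll>
```

If the server is still on Apache 2.2, the block from the question is the form to use.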

You can also use this little trick to deny any IP address that sends "SemrushBot" in its user agent string:

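Options +FollowSymlinks
RewriteEngine On
RewriteBase /
# Flag requests whose User-Agent starts with the given string (case-insensitive)
SetEnvIfNoCase User-Agent "^SemrushBot" bad_user
SetEnvIfNoCase User-Agent "^WhateverElseBadUserAgentHere" bad_user
# Refuse every request that was flagged above
Deny from env=bad_user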

This approach also blocks any other IPs the bot may use.
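
The same user-agent block can be sketched in Apache 2.4 syntax, under the same assumptions as any 2.4 setup (mod_authz_core and mod_setenvif loaded, and .htaccess overrides for `FileInfo` and `AuthConfig` permitted by the host). The variable name `bad_user` is carried over from the snippet above; the pattern is left unanchored here so that it matches anywhere in the header.

```apache
# Sketch only: assumes Apache 2.4+ with mod_setenvif and mod_authz_core.
# Flag any request whose User-Agent contains "SemrushBot" (case-insensitive).
SetEnvIfNoCase User-Agent "SemrushBot" bad_user

<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except requests flagged by the SetEnvIfNoCase line above
    Require not env bad_user
</RequireAll>
```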

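For more information on blocking by user agent string, see: https://stackoverflow.com/a/7372572/953684

I should add that if your site is being brought down by a spider, it usually means you have a badly written script or a very weak server.

Edit:

This line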

SetEnvIfNoCase User-Agent "^SemrushBot" bad_user

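matches only if the User-Agent begins with the string SemrushBot (the caret ^ means "begins with"). If you want to search for SemrushBot, let's say, ANYWHERE in the User-Agent string, simply remove the caret: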

SetEnvIfNoCase User-Agent "SemrushBot" bad_user

The above matches if the User-Agent contains the string SemrushBot anywhere (yes, there is no need for .*).
