How do I prevent site scraping?

Question


pix*_*xel 287 html architecture screen-scraping piracy-prevention

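I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them).

How can I prevent screen scraping? Is it even possible?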

Answer 1

Jon*_*ica 300

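Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, with more tips and details.

In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers work and, by extension, what prevents them from working well.

There are various types of scraper, and each works differently: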

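Spiders, such as Google's bot or website copiers like HTtrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
Shell scripts: sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (regex) to extract the data.
HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to the shell-script regex approach, these work by extracting data from pages based on patterns in the HTML, usually ignoring everything else.

For example: if your website has a search feature, such a scraper might submit a search request and then pull all the result links and their titles out of the results page HTML, in order to get only the search result links and their titles. These are the most common.
Screenscrapers, based on eg. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
- Getting the HTML from the browser after your page has loaded and JavaScript has run, and then using an HTML parser to extract the desired data. These are the most common, so many of the methods for breaking HTML parsers/scrapers also work here.
- Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
Web scraping services such as ScrapingHub or Kimono. In fact, there are people whose job is to figure out how to scrape your site and pull out the content for others to use.

Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and the people who pay them to do so) may not bother to scrape your website.
Embedding your website in other sites' pages with frames, and embedding your site in mobile apps.

While not technically scraping, mobile apps (Android and iOS) can embed websites and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
Human copy-paste: people will copy and paste your content in order to use it elsewhere.

There is a lot of overlap between these different kinds of scraper, and many scrapers will behave similarly, even if they use different technologies and methods.

These tips are mostly my own ideas, various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the internet.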

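How to stop scraping

You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:

Monitor your logs & traffic patterns; limit access if you see unusual activity:

Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.

Specifically, some ideas:

Rate limiting:

Only allow users (and scrapers) to perform a limited number of actions in a certain time - for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers and make them ineffective. You could also show a captcha if actions are completed too fast, or faster than a real user would complete them.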
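As a rough illustration of the rate-limiting idea, here is a minimal in-memory sliding-window limiter in Python. The class name, the limits, and the injectable `clock` are all assumptions made for this sketch; a real site would more likely keep the counters in a shared store so that every web server sees the same state.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_actions` per `window_seconds` for each client key."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client key -> timestamps of recent actions

    def allow(self, client_key):
        """Record one action and return True, or return False if over the limit."""
        now = self.clock()
        hits = self._hits[client_key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_actions:
            return False  # over the limit: block, delay, or show a captcha
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_actions=3, window_seconds=1.0)
    for attempt in range(5):
        print(attempt, limiter.allow("203.0.113.7"))
```

The client key does not have to be an IP address; it could just as well be an account ID or a combined fingerprint.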
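Detect unusual activity:

If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages, or someone performing an unusual number of searches, you can prevent access or show a captcha for subsequent requests.
Don't just monitor & rate limit by IP address - use other indicators too:

If you do block or rate limit, don't just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users/scrapers include:
- How fast users fill out forms, and where on a button they click;
- You can gather a lot of information with JavaScript, such as screen size/resolution, timezone, installed fonts, etc; you can use this to identify users.
- HTTP headers and their order, especially User-Agent.
As an example, if you get many requests from a single IP address, all using the same User Agent and screen size (determined with JavaScript), and the user (scraper in this case) always clicks on the button in the same way and at regular intervals, it's probably a screen scraper; you can temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address), and this way you won't inconvenience real users on that IP address, eg. in case of a shared internet connection.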

You can also take this further, as you can identify similar requests, even if they come from different IP addresses, indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests, but they come from different IP addresses, you can block. Again, be aware of not inadvertently blocking real users.

This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.
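To make the "other indicators" idea concrete, here is a small Python sketch that hashes a few request traits into one key and counts repeats. The function names and the choice of traits are assumptions for illustration only; leaving `ip` out gives a key that also matches otherwise identical requests arriving from different addresses.

```python
import hashlib
from collections import Counter


def fingerprint(user_agent, screen_size, header_names, ip=None):
    """Hash several request traits into one opaque client key.

    Pass `ip` to tell apart clients behind one address; omit it to group
    identical-looking requests that arrive from many addresses.
    """
    parts = [ip or "", user_agent, screen_size, ",".join(header_names)]
    # "\x1f" (unit separator) keeps adjacent fields from running together.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def suspicious_fingerprints(requests, threshold):
    """Return the fingerprints seen at least `threshold` times in `requests`."""
    counts = Counter(fingerprint(**request) for request in requests)
    return {fp for fp, seen in counts.items() if seen >= threshold}


if __name__ == "__main__":
    sample = [
        {"user_agent": "ExampleBot", "screen_size": "800x600", "header_names": ["Host"]}
    ] * 50
    print(suspicious_fingerprints(sample, threshold=20))
```

Header order is part of the key on purpose, since the text above lists it as a signal; the threshold would need tuning so that real users on common browser setups are not caught.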

Related questions on Security Stack Exchange:
- How to uniquely identify users with the same external IP address? for more details, and
- Why do people use IP address bans when IP addresses often change? for info on the limits of these methods.
Instead of temporarily blocking access, use a Captcha:

The simple way to implement rate-limiting would be to temporarily block access for a certain amount of time, however using a Captcha may be better, see the section on Captchas further down.

Require registration & login

Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.

If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.

In order to avoid scripts creating many accounts, you should:

Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
Require a captcha to be solved during registration/account creation.

Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.

Block access from cloud hosting and scraping service IP addresses

Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.

Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.

Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.

Make your error message nondescript if you do block

If you do block/limit access, you should ensure that you don't tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:

Too many requests from your IP address, please try again later.
Error, User Agent header not present !

Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:

Sorry, something went wrong. You can contact support via helpdesk@example.com, should the problem persist.

This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don't block and thus cause legitimate users to contact you.

Use Captchas if you suspect that your website is being accessed by a scraper.

Captchas ("Completely Automated Public Turing test to tell Computers and Humans Apart") are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.

As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.

Things to be aware of when using Captchas:

Don't roll your own, use something like Google's reCaptcha: It's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site.
Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself, (although quite well hidden) thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid, humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.

Serve your text content as an image

You can render text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.

However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it's also easy to circumvent with some OCR, so don't do it.

You can do something similar with CSS sprites, but that suffers from the same problems.

Don't expose your complete dataset:

If feasible, don't provide a way for a script/bot to get all of your dataset. As an example: You have a news site, with lots of individual articles. You could make those articles be only accessible by searching for them via the on site search, and, if you don't have a list of all the articles on the site and their URLs anywhere, those articles will be only accessible by using the search feature. This means that a script wanting to get all the articles off your site will have to do searches for all possible phrases which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.

This will be ineffective if:

The bot/script does not want/need the full dataset anyway.
Your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.
Searching for something like "and" or "the" can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results).
You need search engines to find your content.

Don't expose your APIs, endpoints, and similar things:

Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data it is trivial to look at the network requests from the page and figure out where those requests are going to, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described.

To deter HTML parsers and scrapers:

Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers too.

Frequently change your HTML

Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.

If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.

You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids/classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.
If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after a h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add/remove extra markup to your HTML, periodically and randomly, eg. adding extra divs or spans. With modern server side HTML processing, this should not be too hard.
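One way the weekly rotation could be automated is to derive each class name from a server-side secret and the current ISO week, as in the Python sketch below. `SECRET`, `rotating_class`, and the 8 to 23 character length range are assumptions for illustration; the same function would have to be used when rendering the HTML and when generating the matching CSS and JavaScript.

```python
import datetime
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical server-side secret, never sent to clients


def rotating_class(logical_name, day, secret=SECRET):
    """Map a stable logical name to an opaque class name that changes weekly."""
    year, week, _ = day.isocalendar()
    message = f"{logical_name}:{year}:{week}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Vary the total length (8 to 23 characters) so a fixed-length pattern
    # such as div.[any-14-characters] cannot be used to find the element.
    length = 8 + int(digest[:2], 16) % 16
    # Prefix a letter so the result never starts with a digit, which would
    # need escaping in a CSS selector.
    return "c" + digest[2:2 + length - 1]


if __name__ == "__main__":
    today = datetime.date.today()
    print(f'<div class="{rotating_class("article-content", today)}">...</div>')
```

Because the name depends only on the secret and the week, cached CSS stays valid for the whole week, which matches the caching note below.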

Things to be aware of:

It will be tedious and difficult to implement, maintain, and debug.
You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.

Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.

See also How to prevent crawlers depending on XPath from getting page contents for details on how this can be implemented in PHP.

Change your HTML based on the user's location

This is sort of similar to the previous tip. If you serve different HTML based on your user's location/country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.

Frequently change your HTML, actively screw with the scrapers by doing so !

An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML:

<div class="search-result">
  <h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q &amp; A website</h3>
  <p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q &amp; A website, with 10 million questions and many users, which...</p>
  <a class="search-result-link" href="/stories/story-link">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)


As you may have guessed this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed:

<div class="the-real-search-result">
  <h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q &amp; A website</h3>
  <p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q &amp; A website, with 10 million questions and many users, which...</p>
  <a class="the-real-search-result-link" href="/stories/story-link">Read more</a>
</div>

<div class="search-result" style="display:none">
  <h3 class="search-result-title">Visit Example.com now, for all the latest Stack Overflow related news !</h3>
  <p class="search-result-excerpt">Example.com is so awesome, visit now !</p>
  <a class="search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)


This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they're hidden with CSS.

Screw with the scraper: Insert fake, invisible honeypot data into your page

Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:

<div class="search-result" style="display:none">
  <h3 class="search-result-title">This search result is here to prevent scraping</h3>
  <p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
  Note that clicking the link below will block access to this site for 24 hours.</p>
  <a class="search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)


A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either because you disallowed /scrapertrap/ in your robots.txt.

You can make your scrapertrap.php do something like block access for the IP address that visited it or force a captcha for all subsequent requests from that IP.

Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.
You can/should combine this with the previous tip of changing your HTML frequently.
Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. You may also want to stop using inline CSS for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.
Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (eg. on a forum). Change the URL frequently, and make any ban times relatively short.
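The trap logic itself can be quite small. The Python sketch below is one possible shape for it, independent of any web framework; `Honeypot`, the 15-minute ban, and the in-memory dictionary are assumptions for illustration (the page text in the example above mentions 24 hours, but the last point argues for keeping bans short).

```python
import time

# robots.txt content that keeps well-behaved crawlers out of the trap.
ROBOTS_TXT = "User-agent: *\nDisallow: /scrapertrap/\n"

BAN_SECONDS = 15 * 60  # assumed value; short bans limit the damage from shared trap links


class Honeypot:
    """Ban any IP that requests a path under the trap prefix."""

    def __init__(self, trap_prefix="/scrapertrap/", ban_seconds=BAN_SECONDS, clock=time.time):
        self.trap_prefix = trap_prefix
        self.ban_seconds = ban_seconds
        self.clock = clock  # injectable for testing
        self._banned_until = {}  # ip -> timestamp when the ban ends

    def check(self, ip, path):
        """Return "trapped", "banned", or "ok" for one incoming request."""
        now = self.clock()
        if path.startswith(self.trap_prefix):
            self._banned_until[ip] = now + self.ban_seconds
            return "trapped"
        if self._banned_until.get(ip, 0) > now:
            return "banned"  # serve a captcha or a nondescript error page
        self._banned_until.pop(ip, None)  # ban expired, forget the entry
        return "ok"


if __name__ == "__main__":
    trap = Honeypot()
    print(trap.check("198.51.100.4", "/scrapertrap/scrapertrap.php"))
    print(trap.check("198.51.100.4", "/search?query=x"))
```

A "banned" result does not have to mean a hard block; showing a captcha instead keeps real users who were tricked into following the link from being locked out.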

Serve fake and useless data if you detect a scraper

If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.

As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.

Don't accept requests if the User Agent is empty/missing

Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.

If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else..)

It's trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.
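The check itself is only a few lines. A minimal Python sketch, assuming the request headers are available as a plain dictionary (`has_user_agent` is a hypothetical helper name):

```python
def has_user_agent(headers):
    """Return True if the request carries a non-blank User-Agent header."""
    for name, value in headers.items():
        if name.lower() == "user-agent":  # header names are case-insensitive
            return bool(value and value.strip())
    return False


if __name__ == "__main__":
    print(has_user_agent({"Host": "example.com"}))
    print(has_user_agent({"User-Agent": "Mozilla/5.0"}))
```

What happens on a `False` result (captcha, limit, or fake data) is a separate decision, as described above.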

Don't accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers

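[This](http://meta.stackoverflow.com/questions/316012/the-answer-im-writing-exceeds-the-30k-maximum-character-limit-what-should-i-do?cb=1) brought me here. Quite an impressive answer. Also, a rather amazing revision history. Thanks for the post. You get an upvote. Not only because of the amount of effort put into it, but because it's useful to me. (10 upvotes)

@JonH, if they're interested, they'll read it. Also, I've broken it up into paragraphs with headings and subheadings, so people can scan it and read the parts they want. In fact, there are quite a few similarly long answers on SO, and people do read them. (5 upvotes)

@JoshCrozier - I just think sites like this don't make good use of this much information. I'm not saying the information is bad. (2 upvotes)

PS: my idea of steganographically fingerprinting the content could possibly be used in court. Imagine the shock when you prove, via unique characteristics in the data itself, that the possessor of your data got it from you... (2 upvotes)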

Answer 2

Dan*_*ien 239

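I will presume that you have set up robots.txt.

As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.

I would consider:

Set up a page, /jail.html.
Disallow access to the page in robots.txt (so the respectful spiders will never visit).
Place a link on one of your pages, hiding it with CSS (display: none).
Record the IP addresses of visitors to /jail.html.

This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.

You might also want to make your /jail.html a whole website that has exactly the same markup as normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.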

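I've seen this technique referred to as a "honeypot" before. It's also used in spam filtering, where you put an email address on a page but hide it or make it clear that it's not for people to send legitimate mail to. Then you collect the IP address of any mail server that delivers mail to that address. (48 upvotes)
Another awesome thing to throw into a honeypot is teergrubing (or tarpitting). This is an old technique that I love - when you identify a bad guy, you bring his spamming/scraping process to a crawl by purposefully keeping his connections open for as long as possible without timing them out. Of course, this may alert them that you're on to them as well, but gosh darn it's fun. http://en.wikipedia.org/wiki/Teergrubing (18 upvotes)
This assumes that they are crawling links. Most scrapers will try to submit some kind of form and scrape the data returned. (11 upvotes)
The only problem with this approach is if I place [img]http://yoursite/jail.html[/img] on a popular forum. You'll get tons of IPs logged in your system, and it will be hard to filter out which ones are bad. If you want to prevent this kind of thing, you need to add a token associated with the IP to the URL. Something like jail.php?t=hoeyvm, and in the database you keep an association between hoeyvm and the IP that requested the page. (11 upvotes)
I've seen Perl-based honeypots for email harvesters that contain links to other "pages" generated by the Perl script. Legitimate bots that read robots.txt don't look at it, and it is hidden from users via CSS, but scrapers (or email harvesters) quickly get caught in an infinitely deep tree of pages, all with bad data on them. Put a link to the script right at the start of each of your pages. (9 upvotes)
robots.txt -> fake data is actually pretty brilliant, assuming it works: it targets the very nature of the attack, which is overuse. You could do all sorts of related things. For example, exclude something in robots.txt and link to it with an invisible link in a color humans can't see. Now the only agents that get there are your scrapers. (7 upvotes)
I've never come across a database whose output pages I couldn't scrape with a simple script... The last one I did, the guy had tried obfuscating the markup, the "next page" method was awkward, it stored session variables and cookies, checked timing patterns... I did the whole site in maybe... 10 lines of Perl or fewer, straight to CSV. If he had blocked me, I could easily have been back from another IP within a day. But let's be real here: if you put information on the web, it's out there, and your only real recourse is legal. (6 upvotes)
This is known as the "honeypot" technique, and it's effective against regular bots. But it is *not* effective against *data miners*, i.e. people who will write a specialized, site-specific data-mining scraping script. (5 upvotes)
So what happens if a legitimate user has CSS disabled and clicks your supposedly invisible honeypot link? (2 upvotes)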

Answer 3

Uni*_*ron 47

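Sue 'em.

Seriously: if you have some money, talk to a good, young lawyer who knows their way around the Internet. You could really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease and desist letter or its equivalent in your country. You may be able to at least scare the bastards.

Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.

It would be a shame if this drove you into messing up your HTML code, dragging down SEO, validity and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that always rely on HTML structures and class/ID names to get the content out).

Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money is something that you should be able to fight against.

Only works in countries with solid legal frameworks. (9 upvotes)
I agree with @TomL. If they're in the West, it's somewhat plausible. But if they're in India/China/Russia/Ukraine/wherever - then, seriously, there's minimal to no chance. I can speak for Russian courts: they won't even bother to work on your claim. (3 upvotes)
Lawyers thrive on conflict - and profit from it. Rarely if ever will a lawyer advise you not to go to court. Anyone who has been through it will tell you that winning or losing has nothing to do with fine concepts of "justice", and everything to do with the arguments, moods and biases on the day. Remember that if it goes wrong, you could be liable not only for your own lawyer's costs but for the other party's too, and if they decide to counter-sue - well. You could easily lose your home and any other assets you have. It's not a gamble I would suggest. I recommend you avoid the courts at all costs. (2 upvotes)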

Answer 4

rye*_*guy 35

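You can't really completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc. and appear as a normal user. The only thing you can do is make the text unavailable at the time the page is loaded - render it as an image, put it in Flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.

If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.

There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow, it could break their scraper. Things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure it's worth it. And even then, they could probably get around it with enough dedication.

Creating a system that limits how many pages an IP can view per minute is a good hack, as screen scrapers will rip through the site much faster than any normal person. (14 upvotes)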

Answer 5

Wil*_*and 31

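Provide an XML API to access your data, in a manner that is simple to use. If people want your data, they'll get it, so you might as well go all out.

This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.

Then all you have to do is convince the people who want your data to use the API. ;)

@alecwh: And charge for access! (6 upvotes)
This seems very reasonable. Screen scraping is damn hard to prevent, and if you provide an API, you can put some restrictions on it, add notices ("Content from ----.com"), and basically control what data is given out. (3 upvotes)
I've awarded you the bounty, partly because the web would be so much better if every website did this. Let's hope it becomes more common. (2 upvotes)
As soon as you make them register for the service, they'll go back to the normal site. (2 upvotes)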

Answer 6

Liz*_*ard 21

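Sorry, it's really quite hard to do this...

I would suggest that you politely ask them not to use your content (if your content is copyrighted).

If it is and they don't take it down, then you can take further action and send them a cease and desist letter.

Generally, whatever you do to prevent scraping will probably end up having a more negative effect elsewhere, e.g. on accessibility, bots/spiders, etc.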

Answer 7

Ars*_*eep 12

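Okay, as all the posts say, if you want your site to be search-engine friendly, then bots can scrape it for sure.

But you can still do a few things, and they may be effective against 60-70% of scraping bots.

Make a checker script like the one below.

If a particular IP address is visiting very fast, then after a few visits (5-10) put its IP address + browser information in a file or database.

The next step

(This would be a background process, running all the time or scheduled every few minutes.) Make another script that keeps checking those suspicious IP addresses.

Case 1. The user agent is that of a known search engine like Google, Bing, or Yahoo (you can find more information on user agents by googling it). Then check http://www.iplists.com/ and try to match the IP address against the patterns in that list. If it looks like a faked user agent, ask for a CAPTCHA to be filled in on the next visit. (You'll need to research bot IP addresses a bit more. I know this is achievable; also try a whois lookup on the IP address. It can be helpful.)

Case 2. The user agent is not that of a search bot: simply ask for a CAPTCHA to be filled in on the next visit.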
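For Case 1, a commonly used alternative to static IP lists is a reverse DNS lookup followed by a forward lookup to confirm it. This is a different technique from the list matching described above, sketched here in Python; the hostname suffixes in `CRAWLER_DOMAINS` are assumptions that should be checked against each search engine's own documentation, and the lookup functions are injectable only so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for genuine crawlers; verify against each
# search engine's published guidance before relying on them.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip, reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Check a claimed search-engine IP with reverse and then forward DNS."""
    try:
        hostname = reverse_lookup(ip)[0]
        if not hostname.lower().endswith(CRAWLER_DOMAINS):
            return False
        # The forward lookup must map back to the same IP; otherwise
        # anyone could point a fake PTR record at a crawler-like name.
        return ip in forward_lookup(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False


if __name__ == "__main__":
    # Performs real DNS lookups, so the result depends on the network.
    print(is_verified_crawler("192.0.2.1"))
```

A request that claims a search-engine user agent but fails this check would then get the CAPTCHA on its next visit, as in Case 1.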

Answer 8

jm6*_*666 10

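Late answer - and this answer probably isn't the one you want to hear...

I have already written many (many tens of) different specialized data-mining scrapers (just because I like the "open data" philosophy).

There is already a lot of advice in the other answers - now I will play the devil's advocate and extend and/or correct their claims of effectiveness.

First:

if someone really wants your data
you can't effectively (technically) hide your data
if the data should be publicly accessible to your "regular users"

Trying to use technical barriers isn't worth the trouble it causes:

to your regular users, by worsening their user experience
to regular and welcome bots (search engines)
etc...

Plain HTML - the easiest thing to parse is plain HTML pages with a well-defined structure and CSS classes. E.g. it is enough to inspect an element with Firebug and use the right XPaths and/or CSS paths in my scraper.

You could generate the HTML structure dynamically, and you could also generate the CSS class names (and the CSS itself) dynamically (e.g. by using some random class names) - but

you want to present the information to your regular users in a consistent way
so, again, it is enough to analyze the page structure once more to set up the scraper
and this can be done automatically by analyzing some "already known content"
- once someone already knows (from an earlier scrape), e.g.:
- what the page contains about "phil collins"
- it is enough to display the "phil collins" page and (automatically) analyze how the page is structured "today" :)

You can't change the structure for every response, because your regular users will hate you. Also, this will cause more trouble for you (maintenance) than for the scraper. The XPath or CSS path can be determined automatically by the scraping script from the known content.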
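To show how little effort the "known content" trick takes, here is a rough Python sketch using only the standard library. `locate` and `PathFinder` are hypothetical names, and the path is a simplified tag path rather than a full XPath with indexes; it only illustrates that the location of known text can be rediscovered automatically after the markup changes.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and so must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathFinder(HTMLParser):
    """Record the tag path of the first text node containing a known string."""

    def __init__(self, known_text):
        super().__init__()
        self.known_text = known_text
        self.stack = []    # currently open tags, outermost first
        self.found = None  # tag path of the match, once seen

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.found is None and self.known_text in data:
            self.found = "/" + "/".join(self.stack)


def locate(html, known_text):
    """Return the tag path leading to `known_text`, or None if it is absent."""
    finder = PathFinder(known_text)
    finder.feed(html)
    return finder.found


if __name__ == "__main__":
    page = '<html><body><div class="zx81"><h1>Phil Collins</h1></div></body></html>'
    print(locate(page, "Phil Collins"))
```

Class names are ignored entirely here, which is why randomizing them alone does not stop this approach.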

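Ajax - a little bit harder at the start, but it often speeds up the scraping process :) - why?

When analyzing the requests and responses, I just set up my own proxy server (written in Perl) and point my Firefox at it. Of course, because it is my own proxy, it is completely hidden - the target server sees it as a regular browser (so no X-Forwarded-For and similar headers). Based on the proxy logs, it is mostly possible to determine the "logic" of the ajax requests, e.g. I can skip most of the HTML scraping and just use the well-structured ajax responses (mostly in JSON format).

So, ajax doesn't help much...

Somewhat more complicated are pages which use a lot of packed JavaScript functions.

Two basic methods can be used here:

unpack and understand the JS and create a scraper which follows the JavaScript logic (the hard way)
or (the one I prefer myself) - just use Mozilla with Mozrepl for scraping. I.e. the real scraping is done in a full-featured, JavaScript-enabled browser, which is programmed to click the right elements and simply grab the "decoded" responses directly from the browser window.

Such scraping is slow (the scraping is done as in a regular browser), but it is

very easy to set up and use
nearly impossible to counter :)
and the "slowness" is needed anyway to get around "blocking of rapid requests from the same IP"

User-Agent based filtering doesn't help at all. Any serious data miner will set it to a correct value in his scraper.

Require login - doesn't help. The simplest way to beat it (without any analysis and/or scripting of the login protocol) is just to log into the site as a regular user, using Mozilla, and then run the Mozrepl-based scraper...

Remember, requiring login helps against anonymous bots, but doesn't help against someone who wants to scrape your data. He will just register on your site as a regular user.

Using frames isn't very effective either. This is used by many live movie services and it is not very hard to beat. The frames are simply more HTML/JavaScript pages that need to be analyzed... If the data is worth the trouble, the data miner will do the required analysis.

IP-based limiting isn't effective at all - there are too many public proxy servers, and there is also TOR... :) It doesn't slow down the scraping (for someone who really wants your data).

Scraping data hidden in images is very hard (e.g. simply converting the data into images server-side). Employing "tesseract" (OCR) often helps - but honestly, the data must be worth the trouble for the scraper (and many times it isn't).

On the other hand, your users will hate you for this. Personally, (even when not scraping) I hate websites which don't allow copying the page content to the clipboard (because the information is in images, or (the silly ones) because they try to bind some custom JavaScript event to the right click. :)

The hardest are sites which use Java applets or Flash, where the applet itself uses secure https requests internally. But think twice - how happy will your iPhone users be... ;). For that reason, very few sites currently use them. Personally, I block all Flash content in my browser (in regular browsing sessions) - and never use sites which depend on Flash.

Your mileage may vary..., so you can try this method - just remember that you will probably lose some of your users. Also remember that some SWF files can be decompiled. ;)

Captcha (the good ones - like reCaptcha) helps a lot - but your users will hate you... - just imagine how much your users will love you when they need to solve a captcha on every page showing information about the music artists.

I probably don't need to continue - you get the picture.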

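Now what you should do:

Remember: it is nearly impossible to hide your data if, on the other hand, you want to publish it (in a friendly way) to your regular users.

So,

make your data easily accessible through some API
- this allows easy access to the data
- and it offloads scraping from your server - good for you
set up the right usage rights (e.g. the source must be cited)
remember that a lot of data is not copyrightable - and is hard to protect
add some fake data (as you have already done) and use legal tools
- as others have already said, send a "cease and desist letter"
- other legal actions (suing and the like) are probably too costly and hard to win (especially against non-US sites)

Think twice before trying to use technical barriers.

Rather than trying to block the data miners, just put more effort into your website's usability. Your users will love you. The time (and energy) invested in technical barriers usually isn't worth it - better to spend the time making an even better website...

Also, data thieves aren't like normal thieves.

If you buy an inexpensive home alarm and add a warning saying "this house is connected to the police", many thieves will not even try to break in. Because one wrong move by the thief - and he goes to jail...

So, you invest only a few bucks, but the thief invests and risks a lot.

But the data thief has no such risks. Just the opposite - if you make one wrong move (e.g. if you introduce some BUG as a result of the technical barriers), you will lose your users. If the scraping bot doesn't work the first time, nothing happens - the data miner will just try another approach and/or debug the script.

In this case, you need to invest much more - and the scraper invests much less.

Just think about where you want to invest your time & energy...

PS: English isn't my native language - so forgive my broken English...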

Answer 9

STW*_*STW 8

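Your best option is unfortunately fairly manual: look for traffic patterns that you believe are indicative of scraping and ban their IP addresses.

Since you're talking about a public site, making the site search-engine friendly will also make it scraping-friendly. If a search engine can crawl and scrape your site, then a malicious scraper can as well. It's a fine line to walk.

IP blocking will slow down a scraper, but it's also a lot of work for your server. Let's say I scrape you with 1000 proxies: I still got the data I wanted, and now your firewall is a mess. (4 upvotes)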

Answer 10

den*_*ees 8

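From a tech perspective: just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.

From a legal perspective: it sounds like the data you're publishing is not proprietary, meaning you're publishing names and stats and other information that cannot be copyrighted.

If this is the case, the scrapers are not violating copyright by redistributing your information about artist names etc. However, they may be violating copyright when they load your site into memory, because your site contains elements that are copyrightable (like layout etc).

I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far-reaching and imaginative. Sometimes the courts buy the arguments. Sometimes they don't.

But, assuming you're publishing public domain information that's not copyrightable, like names and basic stats... you should just let it go in the name of free speech and open data. That is what the web's all about.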

Answer 11

hoj*_*oju 8

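I have done a lot of web scraping, and on my blog I have summarized some techniques to stop web scrapers, based on what I find annoying.

It is a tradeoff between your users and scrapers. If you limit IPs, use CAPTCHAs, require login, etc, you make life difficult for the scrapers. But this may also drive away your genuine users.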

Answer 12

pgu*_*rio 8

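Things that might work against beginner scrapers:

IP blocking
use lots of ajax
check the referer request header
require login

Things that will help in general:

change your layout every week
robots.txt

Things that will help but will make your users hate you:

captchas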

Answer 13

Way*_*ner 7

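Sure it's possible. For 100% success, take your site offline.

In reality, you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).

You can do things like require several seconds between the first connection to your site and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.

I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.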

Answer 14

nat*_*han 6

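No, it's not possible to stop (in any way).
Embrace it. Why not publish as RDFa, become super search-engine friendly, and encourage the re-use of your data? People will thank you and give credit where due (see musicbrainz as an example).

It is probably not the answer you want, but why hide what you're trying to make public?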

Answer 15

tho*_*ter 6

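There are a few things you can do to try to prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but hinder usability. You have to keep in mind too that it may hinder legitimate site scrapers, such as search engine indexers.

However, I assume that if you don't want it scraped, that means you don't want search engines to index it either.

Here are some things you can try:

Show the text in an image. This is quite reliable, and is less of a pain for the user than a CAPTCHA, but means they won't be able to cut and paste, and it won't scale prettily or be accessible.
Use a CAPTCHA and require it to be completed before returning the page. This is a reliable method, but also the biggest pain to impose on a user.
Require the user to sign up for an account before viewing the pages, and confirm their email address. This will be pretty effective, but not totally - a screen scraper might set up an account and might cleverly program their script to log in for them.
If the client's user-agent string is empty, block access. A site-scraping script will often be lazily programmed and won't set a user-agent string, whereas all web browsers will.
You can set up a blacklist of known screen scraper user-agent strings as you discover them. Again, this will only help against the lazily coded ones; a programmer who knows what they're doing can set a user-agent string to impersonate a web browser.
Change the URL path often. When you change it, make sure the old one keeps working, but only for as long as one user is likely to have their browser open. Make it hard to predict what the new URL path will be. This will make it difficult for scripts to grab it if their URL is hard-coded. It'd be best to do this with some kind of script.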
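The last item (a URL path that changes often but keeps the old one alive for a while) could be scripted along the lines of the Python sketch below. `SECRET`, the one-hour period, and accepting exactly one previous period are assumptions for illustration.

```python
import hashlib
import hmac

SECRET = b"change-me"   # hypothetical server-side secret
PERIOD_SECONDS = 3600   # assumed rotation interval: one hour


def _token(period_index, secret):
    """Derive an unpredictable path token for one rotation period."""
    message = str(period_index).encode("ascii")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


def current_prefix(now, secret=SECRET):
    """Return the URL path prefix to use in links generated at time `now`."""
    return "/" + _token(int(now // PERIOD_SECONDS), secret)


def is_valid_prefix(prefix, now, secret=SECRET):
    """Accept the current prefix and the previous one, so open tabs keep working."""
    period = int(now // PERIOD_SECONDS)
    candidates = ["/" + _token(p, secret) for p in (period, period - 1)]
    return any(
        hmac.compare_digest(prefix.encode("utf-8"), candidate.encode("utf-8"))
        for candidate in candidates
    )


if __name__ == "__main__":
    import time
    prefix = current_prefix(time.time())
    print(prefix + "/artist/123", is_valid_prefix(prefix, time.time()))
```

Without the secret, the next prefix cannot be predicted, but a scraper that first loads a page and follows its links will still pick up the current one, so this only defeats hard-coded URLs.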

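If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way, and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them, I guess.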

Answer 16

小智 5

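Method One (small sites only): Serve encrypted/encoded data.
I scrape the web using Python (urllib, requests, beautifulSoup etc.) and have found many websites that serve encrypted/encoded data which can't be decrypted in any programming language, simply because the encryption method doesn't exist there.

I achieved this on a PHP website by encrypting and minimizing the output (WARNING: this is not a good idea for large sites); the response was always jumbled content.

Example of minimizing output in PHP (How to minify php page html output?):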

<?php
  function sanitize_output($buffer) {
    $search = array(
      '/\>[^\S ]+/s', // strip whitespaces after tags, except space
      '/[^\S ]+\</s', // strip whitespaces before tags, except space
      '/(\s)+/s'      // shorten multiple whitespace sequences
    );
    $replace = array('>', '<', '\\1');
    $buffer = preg_replace($search, $replace, $buffer);
    return $buffer;
  }
  ob_start("sanitize_output");
?>


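Method Two:
If you can't stop them, screw them over: serve fake/useless data as the response.

Method Three:
Block common scraping user agents. You'll see this on major/large websites, as it is impossible to scrape them with "python3.4" as your User-Agent.

Method Four:
Make sure all the user headers are valid. I sometimes provide as many headers as possible to make my scraper look like an authentic user; some of them are not even true or valid, like en-FU :).
Here is a list of some of the headers I commonly provide.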

headers = {
  "Requested-URI": "/example",
  "Request-Method": "GET",
  "Remote-IP-Address": "656.787.909.121",
  "Remote-IP-Port": "69696",
  "Protocol-version": "HTTP/1.1",
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
  "Accept-Encoding": "gzip,deflate",
  "Accept-Language": "en-FU,en;q=0.8",
  "Cache-Control": "max-age=0",
  "Connection": "keep-alive",
  "Dnt": "1",  
  "Host": "http://example.com",
  "Referer": "http://example.com",
  "Upgrade-Insecure-Requests": "1",
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
}


Archived: 16 years, 1 month ago
Views: 78,893
Last activity: 7 years, 8 months ago