pix*_*xel 287 html architecture screen-scraping piracy-prevention
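I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them).
How can I prevent screen scraping? Is it even possible?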
Jon*_*ica 300
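Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, with more tips and details.
In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers work and, by extension, what prevents them from working well.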
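There are various types of scraper, and each works differently:
Spiders, such as Google's bot or website copiers like HTtrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (Regex) to extract the data.
HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to shell-script regex based ones, these work by extracting data from pages based on patterns in the HTML, usually ignoring everything else.
For example: If your website has a search feature, such a scraper might submit a search request, and then get all the result links and their titles from the results page HTML, in order to specifically get only search result links and their titles. These are the most common.
Screenscrapers, based on eg. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
Getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using an HTML parser to extract the desired data. These are the most common, and so many of the methods for breaking HTML parsers / scrapers also work here.
Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
Web scraping services such as ScrapingHub or Kimono. In fact, there are people whose job is to figure out how to scrape your site and pull out the content for others to use.
Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and the people who pay them to do so) may not bother to scrape your website.
Embedding your website in other sites' pages with frames, and embedding your site in mobile apps.
While not technically scraping, mobile apps (Android and iOS) can embed websites and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
Human copy - paste: People will copy and paste your content in order to use it elsewhere.
There is a lot of overlap between these different kinds of scraper, and many scrapers will behave similarly, even if they use different technologies and methods.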
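These tips are mostly my own ideas, various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the internet.
You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:
Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.
Specifically, some ideas:
Rate limiting:
Only allow users (and scrapers) to perform a limited number of actions in a certain time - for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers and make them ineffective. You could also show a captcha if actions are completed too fast, or faster than a real user would complete them.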
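A minimal sketch of the rate-limiting idea above, assuming an in-memory, single-process setup. The class name `SlidingWindowLimiter`, its parameters, and the example key are all made up for illustration; a real site would more likely keep the counters in a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_actions` per `window_seconds` for each key (e.g. an IP)."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock  # injectable, so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # key -> timestamps of recent actions

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Forget actions that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_actions:
            return False  # over the limit: slow down, block, or show a captcha
        hits.append(now)
        return True
```

Calling `limiter.allow(ip)` before each search would then return `False` once an address exceeds the allowance; keying on a user or session ID instead of the IP works the same way.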
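Detect unusual activity:
If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
Don't just monitor and rate limit by IP address - use other indicators too:
If you do block or rate limit, don't just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:
How fast users fill out forms, and where on a button they click;
You can gather a lot of information with JavaScript, such as screen size / resolution, timezone, installed fonts, etc; you can use this to identify users.
HTTP headers and their order, especially User-Agent.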
As an example, if you get many requests from a single IP address, all using the same User Agent and screen size (determined with JavaScript), and the user (scraper in this case) always clicks on the button in the same way and at regular intervals, it's probably a screen scraper. You can temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address); this way you won't inconvenience real users on that IP address, eg. in case of a shared internet connection.
You can also take this further, as you can identify similar requests even if they come from different IP addresses, which is indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests that come from different IP addresses, you can block them. Again, be careful not to inadvertently block real users.
This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.
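One way to picture blocking on a combination of indicators rather than on the IP address alone is the sketch below. The choice of indicators (IP, User-Agent, screen size) follows the example above; the function and class names are hypothetical, and how the screen size reaches the server is left out.

```python
import hashlib


def fingerprint(ip, user_agent, screen_size):
    """Combine several indicators into one opaque key."""
    raw = "\n".join((ip, user_agent, screen_size))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class FingerprintBlocklist:
    """Blocks a specific combination of indicators, not a whole IP address."""

    def __init__(self):
        self._blocked = set()

    def block(self, ip, user_agent, screen_size):
        self._blocked.add(fingerprint(ip, user_agent, screen_size))

    def is_blocked(self, ip, user_agent, screen_size):
        return fingerprint(ip, user_agent, screen_size) in self._blocked
```

Because the key includes more than the address, a real user sharing the same IP but with a different browser or screen size is not caught by the block. To catch distributed scraping, the same idea could be applied with the IP left out of the key, at a higher risk of hitting real users.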
Related questions on Security Stack Exchange:
How to uniquely identify users with the same external IP address? for more details, and
Why do people use IP address bans when IP addresses often change? for info on the limits of these methods.
Instead of temporarily blocking access, use a Captcha:
The simple way to implement rate limiting is to temporarily block access for a certain amount of time; however, using a Captcha may be better. See the section on Captchas further down.
Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.
In order to avoid scripts creating many accounts, you should:
Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
Require a captcha to be solved during registration/account creation.
Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.
Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.
Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.
Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.
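As a rough sketch of the hosting-provider check, Python's standard `ipaddress` module can test whether a client address falls inside a list of networks. The networks below are reserved documentation ranges used as placeholders, not real provider ranges; the real lists would have to come from the providers' own published data.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real cloud ranges.
HOSTING_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_hosting_ip(address, networks=HOSTING_NETWORKS):
    """Return True if `address` belongs to one of the listed networks."""
    ip = ipaddress.ip_address(address)
    # An IPv4 address is simply never "in" an IPv6 network, and vice versa.
    return any(ip in network for network in networks)
```

A request for which `is_hosting_ip(client_ip)` is true could then get a captcha instead of the normal page.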
If you do block/limit access, you should ensure that you don't tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:
Too many requests from your IP address, please try again later.
Error, User Agent header not present !
Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:
Sorry, something went wrong. You can contact support via helpdesk@example.com, should the problem persist.
This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don't block legitimate users and cause them to contact you.
Captchas ("Completely Automated Public Turing test to tell Computers and Humans Apart") are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.
As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.
Things to be aware of when using Captchas:
Don't roll your own; use something like Google's reCaptcha: it's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site.
Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself, (although quite well hidden) thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid, humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.
You can render text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.
However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it's also easy to circumvent with some OCR, so don't do it.
You can do something similar with CSS sprites, but that suffers from the same problems.
If feasible, don't provide a way for a script/bot to get all of your dataset. As an example: You have a news site, with lots of individual articles. You could make those articles be only accessible by searching for them via the on site search, and, if you don't have a list of all the articles on the site and their URLs anywhere, those articles will be only accessible by using the search feature. This means that a script wanting to get all the articles off your site will have to do searches for all possible phrases which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.
This will be ineffective if:
Your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data, it is trivial to look at the network requests from the page and figure out where those requests are going to, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described.
Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers.
Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.
If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.
You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids/classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.
If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after a h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add/remove extra markup to your HTML, periodically and randomly, eg. adding extra divs or spans. With modern server side HTML processing, this should not be too hard.
Things to be aware of:
It will be tedious and difficult to implement, maintain, and debug.
You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.
Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.
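A possible way to automate the rotating ids/classes is to derive each name from a secret key and the current period, as in this sketch. Everything here is an assumption for illustration: the `rotated_name` helper, the `SECRET_KEY` value, and the use of a week label as the period.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical secret, kept on the server only


def rotated_name(logical_name, period, secret=SECRET_KEY):
    """Map a stable logical name (e.g. "article-content") to a per-period class name."""
    message = f"{period}:{logical_name}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Vary the length as well, so a scraper cannot match on "any N characters".
    length = 8 + int(digest[:2], 16) % 12
    # Start with a letter so the result is always a valid CSS class name.
    return "c" + digest[2:2 + length]
```

The templates, CSS, and JavaScript would all have to call the same helper with the same period (for example a label like "2024-W07"), which is exactly the maintenance and caching cost mentioned above.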
See also How to prevent crawlers depending on XPath from getting page contents for details on how this can be implemented in PHP.
This is sort of similar to the previous tip. If you serve different HTML based on your user's location/country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.
An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML:
<div class="search-result">
  <h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class="search-result-link" href="/stories/story-link">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)
As you may have guessed this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed:
<div class="the-real-search-result">
  <h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class="the-real-search-result-link" href="/stories/story-link">Read more</a>
</div>
<div class="search-result" style="display:none">
  <h3 class="search-result-title">Visit Example.com now, for all the latest Stack Overflow related news !</h3>
  <p class="search-result-excerpt">Example.com is so awesome, visit now !</p>
  <a class="search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)
This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they're hidden with CSS.
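To see why this works, here is a small sketch of the kind of naive, class-based scraper being targeted, run against a cut-down version of the poisoned markup. The `TitleScraper` class and the shortened page are made up for illustration; only Python's standard `html.parser` is used.

```python
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Naive scraper: collects the text of every <h3 class="search-result-title">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h3" and dict(attrs).get("class") == "search-result-title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


PAGE = """
<div class="the-real-search-result">
  <h3 class="the-real-search-result-title">Real headline</h3>
</div>
<div class="search-result" style="display:none">
  <h3 class="search-result-title">Decoy headline</h3>
</div>
"""

scraper = TitleScraper()
scraper.feed(PAGE)
print(scraper.titles)  # ['Decoy headline']
```

The scraper keeps "working" and never notices that it only picked up the hidden decoy, since it does not evaluate CSS.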
Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:
<div class="search-result" style="display:none">
  <h3 class="search-result-title">This search result is here to prevent scraping</h3>
  <p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
  Note that clicking the link below will block access to this site for 24 hours.</p>
  <a class="search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)
A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either because you disallowed /scrapertrap/ in your robots.txt.
You can make your scrapertrap.php do something like block access for the IP address that visited it or force a captcha for all subsequent requests from that IP.
Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.
You can/should combine this with the previous tip of changing your HTML frequently.
Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. You may also want to consider replacing the inline CSS used for hiding with an ID attribute and external CSS, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.
Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (eg. on a forum). Change the URL frequently, and make any ban times relatively short.
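The bookkeeping behind such a honeypot could look roughly like the sketch below: the trap URL's handler records the visitor, and every other request checks the list. The `HoneypotBans` class is hypothetical and in-memory only, and the one-hour default is just an example of a "relatively short" ban.

```python
import time


class HoneypotBans:
    """Remembers which IPs hit the honeypot, with bans that expire on their own."""

    def __init__(self, ban_seconds=3600, clock=time.monotonic):
        self.ban_seconds = ban_seconds
        self.clock = clock  # injectable for testing
        self._banned_until = {}  # ip -> time at which the ban ends

    def trap_visited(self, ip):
        """Call this from the handler behind the honeypot URL."""
        self._banned_until[ip] = self.clock() + self.ban_seconds

    def is_banned(self, ip):
        """Call this on every other request; banned IPs get a block or a captcha."""
        until = self._banned_until.get(ip)
        if until is None:
            return False
        if self.clock() >= until:
            del self._banned_until[ip]  # ban expired, forget it
            return False
        return True
```

Expiring bans limit the damage when someone tricks real users into loading the trap URL.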
If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.
As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.
Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.
If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else..)
It's trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.
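A sketch of that check, assuming the request headers are available as a plain dict; the function name and the "captcha" / "allow" return values are made up for illustration.

```python
def action_for_request(headers):
    """Decide what to do with a request based on its User-Agent header."""
    # HTTP header names are case-insensitive, so compare them in lower case.
    user_agent = next(
        (value for name, value in headers.items() if name.lower() == "user-agent"),
        "",
    )
    if not user_agent.strip():
        return "captcha"  # missing or empty: likely a lazily written scraper
    return "allow"
```

For example, `action_for_request({"Accept": "*/*"})` returns "captcha", while any request carrying a non-empty User-Agent passes this particular check.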
Dan*_*ien 239
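I will presume that you have set up robots.txt.
As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.
I would consider:
1. Set up a page, /jail.html.
2. Disallow access to the page in robots.txt (so the respectful spiders will never visit).
3. Place a link on one of your pages, hiding it with CSS (display: none).
4. Record IP addresses of visitors to /jail.html.
This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.
You might also want to make your /jail.html a whole entire website that has the same, exact markup as normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.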
Uni*_*ron 47
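Sue 'em.
Seriously: If you have some money, talk to a good, nice, young lawyer who knows their way around the Internet. You could really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease and desist letter or its equivalent in your country. You may be able to at least scare the bastards.
Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.
It would be a shame if this drove you into messing up your HTML code, dragging down SEO, validity and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that always rely on HTML structures and class/ID names to get the content out).
Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money is something that you should be able to fight against.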
rye*_*guy 35
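You really can't completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc., and appear as a normal user. The only thing you can do is make the text unavailable at the time the page is loaded - render it as an image, use Flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.
If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow, it could break their scraper: things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure if it's worth it. And even then, they could probably get around it with enough dedication.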
Wil*_*and 31
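Provide an XML API to access your data, in a manner that is simple to use. If people want your data, they'll get it, so you might as well go all out.
This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.
Then all you have to do is convince the people who want your data to use the API. ;)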
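As a rough idea of what the response of such an API could be built from, here is a sketch using Python's standard `xml.etree.ElementTree`. The element names (`artists`, `artist`, `name`) and the record layout are invented for the example; the real schema would depend on your data.

```python
import xml.etree.ElementTree as ET


def artists_to_xml(artists):
    """Serialize a list of {"id": ..., "name": ...} records to an XML string."""
    root = ET.Element("artists")
    for artist in artists:
        node = ET.SubElement(root, "artist", id=str(artist["id"]))
        ET.SubElement(node, "name").text = artist["name"]
    return ET.tostring(root, encoding="unicode")


print(artists_to_xml([{"id": 1, "name": "Example Artist"}]))
# <artists><artist id="1"><name>Example Artist</name></artist></artists>
```

Serving only the fields you choose, in one compact response, is what keeps API consumers from crawling every HTML page.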
Ars*_*eep 12
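Okay, as all the posts say, if you want to make it search engine-friendly then bots can scrape it for sure.
But you can still do a few things, and they may be effective against 60-70% of scraping bots.
Make a checker script like the one below.
If a particular IP address is visiting very fast, then after a few visits (5-10) put its IP address + browser information in a file or database.
(This would be a background process, running all the time or scheduled every few minutes.) Make another script that keeps checking those suspicious IP addresses.
Case 1. The user agent is that of a known search engine like Google, Bing, or Yahoo (you can find more information on user agents by googling). Then you must look at http://www.iplists.com/ and try to match the patterns in that list. If it looks like a faked user agent, ask for a CAPTCHA to be filled in on the next visit. (You need to research bots' IP addresses a bit more. I know this is achievable; also try a whois lookup of the IP address. It can be helpful.)
Case 2. The user agent is not that of a search bot: simply ask for a CAPTCHA to be filled in on the next visit.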
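The two cases could be sketched like this. The token list is illustrative and not exhaustive, and `ip_is_known_bot` stands for whatever lookup you build from published crawler IP lists or whois data; it is passed in as a function because that part is not specified here.

```python
# Illustrative substrings found in some search engine user agents (not exhaustive).
SEARCH_BOT_TOKENS = ("googlebot", "bingbot", "slurp")


def action_for_suspicious_ip(user_agent, ip, ip_is_known_bot):
    """Decide what a flagged (fast-visiting) IP address gets on its next visit."""
    claims_search_bot = any(token in user_agent.lower() for token in SEARCH_BOT_TOKENS)
    if claims_search_bot and ip_is_known_bot(ip):
        return "allow"  # Case 1: user agent and IP both match a search engine
    # Case 1 with a faked user agent, or Case 2 (not a search bot at all).
    return "captcha"
```

For instance, with `known = {"192.0.2.10"}` as a stand-in list, `action_for_suspicious_ip("Googlebot/2.1", "192.0.2.10", known.__contains__)` returns "allow", while the same user agent from any other address returns "captcha".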
jm6*_*666 10
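Late answer - and this answer probably isn't the one you want to hear...
I have already written many (many tens of) different specialized data-mining scrapers (just because I like the "open data" philosophy).
There is already a lot of advice in the other answers - now I will play the devil's advocate and extend and/or correct their effectiveness.
First:
Trying to use technical barriers isn't worth the trouble it causes: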
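Plain HTML - the easiest thing to scrape is a plain HTML page with a well-defined structure and CSS classes. E.g. it is enough to inspect an element with Firebug and use the right XPath and/or CSS path in my scraper.
You could generate the HTML structure dynamically, and you could also generate the CSS class names (and the CSS itself) dynamically (e.g. by using some random class names) - but
You can't change the structure for every response, because your regular users will hate you. Also, this will cause more trouble for you (maintenance) than for the scraper. The XPath or CSS path can be determined automatically by the scraping script from known content.
Ajax - a little harder at the start, but it often speeds up the scraping process :) - why?
When analyzing the requests and responses, I just set up my own proxy server (written in Perl) and point my Firefox at it. Of course, because it is my own proxy, it is completely hidden - the target server sees it as a regular browser (so no X-Forwarded-For and similar headers). Based on the proxy logs, it is usually possible to determine the "logic" of the Ajax requests, e.g. I can skip most of the HTML scraping and just use the well-structured Ajax responses (mostly in JSON format).
So, Ajax doesn't help much...
Somewhat more complicated are pages that use a lot of packed JavaScript functions.
Here it is possible to use two basic methods:
Such scraping is slow (the scraping is done as in a regular browser), but it is
User-Agent based filtering doesn't help at all. Any serious data miner will set it to a correct one in their scraper.
Requiring login - doesn't help. The simplest way to beat it (without any analysis and/or scripting of the login protocol) is just to log into the site as a regular user using Mozilla, and afterwards run the Mozrepl-based scraper...
Remember, requiring login helps against anonymous bots, but doesn't help against someone who wants to scrape your data. They just register on your site as a regular user.
Using frames isn't very effective either. This is used by many live movie services and is not very hard to beat. The frames are simply more HTML/JavaScript pages that need to be analyzed... If the data is worth the trouble, the data miner will do the required analysis.
IP-based limiting isn't effective at all - there are too many public proxy servers, and there is also Tor... :) It doesn't slow down the scraping (for someone who really wants your data).
Very hard to scrape is data hidden in images (e.g. simply converting the data into images server-side). Employing "tesseract" (OCR) often helps - but honestly, the data must be worth the trouble for the scraper (which it often isn't).
On the other hand, your users will hate you for this. I myself (even when not scraping) hate websites that don't allow copying the page content to the clipboard (because the information is in images, or (the silly ones) because they try to bind some custom JavaScript event to the right click). :)
The hardest are sites that use Java applets or Flash, where the applet itself uses secure HTTPS requests internally. But think twice - how happy will your iPhone users be... ;) Therefore, very few sites currently use them. I myself block all Flash content in my browser (in regular browsing sessions) and never use sites that depend on Flash.
Your mileage may vary..., so you can try this method - just remember that you will probably lose some of your users. Also remember that some SWF files are decompilable. ;)
Captcha (the good ones, like reCaptcha) helps a lot - but your users will hate you... Just imagine how your users will love you when they need to solve a captcha on every page showing information about the music artists.
I probably don't need to continue - you already get the picture.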
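Now, what you should do:
Remember: it is nearly impossible to hide your data if, on the other hand, you want to publish it (in a friendly way) to your regular users.
So,
Think twice before you try to use technical barriers.
Rather than trying to block the data miners, just put more effort into your website's usability. Your users will love you. The time (and energy) invested in technical barriers usually isn't worth it - better to spend the time making an even better website...
Also, data thieves aren't like normal thieves.
If you buy an inexpensive home alarm and add a warning "this house is connected to the police", many thieves will not even try to break in, because one wrong move and they go to jail...
So you invest only a few bucks, but the thief invests and risks a lot.
But the data thief has no such risks. Just the opposite - if you make one wrong move (e.g. if you introduce a bug as a result of technical barriers), you will lose your users. If the scraping bot doesn't work the first time, nothing happens - the data miner will just try another approach and/or debug the script.
In this case, you need to invest much more - and the scraper much less.
Just think about where you want to invest your time and energy...
PS: English isn't my native language - so forgive my broken English...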
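Your best option is unfortunately fairly manual: look for traffic patterns that you believe are indicative of scraping and ban their IP addresses.
Since you're talking about a public site, making the site search-engine friendly will also make it scraping-friendly. If a search engine can crawl and scrape your site, then a malicious scraper can as well. It's a fine line to walk.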
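From a tech perspective: Just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.
From a legal perspective: It sounds like the data you're publishing is not proprietary, meaning you're publishing names and stats and other information that cannot be copyrighted.
If this is the case, the scrapers are not violating copyright by redistributing your information about artist names etc. However, they may be violating copyright when they load your site into memory, because your site contains elements that are copyrightable (like layout etc.).
I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far-reaching and imaginative. Sometimes the courts buy the arguments. Sometimes they don't.
But, assuming you're publishing public domain information that's not copyrightable, like names and basic stats... you should just let it go in the name of free speech and open data. That is what the web's all about.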
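Things that might work against beginner scrapers:
Things that will help in general:
Things that will help but will make your users hate you: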
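Sure it's possible. For 100% success, take your site offline.
In reality, you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).
You can do things like require several seconds between the first connection to your site and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.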
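One possible way to do the delay check, as a sketch only: remember when each session was first seen and refuse follow-up requests that arrive too soon. The `FirstVisitDelay` class, the session ID as key, and the 3-second default are all assumptions; the right delay would need experimenting.

```python
import time


class FirstVisitDelay:
    """Require a minimum delay between a session's first request and later ones."""

    def __init__(self, min_seconds=3.0, clock=time.monotonic):
        self.min_seconds = min_seconds
        self.clock = clock  # injectable for testing
        self._first_seen = {}  # session id -> time of first request

    def allow(self, session_id):
        now = self.clock()
        if session_id not in self._first_seen:
            self._first_seen[session_id] = now
            return True  # the first request is always served
        # Later requests are served only once enough time has passed.
        return now - self._first_seen[session_id] >= self.min_seconds
```

A human needs a few seconds to read a page before clicking anyway, while a script that fires its second request immediately gets refused (or a captcha).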
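I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.
This is probably not the answer you want, but why hide what you're trying to make public?
There are a few things you can do to try to prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but hinder usability. You also have to keep in mind that it may hinder legitimate site scrapers, such as search engine indexes.
However, I assume that if you don't want it scraped, that means you don't want search engines to index it either.
Here are some things you can try:
If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way, and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them.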
小智 5
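Method One (small sites only): serve encrypted / encoded data.
I scrape the web using Python (urllib, requests, BeautifulSoup, etc.) and have found many websites that serve encrypted / encoded data that cannot be decrypted in any programming language, simply because the encryption method does not exist.
I achieved this on a PHP website by encrypting and minimizing the output (WARNING: this is not a good idea for large sites); the response was always jumbled content.
Example of minimizing output in PHP (How to minify php page html output?):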
<?php
function sanitize_output($buffer) {
    $search = array(
        '/\>[^\S ]+/s', // strip whitespaces after tags, except space
        '/[^\S ]+\</s', // strip whitespaces before tags, except space
        '/(\s)+/s'      // shorten multiple whitespace sequences
    );
    $replace = array('>', '<', '\\1');
    $buffer = preg_replace($search, $replace, $buffer);
    return $buffer;
}
ob_start("sanitize_output");
?>
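Method Two:
If you can't stop them, screw them over: serve fake / useless data as a response.
Method Three:
Block common scraping user agents. You'll see this on major / large websites, as it is impossible to scrape them with "python3.4" as your User-Agent.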
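Method Four:
Make sure all the user headers are valid. I sometimes provide as many headers as possible to make my scraper seem like an authentic user; some of them are not even true or valid, like en-FU :).
Here is a list of some of the headers I commonly provide.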
headers = {
    "Requested-URI": "/example",
    "Request-Method": "GET",
    "Remote-IP-Address": "656.787.909.121",
    "Remote-IP-Port": "69696",
    "Protocol-version": "HTTP/1.1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-FU,en;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Dnt": "1",
    "Host": "http://example.com",
    "Referer": "http://example.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
}