Che*_*ian 107
For a simple way to start, try jQuery:
$("#links").load("/Main_Page #jq-p-Getting-Started li");
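More on this in the jQuery docs.
Another way to do screen scraping in a more structured way is to use YQL, the Yahoo Query Language. It returns the scraped data structured as JSON or XML.
For example,
let's scrape stackoverflow.com: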
select * from html where url="http://stackoverflow.com"
This gives you a JSON structure like the following (I chose the JSON output option):
"results": {
"body": {
"noscript": [
{
"div": {
"id": "noscript-padding"
}
},
{
"div": {
"id": "noscript-warning",
"p": "Stack Overflow works best with JavaScript enabled"
}
}
],
"div": [
{
"id": "notify-container"
},
{
"div": [
{
"id": "header",
"div": [
{
"id": "hlogo",
"a": {
"href": "/",
"img": {
"alt": "logo homepage",
"height": "70",
"src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
"width": "250"
}
……..
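The beauty of this is that you can use projections and where clauses, which get you the data you need and only the data you need (so less data ends up going over the wire).
For example,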
select * from html where url="http://stackoverflow.com" and
xpath='//div/h3/a'
will get you
"results": {
"a": [
{
"href": "/questions/414690/iphone-simulator-port-for-windows-closed",
"title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
"content": "iphone\n simulator port for windows [closed]"
},
{
"href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
"title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
"content": "How\n to redirect the web page in flex application ?"
},
…..
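The projection and where-clause pattern used in these queries could be wrapped in a small helper. This is only a sketch: `buildYqlQuery` and its parameter names are made up for illustration, it assumes the `html` table and the `url` / `xpath` keys behave as in the queries shown in this answer, and it does not escape quotes in its inputs.

```javascript
// Hypothetical helper (not part of YQL or jQuery): composes a statement of
// the same shape as the queries in this answer.
function buildYqlQuery(pageUrl, xpath, fields) {
  // Projection: the listed fields, or "*" when none are given.
  var projection = fields && fields.length ? fields.join(", ") : "*";
  var query = 'select ' + projection + ' from html where url="' + pageUrl + '"';
  // Where clause: narrow the result to nodes matching the XPath, if one is given.
  if (xpath) {
    query += " and xpath='" + xpath + "'";
  }
  return query;
}

// Usage: builds the same kind of XPath-filtered query as the one above.
// buildYqlQuery("http://stackoverflow.com", "//div/h3/a");
```

Passing a third argument such as `["title"]` would narrow the projection to just that field.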
Now, to get only the questions, we do
select title from html where url="http://stackoverflow.com" and
xpath='//div/h3/a'
Note the title in the projection:
"results": {
"a": [
{
"title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
},
{
"title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
},
{
"title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
},
{
"title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
},
{
……
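Once a response like this has been parsed, pulling out the titles is a short loop. The sketch below assumes the object passed in has a `results` property shaped like the fragment above; the real response may wrap it in further levels, and `extractTitles` is a hypothetical name.

```javascript
// Hypothetical helper: collects the "title" strings from an object shaped
// like { results: { a: [ { title: "..." }, ... ] } }.
function extractTitles(response) {
  var results = response && response.results;
  var anchors = results && results.a;
  if (!anchors) {
    return [];
  }
  // Assumption: a single match may arrive as one object rather than an array.
  var list = Array.isArray(anchors) ? anchors : [anchors];
  return list
    .filter(function (a) { return a && typeof a.title === "string"; })
    .map(function (a) { return a.title.trim(); });
}
```

Entries without a `title` are skipped rather than reported as errors, which seems the safer default for scraped data.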
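Once you have written your query, it generates a URL for you, in our case for the title query above.
So in the end you wind up doing something like this: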
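// $.getJSON is asynchronous: the parsed response arrives in the callback, not as a return value
$.getJSON(theAboveUrl, function (titleList) { console.log(titleList); });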
and play with it.
Beautiful, isn't it?
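If you would rather build the request URL in code than copy it from the query console, something like the following might work. It is a sketch under stated assumptions: the endpoint is passed in as a parameter because the generated URL is not reproduced in this answer, the `q` and `format` parameter names are assumptions, and `buildQueryUrl` is a hypothetical helper.

```javascript
// Hypothetical helper: attaches the query text and output format to a
// service endpoint as URL-encoded query-string parameters.
function buildQueryUrl(endpoint, query, format) {
  var url = new URL(endpoint);
  url.searchParams.set("q", query);                  // assumed parameter name
  url.searchParams.set("format", format || "json");  // assumed parameter name
  return url.toString();
}

// Usage (placeholder endpoint; substitute the one the query console gives you):
// var theAboveUrl = buildQueryUrl("https://example.invalid/yql",
//     'select title from html where url="http://stackoverflow.com"');
// $.getJSON(theAboveUrl, function (data) { console.log(data); });
```

Letting `URL` and `searchParams` do the encoding avoids hand-escaping the quotes and slashes inside the query.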
kar*_*m79 30
JavaScript can be used, as long as you grab whatever page you're after via a proxy on your domain:
<html>
<head>
<script src="/js/jquery-1.3.2.js"></script>
</head>
<body>
<script>
// Without a scheme the URL would be treated as a relative path, so include http://
$.get("http://www.mydomain.com/?url=www.google.com", function(response) {
alert(response)
});
</script>
</body>
</html>
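You just hit the desired URL with XmlHttp (AJAX), and the HTML response from that URL is available in the responseText property. If it is not the same domain, your users will get a browser alert saying something like "This page is trying to access a different domain. Do you want to allow this?"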
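The proxy itself is not shown above. As a rough sketch of what the server side of the `?url=` endpoint could look like in Node, assuming a global `fetch` (Node 18+): `targetFromRequest` and `handleProxy` are hypothetical names, and treating a missing scheme as plain http is an assumption made to match the example URL.

```javascript
// Hypothetical helper: extracts and validates the target from a request
// path such as "/?url=www.google.com". Returns null when it is unusable.
function targetFromRequest(requestPath) {
  var params = new URL(requestPath, "http://localhost").searchParams;
  var raw = params.get("url");
  if (!raw) {
    return null;
  }
  // Assumption: a missing scheme means plain http.
  var withScheme = /^https?:\/\//i.test(raw) ? raw : "http://" + raw;
  try {
    return new URL(withScheme).toString();
  } catch (e) {
    return null; // not a valid URL
  }
}

// Request handler for Node's built-in http module: fetches the target page
// and relays its body to the caller.
async function handleProxy(req, res) {
  var target = targetFromRequest(req.url);
  if (!target) {
    res.writeHead(400, { "Content-Type": "text/plain" });
    res.end("Missing or invalid url parameter");
    return;
  }
  try {
    var upstream = await fetch(target);
    var body = await upstream.text();
    res.writeHead(upstream.status, { "Content-Type": "text/html; charset=utf-8" });
    res.end(body);
  } catch (err) {
    res.writeHead(502, { "Content-Type": "text/plain" });
    res.end("Upstream request failed");
  }
}

// Usage: require("http").createServer(handleProxy).listen(8080);
```

A proxy like this will fetch any address it is handed, so a real deployment would probably restrict which hosts may be requested.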
小智 6
You can use fetch:
const URL = 'https://www.sap.com/belgique/index.html';
fetch(URL)
.then(res => res.text())
.then(text => {
console.log(text);
})
.catch(err => console.log(err));
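One thing the snippet above does not cover: `fetch` only rejects on network failures, so a 404 or 500 page would still be logged as if it were the content. A possible variant with a status check follows; `ensureOk` and `fetchPageText` are hypothetical names, and the optional `fetchImpl` parameter exists only so the function can be exercised without a network. In a browser the request is still subject to cross-origin restrictions.

```javascript
// Throws for non-2xx responses, otherwise passes the response through.
function ensureOk(res) {
  if (!res.ok) {
    throw new Error("Request failed with status " + res.status);
  }
  return res;
}

// Hypothetical wrapper: resolves to the page body as text, or rejects on
// network errors and on HTTP error statuses.
async function fetchPageText(url, fetchImpl) {
  var doFetch = fetchImpl || fetch;
  var res = ensureOk(await doFetch(url));
  return res.text();
}

// Usage:
// fetchPageText("https://www.sap.com/belgique/index.html")
//   .then(function (text) { console.log(text.length); })
//   .catch(function (err) { console.error(err.message); });
```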