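I know how to get all the visible plain text on a page:

    const text = await page.$eval('*', el => el.innerText);

But I also need to know which element each piece of text belongs to, and I can't find a way to do that.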
In the browser, you can do this with a TreeWalker, which visits text nodes in document order. Here's an example that uses sample content from the Web Scraper Testing Ground:

    const IGNORE = ["style", "script"];

    // Visit only the text nodes under <body>, in document order
    const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);

    const pairs = [];

    let node;

    while ((node = walker.nextNode()) !== null) {
      // tagName is uppercase in HTML documents, so normalize it before
      // comparing against the lowercase names in IGNORE
      const parent = node.parentNode.tagName.toLowerCase();

      if (IGNORE.includes(parent)) {
        continue;
      }

      const value = node.nodeValue.trim();

      // skip whitespace-only text nodes
      if (value.length === 0) {
        continue;
      }

      pairs.push([parent, value]);
    }

    console.log(pairs);

The sample markup the snippet runs against:

    <div id="topbar"></div>
        <a href="/" style="text-decoration: none">
          <div id="title">WEB SCRAPER TESTING GROUND</div>
          <div id="logo"></div>
        </a>
        <div id="content">
    <h1>BLOCKS: Price List </h1>
    <div id="caseinfo">In this test, the web scraper needs to scrape a price list organized in a block layout. Specifically, it has to:
      <ol>
        <li>Extract all the products (their names, descriptions and prices), while skipping advertisements</li>
        <li>Scrape discounted products only</li>
        <li>Scrape products with red prices only</li>
      </ol>
    <p>
    </p><p>There is a <b>ver</b> parameter (which varies from 1 to 5) to show different table versions (with different product numbers, best price and advertisement positions).</p>
    <p>Also there are two tables presented:
      </p><ul>
        <li><b>Case 1</b> (simple one, with products and prices placed into the same block)
        </li><li><b>Case 2</b> (complicated one, with products and prices placed into separate blocks)</li>
      </ul>
    <p></p>
    <p>For testing, you may use the following sample links. The scraper should sufficiently scrape all data from a certain case using the same project:
    </p><ul>
      <li><a href="/blocks?ver=1">Price list 1</a></li>
      <li><a href="/blocks?ver=2">Price list 2</a></li>
      <li><a href="/blocks?ver=3">Price list 3</a></li>
      <li><a href="/blocks?ver=4">Price list 4</a></li>
      <li><a href="/blocks?ver=5">Price list 5</a></li>
    </ul>
    <p></p>
    </div>

    <div id="case_blocks">

    <h2>Case 1</h2>
    <div id="case1">
    <div class="prod2"><span style="float: left"><div class="name">Dell Latitude D610-1.73 Laptop Wireless Computer</div>2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB, Microsoft Windows XP Professional</span><span style="float: right">$239.95</span></div><div class="prod1"><span style="float: left"><div class="name">Samsung Chromebook (Wi-Fi, 11.6-Inch)</div>1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome</span><span style="float: right" class="best">$249.00</span><span style="float: right;margin-right:10px" class="best">BEST<br>PRICE!</span></div><div class="ads">ADVERTISEMENT</div><div class="prod2"><span style="float: left"><div class="name">Apple MacBook Pro MD101LL/A 13.3-Inch Laptop (NEWEST VERSION)</div>2.5 GHz Intel Core i5, 4 GB DDR3 SDRAM, 500 GB Serial ATA, Mac OS X v10.7 Lion</span><span style="float: right">$1,099.99</span></div><div class="prod1"><span style="float: left"><div class="name">Acer Aspire AS5750Z-4835 15.6-Inch Laptop (Black)</div>2 GHz Pentium B940, 4 GB SDRAM, 500 GB, Windows 7 Home Premium 64-bit</span><span style="float: right" class="best">$385.72</span><span style="float: right;margin-right:10px" class="best">BEST<br>PRICE!</span></div><div class="ads">ADVERTISEMENT</div><div class="prod2"><span style="float: left"><div class="name">HP Pavilion g7-2010nr 17.3-Inch Laptop (Black)</div>2.3 GHz Core i3-2350M, 6 GB SDRAM, 640 GB, Windows 7 Home Premium 64-bit</span><span style="float: right">$549.99<div class="disc">discount 7%</div></span></div><div class="prod1"><span style="float: left"><div class="name">ASUS A53Z-AS61 15.6-Inch Laptop (Mocha)</div>1.4 GHz A-Series Quad-Core A6-3420M, 4 GB DIMM, 750 GB, Windows 7 Home Premium 64-bit</span><span style="float: right">$399.99</span></div></div>

    <h2 style="margin-top: 50px">Case 2</h2>
    <div id="case2">
    <div class="left"><div class="prod2"><div class="name">Dell Latitude D610-1.73 Laptop Wireless Computer</div>2 GHz Intel Pentium M, 1 GB DDR2 SDRAM, 40 GB, Microsoft Windows XP Professional</div><div class="prod1"><div class="name">Samsung Chromebook (Wi-Fi, 11.6-Inch)</div>1.7 GHz, 2 GB DDR3 SDRAM, 16 GB, Chrome</div><div class="ads">ADVERTISEMENT</div><div class="prod2"><div class="name">Apple MacBook Pro MD101LL/A 13.3-Inch Laptop (NEWEST VERSION)</div>2.5 GHz Intel Core i5, 4 GB DDR3 SDRAM, 500 GB Serial ATA, Mac OS X v10.7 Lion</div><div class="prod1"><div class="name">Acer Aspire AS5750Z-4835 15.6-Inch Laptop (Black)</div>2 GHz Pentium B940, 4 GB SDRAM, 500 GB, Windows 7 Home Premium 64-bit</div></div><div class="right"><div class="price2">$239.95</div><div class="price1 best">$249.00</div><div class="ads"></div><div class="price2">$1,099.99</div><div class="price1 best">$385.72</div></div><div class="ads" style="clear: both">ADVERTISEMENT</div><div class="left"><div class="prod2"><div class="name">HP Pavilion g7-2010nr 17.3-Inch Laptop (Black)</div>2.3 GHz Core i3-2350M, 6 GB SDRAM, 640 GB, Windows 7 Home Premium 64-bit</div><div class="prod1"><div class="name">ASUS A53Z-AS61 15.6-Inch Laptop (Mocha)</div>1.4 GHz A-Series Quad-Core A6-3420M, 4 GB DIMM, 750 GB, Windows 7 Home Premium 64-bit</div></div><div class="right"><div class="price2">$549.99<div class="disc">discount 7%</div></div><div class="price1">$399.99</div></div></div>

    </div>
    <br><br><br>
        </div>

Following Grant Miller's answer, browser-side code like this can be called from Puppeteer with page.evaluate:

    // The callback runs in the page context, so it can use DOM APIs; the
    // returned array of plain objects is serialized back to Node.
    const pairs = await page.evaluate(() => {
      const IGNORE = ["style", "script"];
      const NONWHITESPACE_RE = /\S/;

      const result = document.evaluate(
        "//*[child::text()]",
        document,
        null,
        XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
        null
      );

      const pairs = [];

      for (let i = 0, j = result.snapshotLength; i < j; i++) {
        const element = result.snapshotItem(i);

        if (IGNORE.includes(element.tagName.toLowerCase())) {
          continue;
        }

        const nodes = [...element.childNodes];

        for (const node of nodes) {
          if (node.nodeType !== document.TEXT_NODE) {
            continue;
          }

          if (node.nodeValue.search(NONWHITESPACE_RE) === -1) {
            continue;
          }

          pairs.push({
            tag: element.tagName.toLowerCase(),
            text: node.nodeValue.trim()
          });
        }
      }

      return pairs;
    });

    console.log(pairs);

Here is the original version of the browser-side function. It uses XPath, but it always emits an element's direct text children before the text of its nested descendants, so document order is not preserved when text and child elements are mixed:

    const IGNORE = ["style", "script"];
    const NONWHITESPACE_RE = /\S/;

    // select every element that owns at least one text node
    const result = document.evaluate(
      // matches any element in the document that has at least one direct
      // text node child, including whitespace-only nodes
      "//*[child::text()]",
      document,
      null,
      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
      null
    );

    // the result doesn't use the JavaScript iterator protocol, so we have
    // to manually iterate over the elements
    const pairs = [];

    for (let i = 0, j = result.snapshotLength; i < j; i++) {
      const element = result.snapshotItem(i);

      if (IGNORE.includes(element.tagName.toLowerCase())) {
        continue;
      }

      const nodes = [...element.childNodes];

      for (const node of nodes) {
        if (node.nodeType !== document.TEXT_NODE) {
          continue;
        }

        // filter out whitespace-only nodes
        if (node.nodeValue.search(NONWHITESPACE_RE) === -1) {
          continue;
        }

        pairs.push({
          tag: element.tagName.toLowerCase(),
          // remove the `.trim()` to preserve leading & trailing whitespace
          text: node.nodeValue.trim()
        });
      }
    }

    console.log(pairs);
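
The ordering difference between the two approaches can be checked without a browser. What follows is a minimal sketch and not part of the original answer: it models one product entry from the sample markup as plain objects, and the names `tree`, `documentOrder` and `elementOrder` are hypothetical helpers invented for illustration. The first function mimics a TreeWalker-style walk over text nodes, the second mimics the per-element XPath loop.

```javascript
// Hypothetical miniature DOM: an element is { tag, children },
// and a text node is a plain string. This models
// <span><div>Dell Latitude D610</div>2 GHz Intel Pentium M</span>
const tree = {
  tag: "span",
  children: [
    { tag: "div", children: ["Dell Latitude D610"] },
    "2 GHz Intel Pentium M",
  ],
};

// TreeWalker-style: depth-first walk that emits each text node
// at the position where it occurs in the document.
function documentOrder(el, pairs = []) {
  for (const child of el.children) {
    if (typeof child === "string") {
      pairs.push([el.tag, child]);
    } else {
      documentOrder(child, pairs);
    }
  }
  return pairs;
}

// XPath-style: elements are visited in document order, and each
// element emits all of its direct text children before any
// nested element gets a turn.
function elementOrder(el, pairs = []) {
  for (const child of el.children) {
    if (typeof child === "string") {
      pairs.push([el.tag, child]);
    }
  }
  for (const child of el.children) {
    if (typeof child !== "string") {
      elementOrder(child, pairs);
    }
  }
  return pairs;
}

console.log(documentOrder(tree));
console.log(elementOrder(tree));
```

With this input, `documentOrder` lists the product name from the inner div before the spec line that follows it in the markup, while `elementOrder` lists the span's own text first. That is the reordering the XPath version introduces on mixed content.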