Asked by Nec*_*net · Tags: xpath, scrapy, web-scraping, shadow-dom
I'm currently scraping a news site for article content, and while extracting the main body I keep running into pages where people have embedded tweets like the following:
I tested my XPath expressions with XPath Helper (a Chrome extension) to confirm I could reach the content, then added the expression to my Scrapy (Python) spider. However, the #shadow-root inside the element appears to sit outside the regular DOM, so the expression matches nothing. I'm looking for a way to extract content from these kinds of elements, preferably with XPath.
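The root of the problem can be shown without a browser: Scrapy parses only the static HTML in the response, and content attached at runtime via `attachShadow()` is simply not in that markup. A minimal stdlib sketch (the `<custom-host>` element and the markup are made up for illustration):

```python
from xml.etree import ElementTree as ET

# Static markup as a scraper receives it: the host element of a shadow
# tree is present, but the shadow content itself was never serialized.
html = ('<html><body>'
        '<custom-host/>'
        '<p>Article text</p>'
        '</body></html>')

root = ET.fromstring(html)
print(root.find('.//p').text)            # regular content is reachable
print(root.find('.//custom-host').text)  # None: shadow content is absent
```

The same applies to Scrapy's `response.xpath(...)`: any XPath that worked in the live browser DOM (where XPath Helper runs) finds nothing here, because the shadow tree only ever exists in the browser.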
One way to scrape a page containing a Shadow DOM with a tool that does not support the Shadow DOM API is to recursively iterate over the shadow DOM elements and replace them with their HTML:
// Returns the HTML of a given shadow DOM.
const getShadowDomHtml = (shadowRoot) => {
    let shadowHTML = '';
    for (let el of shadowRoot.childNodes) {
        shadowHTML += el.nodeValue || el.outerHTML;
    }
    return shadowHTML;
};

// Recursively replaces shadow DOMs with their HTML.
const replaceShadowDomsWithHtml = (rootElement) => {
    for (let el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
            replaceShadowDomsWithHtml(el.shadowRoot);
            el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
    }
};

replaceShadowDomsWithHtml(document.body);
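To see what the recursion above does, here is a toy Python model of the same flattening step. The dict-based node structure is invented for illustration (it is not a real DOM); `serialize` plays the role of `replaceShadowDomsWithHtml`, appending each host's shadow subtree into its serialized output:

```python
def serialize(node):
    """Serialize a toy node to HTML, inlining any shadow children --
    a Python analogue of the JS recursion above."""
    if isinstance(node, str):  # text node
        return node
    inner = ''.join(serialize(c) for c in node.get('children', []))
    # Mirror `el.innerHTML += getShadowDomHtml(el.shadowRoot)`:
    inner += ''.join(serialize(c) for c in node.get('shadow', []))
    return f"<{node['tag']}>{inner}</{node['tag']}>"

# A host element with no visible children whose shadow tree holds the
# tweet text (all names here are illustrative):
tree = {'tag': 'div', 'children': [
    {'tag': 'twitter-widget', 'children': [],
     'shadow': [{'tag': 'p', 'children': ['tweet text']}]},
]}
print(serialize(tree))
# <div><twitter-widget><p>tweet text</p></twitter-widget></div>
```

After this flattening, the tweet text sits in ordinary markup, so a plain XPath such as `//twitter-widget//p/text()` can reach it.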
If you are scraping with a full browser (Chrome with Puppeteer, PhantomJS, etc.), just inject this script into the page. It is important to execute it only after the whole page has rendered, because it can break the JS code of the shadow DOM components.
See the full article I wrote on this topic: https://kb.apify.com/tips-and-tricks/how-to-scrape-pages-with-shadow-dom
Viewed: 8633 times