XPath to select and concatenate all text nodes

Har*_*ish 5 xpath r web-scraping rvest

I am scraping data from a website whose markup looks like this:

    <div class="content">
      <blockquote>
        <div>
          Do not select this.
        </div>
        How do I select only this…
        <br />
        and this…
        <br />
        and this in a single node?
      </blockquote>
    </div>

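Assume a snippet like this appears 20 times on a single page. I want to get all the text inside <blockquote> but ignore everything inside child nodes, such as the inner div.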

So I use:

    html %>%
      html_nodes(xpath = "//*[@class='content']/blockquote/text()[normalize-space()]")

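However, this splits "How do I select only this…", "and this…", and "and this in a single node?" into separate elements of the resulting xml_nodeset.

What should I do to essentially concatenate all of these text nodes into one and get back the same 20 elements (or a single element, if this example were all I had)?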

And*_*son 5

You can try the XPath below to concatenate all the child text nodes:

    "string-join(//*[@class='content']/blockquote/text()[normalize-space()], ' ')"

The output is:

    How do I select only this… and this… and this in a single node?
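Note that string-join() is an XPath 2.0 function, while xml2 (which rvest builds on) evaluates XPath 1.0 through libxml2, so the expression above may not run when passed to html_nodes(). A minimal sketch of doing the join on the R side instead, one string per blockquote; the object name `page` and the inline sample markup are assumptions made for illustration:

```r
library(rvest)

# Sample markup from the question; `page` is a hypothetical name for the parsed document
page <- read_html('<div class="content">
  <blockquote>
    <div>
      Do not select this.
    </div>
    How do I select only this…
    <br />
    and this…
    <br />
    and this in a single node?
  </blockquote>
</div>')

# One node per blockquote, so 20 snippets on a page would give 20 nodes
quotes <- html_nodes(page, xpath = "//*[@class='content']/blockquote")

# For each blockquote, keep only its direct, non-blank text nodes,
# trim the surrounding whitespace, and join the pieces with a space
joined <- vapply(quotes, function(q) {
  parts <- html_text(html_nodes(q, xpath = "./text()[normalize-space()]"))
  paste(trimws(parts), collapse = " ")
}, character(1))

joined
```

Because the inner XPath starts from each blockquote (./text()), the result has the same length as the number of blockquotes, and the text of the nested div is never included.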


yif*_*yan 2

You can select the unwanted nodes with CSS or XPath and remove them with xml_remove():

    library(rvest)
    library(xml2)  # provides xml_remove()

    text <- '<div class="content">
      <blockquote>
        <div>
          Do not select this.
        </div>
        How do I select only this…
        <br />
        and this…
        <br />
        and this in a single node?
      </blockquote>
    </div>'

    myhtml <- read_html(text)

    # select the nodes you don't want
    do_not_select <- myhtml %>%
        html_nodes("blockquote>div") # using css

    # remove those nodes from the document
    xml_remove(do_not_select)

You can strip the whitespace and \n afterwards:

    # sample result
    myhtml %>%
        html_text()
    [1] "\n  \n    \n    How do I select only this…\n    \n    and this…\n    \n    and this in a single node?\n  \n"
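The cleanup step mentioned above could look something like the following sketch in base R. The variable names `raw` and `clean` are assumptions, and `raw` simply reproduces the sample result printed above:

```r
# Text as returned by html_text() in the sample result
raw <- "\n  \n    \n    How do I select only this…\n    \n    and this…\n    \n    and this in a single node?\n  \n"

# Collapse every run of whitespace (spaces and newlines) into one space,
# then drop the space left at the start and end
clean <- trimws(gsub("\\s+", " ", raw))

clean
```

If stringr is available, stringr::str_squish(raw) should do the same collapsing and trimming in a single call.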