Tags: r, css-selectors, web-scraping, rvest
My goal is to use the tm toolkit (library(tm)) on a fairly large Word document. The Word document is sensibly formatted, so we have h1 for the main sections plus some h2 and h3 subheadings. I want to compare and text-mine each section (the text under each h1; the subheadings are not important, so they can be included or excluded).
My strategy is to export the Word document to HTML and then extract the paragraphs with the package rvest.
library(rvest)
# the file has latin-1 chars
#Sys.setlocale(category="LC_ALL", locale="da_DK.UTF-8")
# small example html file
file <- rvest::html("https://83ae1009d5b31624828197160f04b932625a6af5.googledrive.com/host/0B9YtZi1ZH4VlaVVCTGlwV3ZqcWM/tidy.html", encoding = 'utf-8')
nodes <- file %>%
rvest::html_nodes("h1>p") %>%
rvest::html_text()
I can extract all of the <p> elements with html_nodes("p"), but that just gives me one big soup. I need to analyse each h1 section separately.
Ideally I would end up with a list containing a vector of p tags for each h1 heading. Perhaps a loop along the lines of for (i in 1:length(html_nodes(fil, "h1"))) (html_children(html_nodes(fil, "h1")[i])) (which does not work).
Bonus points if there is a way to tidy the Word HTML from within rvest.
Note that > is the child combinator; the selector that you currently have looks for p elements that are children of an h1, which doesn't make sense in HTML and so returns nothing.
If you inspect the generated markup, at least in the example document that you've provided, you'll notice that every h1 element (as well as the heading for the table of contents, which is marked up as a p instead) has an associated parent div:
<body lang="EN-US">
<div class="WordSection1">
<p class="MsoTocHeading"><span lang="DA" class='c1'>Indholdsfortegnelse</span></p>
...
</div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
<div class="WordSection2">
<h1><a name="_Toc285441761"><span lang="DA">Interview med Jakob skoleleder på
a_skolen</span></a></h1>
...
</div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
<div class="WordSection3">
<h1><a name="_Toc285441762"><span lang="DA">Interviewet med Andreas skoleleder på
b_skolen</span></a></h1>
...
</div>
</body>
All of the p elements in each section denoted by an h1 are found in its respective parent div. With this in mind, you could simply select p elements that are siblings of each h1. However, since rvest doesn't currently have a way to select siblings from a context node (html_nodes() only supports looking at a node's subtree, i.e. its descendants), you will need to do this another way.
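As a hedged aside, not part of the original approach: newer rvest versions are built on xml2, where an XPath expression can be evaluated relative to a context node, so the sibling paragraphs of each h1 could be reached with the following-sibling axis. A minimal sketch, assuming such a version and the file object read above:
# Sketch: assumes an xml2-backed rvest where html_nodes() accepts an
# xpath argument evaluated relative to the node it is called on.
headings <- html_nodes(file, "h1")
sibling_paras <- lapply(headings, function(h1) {
  # all <p> elements that follow this <h1> within the same parent <div>
  html_text(html_nodes(h1, xpath = "following-sibling::p"))
})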
Assuming HTML Tidy creates a structure where every h1 is in a div that is directly within body, you can grab every div except the table of contents using the following selector:
sections <- html_nodes(file, "body > div ~ div")
In your example document, this should result in div.WordSection2 and div.WordSection3. The table of contents is represented by div.WordSection1, and that is excluded from the selection.
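If you want to sanity-check which divs were matched, you could inspect their class attributes (html_attr() returns one value per node in the set):
# Quick check of the matched sections' class attributes
html_attr(sections, "class")
# For the example document this should give:
# [1] "WordSection2" "WordSection3"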
Then extract the paragraphs from each div:
for (section in sections) {
paras <- html_nodes(section, "p")
# Do stuff with paragraphs in each section...
print(length(paras))
}
# [1] 9
# [1] 8
As you can see, length(paras) corresponds to the number of p elements in each div. Note that some of them contain nothing but a non-breaking space (&nbsp;), which may be troublesome depending on your needs. I'll leave dealing with those outliers as an exercise to the reader.
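To get closer to the structure asked for in the question — one vector of paragraph texts per h1 — the paragraphs could be collected into a named list. A sketch, assuming each div holds exactly one h1:
# One character vector of paragraph texts per section,
# named by that section's h1 heading (assumes one h1 per div)
section_texts <- lapply(sections, function(section) {
  html_text(html_nodes(section, "p"))
})
names(section_texts) <- vapply(sections, function(section) {
  html_text(html_node(section, "h1"))
}, character(1))
# Each element could then be handed to tm, e.g.:
# corpus <- tm::Corpus(tm::VectorSource(section_texts[[1]]))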
Unfortunately, no bonus points, since rvest does not provide its own HTML Tidy functionality. You will have to deal with the Word document separately.
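For what it's worth, here is a hedged sketch of one way to tidy the exported HTML before rvest ever sees it: call the external HTML Tidy tool from R. This assumes the tidy command-line program is installed and on your PATH; word_export.html is a placeholder name for your exported document.
# Run HTML Tidy on the Word export (assumes `tidy` is on the PATH;
# the file names here are placeholders)
system2("tidy", args = c("-q", "-asxhtml", "-utf8",
                         "-o", "tidy.html", "word_export.html"))
file <- rvest::html("tidy.html", encoding = "utf-8")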