Scraping all p after h using rvest? (or other R package)

Question

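I'm new to the world of HTML scraping, and I'm having trouble pulling in the paragraphs under a particular heading using rvest in R.

I want to scrape information from multiple pages that all have a broadly similar layout. They all have the same headings, but the number of paragraphs under each heading can vary. I was able to scrape one specific paragraph under a heading with the following code: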

unitCode <- data.frame(unit = c('SLE010', 'SLE115', 'MAA103'))

html <- sapply(unitCode, function(x) paste("http://www.deakin.edu.au/current-students/courses/unit.php?unit=", 
                                          x,
                                          "&return_to=%2Fcurrent-students%2Fcourses%2Fcourse.php%3Fcourse%3DS323%26version%3D3", 
                                          sep = ''))
assessment <- html[3] %>%
              html() %>%
              html_nodes(xpath='//*[@id="main"]/div/div/p[3]') %>%
              html_text()


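The `xpath` argument pulls in the first paragraph under the Assessment heading. Some pages have several paragraphs under that heading, and I can get them by changing the `xpath` to name them specifically, e.g. p[4] or p[5]. Unfortunately, I want to iterate this process over hundreds of pages, so changing the xpath each time is not practical, and I won't even know how many paragraphs each page contains.

Given the uncertainty about the page layout, I think pulling all the <p> elements after the heading I'm interested in is the best option.

Is there a way to scrape all the <p> elements after <h3>Assessment</h3> using rvest or some other R scraping package?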

Answer 1

hrb*_*str 9

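I've expanded this out for demonstration purposes only. You should be able to apply it to your original code. It's really not a good idea to overwrite names from the namespaces you end up using (the question assigns to `html`, which is also the name of an rvest function). Also note that I'm using the latest (github/devtools) version of rvest, which uses xml2 and deprecates `html`.

The key is `xpath="//h3[contains(., 'Assessment')]/following-sibling::p"`, so: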

library(rvest)

unitCode <- data.frame(unit = c('SLE010', 'SLE115', 'MAA103'))

sites <- sapply(unitCode, function(x) paste("http://www.deakin.edu.au/current-students/courses/unit.php?unit=", 
                                          x,
                                          "&return_to=%2Fcurrent-students%2Fcourses%2Fcourse.php%3Fcourse%3DS323%26version%3D3", 
                                          sep = ''))

pg <- read_html(sites[1])
pg_2 <- read_html(sites[2])
pg_3 <- read_html(sites[3])

pg %>% html_nodes(xpath="//h3[contains(., 'Assessment')]/following-sibling::p")

## {xml_nodeset (2)}
## [1] <p>This unit is assessed on a pass/fail basis. Multiple-choice on-line test   ...
## [2] <p style="margin-top: 2em;">\n  <a href="/current-students/courses/course.php ...

pg_2 %>% html_nodes(xpath="//h3[contains(., 'Assessment')]/following-sibling::p")

## {xml_nodeset (3)}
## [1] <p>Mid-trimester test 20%, three assignments (3 x 10%) 30%, examination 50%.</p>
## [2] <p>* Rate for all CSP students, except for those who commenced Education and  ...
## [3] <p style="margin-top: 2em;">\n  <a href="/current-students/courses/course.php ...

pg_3 %>% html_nodes(xpath="//h3[contains(., 'Assessment')]/following-sibling::p")

## {xml_nodeset (6)}
## [1] <p>Assessment 1 (Group of 3 students) - Student video presentation (5-7 mins) ...
## [2] <p>Assessment 2 (Group of 3 students) - Business plan (3500-4000 words) - 30% ...
## [3] <p>Examination (2 hours) - 60%</p>
## [4] <p><a href="http://www.deakin.edu.au/glossary?result_1890_result_page=H" targ ...
## [5] <p>* Rate for all CSP students, except for those who commenced Education and  ...
## [6] <p style="margin-top: 2em;">\n  <a href="/current-students/courses/course.php ...

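To get plain text rather than nodes, the same XPath can be wrapped in a small function and followed by `html_text()`. The following is only a sketch: `sample_page` is a made-up stand-in for one unit page (the real pages would come from `read_html()` on a URL), and `assessment_text()` is a hypothetical helper name, not part of rvest.

```r
library(rvest)

# Made-up stand-in for one unit page; the real markup may differ.
sample_page <- read_html('
<div id="main">
  <h3>Content</h3>
  <p>Intro paragraph.</p>
  <h3>Assessment</h3>
  <p>Mid-trimester test 20%.</p>
  <p>Examination 80%.</p>
</div>')

# Hypothetical helper: text of every <p> sibling that follows
# the heading containing "Assessment".
assessment_text <- function(page) {
  page %>%
    html_nodes(xpath = "//h3[contains(., 'Assessment')]/following-sibling::p") %>%
    html_text()
}

result <- assessment_text(sample_page)
print(result)
```

Over many pages, something like `lapply(sites, function(u) assessment_text(read_html(u)))` should return one character vector per page. Note that `following-sibling::p` matches every later `<p>` sibling, including any that sit under subsequent headings, which is why the outputs above also contain the fee note and the trailing link paragraph.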

You could also use that `<p style="margin-top: 2em;">` as a marker to stop at. You should check out the help for xml2's `as_list`, too.
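One possible way to stop at the marker paragraph is to look at each matched node's `style` attribute and keep only the nodes before the first match. This is a sketch on made-up HTML, assuming the marker is the first following `<p>` whose `style` contains `margin-top`; the variable names are illustrative.

```r
library(rvest)

# Made-up stand-in for a unit page with a trailing marker paragraph.
page <- read_html('
<div id="main">
  <h3>Assessment</h3>
  <p>Examination (2 hours) - 60%</p>
  <p>* Fee note.</p>
  <p style="margin-top: 2em;"><a href="#">Back</a></p>
  <p>Footer text.</p>
</div>')

paras <- html_nodes(page, xpath = "//h3[contains(., 'Assessment')]/following-sibling::p")

# html_attr() returns NA for paragraphs without a style attribute.
style <- html_attr(paras, "style")
marker <- which(!is.na(style) & grepl("margin-top", style, fixed = TRUE))

# Keep only the paragraphs that come before the first marker, if one exists.
if (length(marker) > 0) {
  paras <- paras[seq_len(marker[1] - 1)]
}

kept <- html_text(paras)
print(kept)
```

If no marker paragraph is found, all following paragraphs are kept unchanged, so the check is safe to run on pages that lack the link.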
