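A few weeks ago, someone here helped me immensely in getting a list of all the links in the Notable Names Database (NNDB). I was able to run this code and get the following output: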
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
html_nodes(".newslink") %>%
html_attrs() %>%
map(pluck(1, 1))
all_ppl_urls <- map(
all_surname_urls,
function(x) read_html(x) %>%
html_nodes("a") %>%
html_attrs() %>%
map(pluck(1, 1))
) %>%
unlist()
all_ppl_urls <- setdiff(
all_ppl_urls[!duplicated(all_ppl_urls)],
c(all_surname_urls, "http://www.nndb.com/")
)
all_ppl_urls[1] %>%
read_html() %>%
html_nodes("p") %>%
html_text()
# [1] "AKA Lee William Aaker"
# [2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
# [3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
# [4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
# ...
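My original intent was to get a data frame that combines each person's name, gender, race, occupation, and nationality.
Many of the questions I have seen here and on other sites would help if the data were in an HTML table, but that is not the case for the Notable Names Database. I know that covering all 40,000 pages will involve a loop, but after a weekend of searching I still cannot find a solution. Can anyone assist?
Edited to add: I tried to follow some of the guidance here, but this request is a bit more complicated.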
## I tried to run list <- all_ppl_urls%>% map(read_html) but that was taking a LONG time so I decided to just get the first ten links for the sake of showing my example:
example <- head(all_ppl_urls, 10)
list <- example %>% map(read_html)
test <-list %>% map_df(~{
text_1 <- html_nodes(.x, 'p , b') %>% html_text
And I get this error: Error: In addition: Warning message: closing unused connection 3 (http://www.nndb.com/people/965/000279128/)
Update
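The function now includes an error routine for profiles that cannot be parsed properly. If any error occurs, you get a row of NAs, even if some of the information could have been parsed correctly. This is because we read all fields at once and rely on every field being readable.
Maybe you want to develop the code further so that it returns partial information? You can do that by reading the fields one after another (instead of all at once) and, if an error occurs, returning NA for that field rather than for the whole row. The drawback is that the document has to be parsed several times per profile instead of once.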
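The field-by-field idea could look roughly like the sketch below. It is only an illustration, not part of the original answer: scrape_field and scrape_profile_partial are hypothetical helper names, and the sketch takes an already parsed document (here a small inline HTML string that imitates the profile layout) instead of a URL, so it runs without network access. The document is read once and then queried once per field.

```r
library(rvest)
library(tibble)

# Hypothetical helper: look up a single field; NA if it is missing or fails.
scrape_field <- function(doc, label) {
  xp <- sprintf(
    "//b[contains(text(), '%s')]/following::text()[normalize-space()!=''][1]",
    label
  )
  tryCatch(
    # html_text() returns NA for a missing node, so absent fields become NA
    html_text(html_node(doc, xpath = xp), trim = TRUE),
    error = function(err) NA_character_
  )
}

# Hypothetical helper: query each field separately and build a one-row tibble.
scrape_profile_partial <- function(doc) {
  labels <- c(Gender = "Gender:", Race = "Race or Ethnicity:",
              Occupation = "Occupation:", Nationality = "Nationality:")
  as_tibble(lapply(labels, scrape_field, doc = doc))
}

# Made-up snippet in which two of the four fields are absent
doc <- read_html(
  "<html><body><p><b>Gender:</b> Male<br><b>Occupation:</b> <a href='#'>Actor</a></p></body></html>"
)
res <- scrape_profile_partial(doc)
```

With this layout, the fields that are present are filled in and the absent ones stay NA, instead of the whole row being lost.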
Here is a function that relies on XPath to select the relevant fields:
library(rvest)
library(glue)
library(tibble)
library(dplyr)
library(purrr)

scrape_profile <- function(url) {
  fields <- c("Gender:", "Race or Ethnicity:", "Occupation:", "Nationality:")
  # One contains() test per field label, joined with "or"
  filter <- glue("contains(text(), '{fields}')") %>%
    paste0(collapse = " or ")
  # First non-empty text node following each matching <b> tag
  xp_string <- glue("//b[{filter}]/following::text()[normalize-space()!=''][1]")
  tryCatch({
    doc <- read_html(url)
    # The name is the first bold text on the page
    name <- doc %>%
      html_node(xpath = "(//b/text())[1]") %>%
      as.character()
    doc %>%
      html_nodes(xpath = xp_string) %>%
      as.character() %>%
      gsub("^\\s|\\s$", "", .) %>%
      as.list() %>%
      setNames(c("Gender", "Race", "Occupation", "Nationality")) %>%
      as_tibble() %>%
      mutate(Name = name) %>%
      select(Name, everything())
  }, error = function(err) {
    message(glue("Profile <{url}> could not be parsed properly."))
    tibble(Name = ifelse(exists("name"), name, NA), Gender = NA,
           Race = NA, Occupation = NA,
           Nationality = NA)
  })
}

All you have to do now is apply scrape_profile to all of your profile URLs:
map_dfr(all_ppl_urls[1:5], scrape_profile)
# # A tibble: 5 x 5
#   Name                Gender Race  Occupation Nationality
#   <chr>               <chr>  <chr> <chr>      <chr>
# 1 Lee Aaker           Male   White Actor      United States
# 2 Aaliyah             Female Black Singer     United States
# 3 Alvar Aalto         Male   White Architect  Finland
# 4 Willie Aames        Male   White Actor      United States
# 5 Kjetil André Aamodt Male   White Skier      Norway

Explanation
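1. The field labels are in bold (<b> tags), sometimes together with link tags (<a>).
2. Elements can be targeted with either a css or an XPath selector. However, since we want to select text nodes, XPath seems to be the only (?) option.
3. The XPath //b[contains(text(), "Gender:")]/following::text()[normalize-space()!=''][1] selects
   - the first non-empty text node (::text()[normalize-space()!=''][1]) that is
   - following (/following)
   - a <b> tag (//b) that contains the text Gender: ([contains(text(), "Gender:")])
4. We use XPath's ability to match several elements at once, which avoids an explicit loop. To do so, we paste several contains(.) statements together, separated by or.
5. Finally, the values go into a tibble, with the name taken from the first bold (<b>) text.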
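To see the combined XPath at work without hitting the site, here is a minimal sketch on an inline HTML string. The markup is made up to imitate the profile layout (bold labels, one value wrapped in a link); it is an assumption for illustration, not a copy of an actual NNDB page.

```r
library(rvest)

# Made-up markup imitating the profile layout
html <- paste0(
  "<html><body>",
  "<p><b>Gender:</b> Male<br><b>Race or Ethnicity:</b> White<br>",
  "<b>Occupation:</b> <a href='#'>Actor</a></p>",
  "<p><b>Nationality:</b> United States</p>",
  "</body></html>"
)
doc <- read_html(html)

fields <- c("Gender:", "Race or Ethnicity:", "Occupation:", "Nationality:")

# One contains() test per label, joined with "or"
conds <- paste0("contains(text(), '", fields, "')", collapse = " or ")

# For every matching <b>, take the first non-empty text node that follows it
xp <- paste0("//b[", conds, "]/following::text()[normalize-space()!=''][1]")

values <- html_text(html_nodes(doc, xpath = xp), trim = TRUE)
print(values)
```

Note that the value for Occupation is found even though it sits inside an <a> tag, because the following axis walks through all later nodes in document order, not only siblings.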