Identifying a web link shown in bold, in R

Agu*_*cho 8 html r httr rvest

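The script below lets me reach a website that lists several links with similar names. I want only one of them, which can be told apart from the others because it is shown in bold on the site. However, I can't find a way to pick the bold link out of the list.

Does anyone have any insight on this? Thanks in advance!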

library(httr)
library(rvest)
sp="Alnus japonica"

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = unlist(strsplit(as.character(sp), split=" "))[1], 
                          yearPublished ="", 
                          species = unlist(strsplit(as.character(sp), split=" "))[2], 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 
pg <- content(res, as="parsed") 
lnks <- html_attr(html_nodes(pg,"a"),"href")
# how do I get the URL of the link with the accepted name (in bold)?
res2 <- try(GET(sprintf("http://apps.kew.org%s", lnks[grep("id=", lnks)][1])), silent=TRUE)
# this gets a link, but often not the bold one

hrb*_*str 9

First, grab tidy-html5 (it's available for almost everything), install it, and make sure it's on your PATH.
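A quick way to confirm that R can actually see the binary is sketched below. It assumes the executable is named tidy, as in the system2() call later in this answer; Sys.which() returns an empty string when nothing by that name is on the PATH.

```r
# Look up the `tidy` executable on the PATH (assumed name: "tidy").
tidy_path <- Sys.which("tidy")

if (nzchar(tidy_path)) {
  message("Found tidy at: ", tidy_path)
} else {
  message("tidy is not on the PATH; install tidy-html5 or adjust PATH first.")
}
```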

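As my comment said, browsers cope with a <b> placed outside a <p> because they need to be bulletproof. libxml2 is not that forgiving. So we need to clean the HTML up first (I now need to create a new tidyhtml package), then work with the tidied version: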

library(xml2)
library(httr)
library(rvest)

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = "Alnus", 
                          yearPublished ="", 
                          species = "japonica", 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 

tf <- tempfile(fileext=".html")
cat(content(res, as="text"), file=tf)

# run the tidy binary quietly and capture the cleaned HTML from stdout
tidy <- system2("tidy", c("-q", tf), stdout=TRUE)

pg <- read_html(paste0(tidy, collapse=""))

html_nodes(pg, xpath=".//p/b/a[contains(@href, 'name_id')]")

## {xml_nodeset (1)}
## [1] <a href="/wcsp/namedetail.do?name_id=6471" class="onwa ...

If you want a CSS selector instead of XPath:

html_nodes(pg, "p > b > a[href*='name_id']")
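To get from the matched node to something you can pass to GET(), pull out the href attribute and prepend the host. The sketch below runs on a small hand-written snippet that only imitates the structure of the tidied page (the second link and its name_id=1111 are made up for illustration), so it works offline. With the real page, use pg from above in place of snippet.

```r
library(rvest)

# Hand-written stand-in for the tidied page: only the accepted name is
# wrapped in <b>. The second link (name_id=1111) is invented for illustration.
snippet <- read_html(paste0(
  '<div>',
  '<p><b><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav">accepted name</a></b></p>',
  '<p><a href="/wcsp/namedetail.do?name_id=1111" class="onwardnav">another name</a></p>',
  '</div>'
))

# Select only the links that sit inside <b>, then read their href attribute.
bold_href <- html_attr(
  html_nodes(snippet, "p > b > a[href*='name_id']"),
  "href"
)

# Build the absolute URL the same way the question does.
bold_url <- sprintf("http://apps.kew.org%s", bold_href)
bold_url
```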

UPDATE

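I started a basic pkg that wraps libtidy. If you're on OS X and use Homebrew, you can run brew install tidy-html5 (which installs the binary above and the libtidy library) and then devtools::install_github("hrbrmstr/tidyhtml") to install the pkg. Then it's just: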

library(xml2)
library(httr)
library(rvest)
library(htmltidy)

res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do", 
              body = list(page ="advancedSearch", 
                          AttachmentExist ="", 
                          family ="", 
                          placeOfPub ="", 
                          genus = "Alnus", 
                          yearPublished ="", 
                          species = "japonica", 
                          author ="", 
                          infraRank ="", 
                          infraEpithet ="", 
                          selectedLevel ="cont"), 
              encode ="form") 

tidy_html <- tidy(content(res, as="text"))

pg <- read_html(tidy_html)

html_nodes(pg, "p > b > a[href*='name_id']")

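I should be able to get this working on Windows and Linux and make it a real package (it's a thin wrapper with no error checking right now), but that will sit on the TODO list for a while.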


Mic*_*ico 1

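It looks to me like there may be a bug in rvest/httr here, since <b> appears to wrap the relevant <a href...> link in the page source, but not in the parsed version.

I used: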

library(rvest)
sp=strsplit("Alnus japonica", " ")[[1]]

session <- html_session("http://apps.kew.org/wcsp/advsearch.do")
form <- html_form(session)[[1]]

filled_form <- set_values(form, genus = sp[1], species = sp[2])

out <- submit_form(session, filled_form)

Look at the following:

out %>% html_nodes(xpath = "descendant-or-self::*") %>% `[`(81:90)
# {xml_nodeset (10)}
#  [1] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
#  [2] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
#  [3] <i>Alnus</i>
#  [4] <i> japonica</i>
#  [5] <b>\n        </b>
#  [6] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
#  [7] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
#  [8] <i>Alnus</i>
#  [9] <i> japonica</i>
# [10] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...

As you can see, the <b> node comes out empty. However, when I run the search manually in Chrome and use View Source, I see:

<b>
    <p><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav"><i>Alnus</i><i> japonica</i> (Thunb.) Steud., Nomencl. Bot., ed. 2, 1: 55 (1840).</a>
    </p>
</b>

The relevant <a> sitting between <b> and </b> tells me it should be a child of that <b>, but this comes up blank:

out %>% html_nodes(xpath = "//b/child::*")

I admit I'm no XPath expert, so I may be getting something wrong. Hopefully this helps get you on your way.
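If cleaning the HTML first is not an option, one possible workaround is to lean on the broken structure itself. The node listing above suggests the parser closes <b> early and leaves the accepted-name <p> right after it, so an XPath that starts at <b> and steps to the next sibling <p> might pick up the right link. The sketch below uses hand-written markup that imitates that parsed tree; the layout is an assumption, and the name_id=1111 and name_id=2222 links are made up. It has not been checked against the live site, and other <b> elements on the real page could produce extra matches.

```r
library(rvest)

# Hand-written imitation of the tree as parsed: an empty <b> followed by
# the <p> that the browser would show inside it.
parsed <- read_html(paste0(
  '<div>',
  '<p><a href="/wcsp/namedetail.do?name_id=1111">other name</a></p>',
  '<b>\n        </b>',
  '<p><a href="/wcsp/namedetail.do?name_id=6471">accepted name</a></p>',
  '<p><a href="/wcsp/namedetail.do?name_id=2222">other name</a></p>',
  '</div>'
))

# From each <b>, step to the first following sibling <p> and take its link.
accepted <- html_nodes(
  parsed,
  xpath = "//b/following-sibling::p[1]/a[contains(@href, 'name_id')]"
)
accepted_href <- html_attr(accepted, "href")
accepted_href
```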