How to split merged/glued words without a delimiter using R

Question

Zaw*_*min 0 r text-mining gsub strsplit rvest

I am using rvest in R to scrape the keywords from this article page with the following code:

#install.packages("xml2") # required for rvest
library("rvest") # for web scraping
library("dplyr") # for data management

# Start by reading the web page to be scraped
page <- read_html("https://www.sciencedirect.com/science/article/pii/S1877042810004568")
keyW <- page %>% html_nodes("div.Keywords.u-font-serif") %>% html_text() %>% paste(collapse = ",")

This gives me:

> keyW    
[1] "KeywordsPhysics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

I then remove the word "Keywords" and everything before it from the string with the following line of code:

keyW <- gsub(".*Keywords","", keyW)

The new keyW is:

[1] "Physics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

However, the output I want is this list:

[1] "Physics curriculum" "Turkish education system" "finnish education system" "PISA" "physics achievement"

How should I approach this? I think it comes down to:

How to scrape the keywords from the website correctly
How to split the string correctly

Thanks
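
On the second point, one common heuristic (not taken from this thread) is to split wherever a lowercase letter is immediately followed by an uppercase one. The sketch below applies it to the string from the question; the "|" marker is an arbitrary choice and assumes that character never occurs in a keyword. It is meant only to show how far string splitting alone can get.

```r
# The glued string from the question, after "Keywords" has been stripped
keyW <- "Physics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

# Insert a marker at every lowercase-to-uppercase boundary, then split on it.
# The "|" marker is an assumption: it must not appear in the keywords themselves.
marked <- gsub("([a-z])([A-Z])", "\\1|\\2", keyW)
parts  <- strsplit(marked, "|", fixed = TRUE)[[1]]

print(parts)
```

This yields three pieces instead of the five wanted: "systemfinnish" and "PISAphysics" contain no lowercase-to-uppercase boundary, so the case pattern carries no signal there. That is why fixing the scraping step is more reliable than repairing the glued string afterwards.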

Answer 1

Ron*_*hah 5

If you select the span tags to extract the words, you get the expected output directly.

library(rvest)
page %>%  html_nodes("div.Keywords span") %>% html_text()

#[1] "Physics curriculum"       "Turkish education system" "finnish education system"
#[4] "PISA"                     "physics achievement"
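
The difference between the two selectors can be reproduced offline. The following is a minimal sketch on a hypothetical, simplified HTML fragment; the real page's markup is an assumption here and may differ. Calling html_text() on the container concatenates the text of all its descendants, while selecting each span returns one string per keyword.

```r
library(rvest)

# Hypothetical markup for illustration only; not copied from the actual page
html <- paste0(
  '<div class="Keywords u-font-serif"><h2>Keywords</h2>',
  '<span>Physics curriculum</span><span>PISA</span>',
  '<span>physics achievement</span></div>'
)
doc <- read_html(html)

# Text of the whole container: heading and keywords run together
glued <- doc %>% html_nodes("div.Keywords") %>% html_text()

# Text of each span: one element per keyword, no splitting needed
separate <- doc %>% html_nodes("div.Keywords span") %>% html_text()

print(glued)
print(separate)
```

In current rvest releases html_elements() is the preferred name for html_nodes(); both accept the same CSS selector.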
