R - Passing a vector to a custom function in dplyr::mutate

Tim*_* S. 3 r dplyr

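I have the following function, which lets me scrape the content of a Wikipedia page from its URL (the exact content is irrelevant to this question):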

getPageContent <- function(url) {

        library(rvest)
        library(magrittr)

        # Note: html() is the older rvest name; newer releases use read_html()
        pc <- html(url) %>% 
                html_node("#mw-content-text") %>% 
                # strip tags
                html_text() %>%
                # concatenate vector of texts into one string
                paste(collapse = "")

        pc
}

This works when I call the function on a single URL.

getPageContent("https://en.wikipedia.org/wiki/Balance_(game_design)")

[1] "In game design, balance is the concept and the practice of tuning a game's rules, usually with the goal of preventing any of its component systems from being ineffective or otherwise undesirable when compared to their peers. An unbalanced system represents wasted development resources at the very least, and at worst can undermine the game's entire ruleset by making impo (...)

However, if I pass the function to dplyr to fetch the content of several pages, I get an error:

example <- data.frame(url = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
                              "https://en.wikipedia.org/wiki/Koncerthuset",
                              "https://en.wikipedia.org/wiki/Tifama_chera",
                              "https://en.wikipedia.org/wiki/Difference_theory"),
                      stringsAsFactors = FALSE
                      )

library(dplyr)
example <- mutate(example, content = getPageContent(url))

Error: length(url) == 1 is not TRUE
In addition: Warning message:
In mutate_impl(.data, dots) :
  the condition has length > 1 and only the first element will be used

Looking at the error, I think the problem is that getPageContent cannot handle a vector of URLs, but I don't know how to fix it.

++++

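Edit: Both proposed solutions, 1) using rowwise() and 2) using sapply(), work well. In a test with 10 random Wikipedia articles, the second approach (timed here as unlist(lapply(...))) was about 25% faster: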

> system.time(
+         example <- example %>% 
+                 rowwise() %>% 
+                 mutate(content = getPageContent(url)) 
+ )
       user      system     elapsed 
       0.39        0.14        1.21 
> 
> 
> system.time(
+         example$content <- unlist(lapply(example$url, getPageContent))
+ )
       user      system     elapsed 
       0.49        0.11        0.90 
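
The apply-style approach can also be wrapped so that the function itself accepts a vector of URLs. The following is only a minimal sketch, not code from the original answers: `fetch_one` is a hypothetical offline stand-in for getPageContent (like the original it accepts exactly one URL, but it does no network access), and `fetch_many` is an assumed name for the wrapper.

```r
# Hypothetical stand-in for getPageContent(): accepts exactly one URL
# and returns one string, without touching the network.
fetch_one <- function(url) {
  stopifnot(length(url) == 1)
  paste("content of", basename(url))
}

# Wrapper that applies the scalar function to every element.
# vapply() checks that each call returns a single character value.
fetch_many <- function(urls) {
  vapply(urls, fetch_one, character(1), USE.NAMES = FALSE)
}

urls <- c("https://en.wikipedia.org/wiki/Koncerthuset",
          "https://en.wikipedia.org/wiki/Tifama_chera")

contents <- fetch_many(urls)
```

Because such a wrapper returns one value per input element, it could be called directly inside mutate(), for example mutate(example, content = fetch_many(url)), without rowwise().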

akr*_*run 10

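You can use rowwise(), which makes mutate() evaluate getPageContent() once per row, so the function receives a single URL each time: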

 res <- example %>% 
             rowwise() %>% 
             mutate(content=getPageContent(url))
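
To see why this helps, here is a small offline sketch. It assumes a hypothetical function `fetch_one` in place of getPageContent, with the same length-one restriction but no network access, and a shortened two-row version of the example data frame.

```r
library(dplyr)

# Hypothetical scalar-only function standing in for getPageContent().
fetch_one <- function(url) {
  stopifnot(length(url) == 1)
  paste("content of", basename(url))
}

example <- data.frame(
  url = c("https://en.wikipedia.org/wiki/Koncerthuset",
          "https://en.wikipedia.org/wiki/Difference_theory"),
  stringsAsFactors = FALSE
)

# Without rowwise(), mutate() hands the whole column to the function,
# so `url` has length 2 here and the length check fails.
failed <- inherits(
  try(mutate(example, content = fetch_one(url)), silent = TRUE),
  "try-error"
)

# With rowwise(), the expression is evaluated once per row.
res <- example %>%
  rowwise() %>%
  mutate(content = fetch_one(url)) %>%
  ungroup()  # drop the row-wise grouping afterwards
```

The trailing ungroup() is optional; it is included here so that later verbs operate on the whole table again rather than row by row.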