Faster way to download multiple files in R

Tia*_*Qin 5 r rcurl r-download.file

I wrote a small downloader in R so I can fetch a batch of log files from a remote server in one run:

file_remote <- fun_to_list_URLs()
file_local <- fun_to_gen_local_paths()
credentials <- "usr/pwd"

downloader <- function(file_remote, file_local, credentials) {
  data_bin <- RCurl::getBinaryURL(
    file_remote,
    userpwd = credentials,
    ftp.use.epsv = FALSE,
    forbid.reuse = TRUE
  )
  
  writeBin(data_bin, file_local)
}
  
purrr::walk2(
  file_remote,
  file_local,
  ~ downloader(
    file_remote = .x,
    file_local = .y,
    credentials = credentials
  )
)

This works, but it is slow, especially compared to an FTP client like WinSCP: downloading 64 log files of 2 kB each takes several minutes.

Is there a faster way to download a large number of files in R?

JBG*_*ber 8

The curl package has a way of performing asynchronous requests, which means that downloads are performed simultaneously instead of one after another. Especially with smaller files this should give you a big boost in performance. Here is a barebones function that does this (since version 5.0.0 the curl package ships a native version of this function, also called multi_download; a sketch of it follows the function below):

# total_con: max total concurrent connections.
# host_con: max concurrent connections per host.
# print: print status of requests at the end.
multi_download <- function(file_remote, 
                           file_local,
                           total_con = 1000L, 
                           host_con  = 1000L,
                           print = TRUE) {
  
  # check for duplication (deactivated for testing)
  # dups <- duplicated(file_remote) | duplicated(file_local)
  # file_remote <- file_remote[!dups]
  # file_local <- file_local[!dups]
  
  # create pool
  pool <- curl::new_pool(total_con = total_con,
                         host_con = host_con)
  
  # function performed on successful request
  save_download <- function(req) {
    writeBin(req$content, file_local[file_remote == req$url])
  }
  
  # setup async calls
  invisible(
    lapply(
      file_remote, function(f) 
        curl::curl_fetch_multi(f, done = save_download, pool = pool)
    )
  )
  
  # all created requests are performed here
  out <- curl::multi_run(pool = pool)
  
  if (print) print(out)
  
}
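If you are on curl >= 5.0.0, you can also use the native implementation mentioned above instead of the hand-rolled version. A minimal sketch, assuming the same file_remote and file_local vectors as in the rest of this answer:

# curl >= 5.0.0: performs the downloads concurrently and writes straight to disk
results <- curl::multi_download(file_remote, file_local)

# returns a data frame with one row per file (url, destfile, status_code, success, ...)
results[!results$success, ]  # inspect failed downloads, if any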

Now we need some test files to compare it with your baseline approach. I use the covid data from Johns Hopkins University's GitHub page, as it contains many small csv files that should be similar to your files.

file_remote <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/",
  # the JHU daily reports are named month-day-year, hence "%m-%d-%Y"
  format(seq(as.Date("2020-03-03"), as.Date("2022-06-01"), by = "day"), "%m-%d-%Y"),
  ".csv"
)
file_local <- paste0("/home/johannes/Downloads/test/", seq_along(file_remote), ".bin")
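As an aside, the local names could also be derived from the URLs themselves; a minimal base R sketch, reusing the directory from above:

# derive local file names from the last path component of each URL
file_local <- file.path("/home/johannes/Downloads/test", basename(file_remote))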

But I assume inferring the file names from the URLs like this is not what you want. Now let's compare the two approaches for these 821 files:

res <- bench::mark(
  baseline(),
  multi_download(file_remote, 
                 file_local,
                 print = FALSE),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
summary(res)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                                min median `itr/sec`
#>   <bch:expr>                                             <bch:> <bch:>     <dbl>
#> 1 baseline()                                               2.8m   2.8m   0.00595
#> 2 multi_download(file_remote, file_local, print = FALSE)  12.7s  12.7s   0.0789 
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
summary(res, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                               min median `itr/sec`
#>   <bch:expr>                                             <dbl>  <dbl>     <dbl>
#> 1 baseline()                                              13.3   13.3       1  
#> 2 multi_download(file_remote, file_local, print = FALSE)   1      1        13.3
#> # … with 2 more variables: mem_alloc <dbl>, `gc/sec` <dbl>

The new approach is 13.3 times faster than the original one, and I would assume the difference grows with the number of files. Note, though, that this benchmark is not perfect, as my internet speed fluctuates quite a bit.


The function should also be improved in its handling of errors (currently you get a message reporting how many requests succeeded and how many errored, but no indication of which files failed; a rough sketch of one way to address this follows below). My understanding is also that multi_run holds the files in memory until save_download writes them to disk. With small files this is fine, but it may become a problem with larger ones.
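As a rough sketch of what such error handling could look like, replacing the corresponding pieces inside multi_download above (the failures environment, the status check, and the fail callback are my additions, not part of the original function):

# my addition: collects error messages keyed by URL
failures <- new.env()

# only write responses that actually succeeded
save_download <- function(req) {
  if (req$status_code == 200) {
    writeBin(req$content, file_local[file_remote == req$url])
  } else {
    assign(req$url, paste("HTTP status", req$status_code), envir = failures)
  }
}

invisible(
  lapply(
    file_remote, function(f)
      curl::curl_fetch_multi(
        f,
        done = save_download,
        # the fail callback closes over `f`, so we know which URL errored
        fail = function(msg) assign(f, msg, envir = failures),
        pool = pool
      )
  )
)

curl::multi_run(pool = pool)
as.list(failures)  # which files failed, and why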


Baseline function

baseline <- function() {
  credentials <- "usr/pwd"
  downloader <- function(file_remote, file_local, credentials) {
    data_bin <- RCurl::getBinaryURL(
      file_remote,
      userpwd = credentials,
      ftp.use.epsv = FALSE,
      forbid.reuse = TRUE
    )
    writeBin(data_bin, file_local)
  }
  
  purrr::walk2(
    file_remote,
    file_local,
    ~ downloader(
      file_remote = .x,
      file_local = .y,
      credentials = credentials
    )
  )
}

Created on 2022-06-05 by the reprex package (v2.0.1)
