I wrote a small downloader in R so that I can download some log files from a remote server in a single run:
file_remote <- fun_to_list_URLs()
file_local <- fun_to_gen_local_paths()
credentials <- "usr/pwd"
downloader <- function(file_remote, file_local, credentials) {
  data_bin <- RCurl::getBinaryURL(
    file_remote,
    userpwd = credentials,
    ftp.use.epsv = FALSE,
    forbid.reuse = TRUE
  )
  writeBin(data_bin, file_local)
}

purrr::walk2(
  file_remote,
  file_local,
  ~ downloader(
    file_remote = .x,
    file_local = .y,
    credentials = credentials
  )
)
This works, but it is slow, especially compared to FTP clients such as WinSCP: downloading 64 log files of 2 kB each takes several minutes.

Is there a faster way to download a large number of files in R?
The curl package has a way to perform asynchronous requests, which means that the downloads are performed simultaneously instead of one after another. Especially with smaller files this should give you a large performance boost. Here is a bare-bones function that does this (since version 5.0.0 the curl package ships a native version of this function, also called multi_download; see the closing note at the end of this answer):
# total_con: max total concurrent connections.
# host_con: max concurrent connections per host.
# print: print status of requests at the end.
multi_download <- function(file_remote,
                           file_local,
                           total_con = 1000L,
                           host_con = 1000L,
                           print = TRUE) {

  # check for duplication (deactivated for testing)
  # dups <- duplicated(file_remote) | duplicated(file_local)
  # file_remote <- file_remote[!dups]
  # file_local <- file_local[!dups]

  # create pool
  pool <- curl::new_pool(total_con = total_con,
                         host_con = host_con)

  # function performed on successful request
  save_download <- function(req) {
    writeBin(req$content, file_local[file_remote == req$url])
  }

  # setup async calls
  invisible(
    lapply(
      file_remote, function(f)
        curl::curl_fetch_multi(f, done = save_download, pool = pool)
    )
  )

  # all created requests are performed here
  out <- curl::multi_run(pool = pool)

  if (print) print(out)

}
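Your original call authenticates with userpwd, which the bare-bones function above does not handle. A minimal sketch of one way to add that, assuming libcurl's usual "user:password" format and creating one handle per request (reusing a single handle across concurrent requests is not safe):

# hypothetical extension: pass credentials to each async request via a curl handle
multi_download_auth <- function(file_remote, file_local, credentials,
                                total_con = 1000L, host_con = 1000L) {

  pool <- curl::new_pool(total_con = total_con, host_con = host_con)

  save_download <- function(req) {
    writeBin(req$content, file_local[file_remote == req$url])
  }

  invisible(
    lapply(file_remote, function(f) {
      # one handle per request; the option names mirror the RCurl options in the question
      h <- curl::new_handle(
        userpwd      = credentials,
        ftp_use_epsv = FALSE
      )
      curl::curl_fetch_multi(f, done = save_download, pool = pool, handle = h)
    })
  )

  curl::multi_run(pool = pool)
}

Whether this matches your FTP server's requirements depends on your setup; it is only meant to show where the credentials would go.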
Now we need some test files to compare this against your baseline approach. I use COVID data from the Johns Hopkins University GitHub page, since it consists of many small csv files that should be similar to yours (note that the daily report files there are named MM-DD-YYYY.csv, hence the "%m-%d-%Y" format):

file_remote <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/",
  format(seq(as.Date("2020-03-03"), as.Date("2022-06-01"), by = "day"), "%m-%d-%Y"),
  ".csv"
)
file_local <- paste0("/home/johannes/Downloads/test/", seq_along(file_remote), ".bin")

We could also infer the file names from the URLs, but I assume that is not what you want. Now let's compare the two approaches for these 821 files:
res <- bench::mark(
  baseline(),
  multi_download(file_remote,
                 file_local,
                 print = FALSE),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

summary(res)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                                 min median `itr/sec`
#>   <bch:expr>                                               <bch:> <bch:>     <dbl>
#> 1 baseline()                                                 2.8m   2.8m   0.00595
#> 2 multi_download(file_remote, file_local, print = FALSE)    12.7s  12.7s   0.0789 
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

summary(res, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                                 min median `itr/sec`
#>   <bch:expr>                                                <dbl>  <dbl>     <dbl>
#> 1 baseline()                                                 13.3   13.3       1  
#> 2 multi_download(file_remote, file_local, print = FALSE)      1      1        13.3
#> # … with 2 more variables: mem_alloc <dbl>, `gc/sec` <dbl>

The new approach is 13.3 times faster than the original one, and I would expect the difference to grow the more files you have. Note, though, that this benchmark is not perfect, as my internet speed fluctuates quite a bit.
The function could also be improved in terms of error handling (currently you get a message saying how many requests were successful and how many errored, but no indication of which files are affected). My understanding is also that multi_run keeps the downloaded files in memory before save_download writes them to disk. With small files that is fine, but it could become a problem with larger ones.
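As a rough sketch of the error-handling part: curl::curl_fetch_multi() also accepts a fail callback that receives the error message, so the "setup async calls" step inside the function above could record which URLs failed. The failed vector is my addition; pool and save_download are the objects already defined in the function:

# collect failed URLs alongside the successful downloads
failed <- character(0)

invisible(
  lapply(file_remote, function(f) {
    curl::curl_fetch_multi(
      f,
      done = save_download,
      # fail is called with the error message; remember which URL it belonged to
      fail = function(msg) failed <<- c(failed, paste0(f, ": ", msg)),
      pool = pool
    )
  })
)

out <- curl::multi_run(pool = pool)
if (length(failed) > 0) warning(length(failed), " downloads failed:\n",
                                paste(failed, collapse = "\n"))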
# baseline: the approach from the question, wrapped in a function for benchmarking
baseline <- function() {
  credentials <- "usr/pwd"
  downloader <- function(file_remote, file_local, credentials) {
    data_bin <- RCurl::getBinaryURL(
      file_remote,
      userpwd = credentials,
      ftp.use.epsv = FALSE,
      forbid.reuse = TRUE
    )
    writeBin(data_bin, file_local)
  }

  purrr::walk2(
    file_remote,
    file_local,
    ~ downloader(
      file_remote = .x,
      file_local = .y,
      credentials = credentials
    )
  )
}

Created on 2022-06-05 by the reprex package (v2.0.1)
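Closing note: as mentioned at the top, curl >= 5.0.0 ships a native curl::multi_download(), which streams responses to disk rather than keeping them in memory and returns a data frame with the status of each file, so it also covers the two caveats above. A sketch of what the call could look like; passing userpwd and ftp_use_epsv through ... as handle options is my reading of the docs, not something I have tested against your server:

# sketch: the built-in async downloader from curl >= 5.0.0
results <- curl::multi_download(
  urls      = file_remote,
  destfiles = file_local,
  userpwd   = "usr/pwd",     # extra arguments are passed on as handle options (assumption)
  ftp_use_epsv = FALSE
)

# results has one row per file (success, status code, url, destfile, ...)
results[!results$success, ]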