URL decoding characters that take more than one % escape

swo*_*olf 4 r urldecode

This should be an easy one.

Say I have this string in R:

    a <- "%C3%B6sterlich"

which stands for:

    österlich (German for "relating to Easter")

But if I call `URLdecode(a)`, I get:

    [1] "Ã¶sterlich"

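That makes a certain kind of sense: read one byte at a time in a single-byte encoding such as Latin-1 (not strictly ASCII, which stops at 0x7F), %C3 is Ã and %B6 is ¶. But as you can see at http://www.backbone.se/urlencodingUTF8.htm, %C3%B6 stands for ö in UTF-8.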
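
The two readings of the same bytes can be reproduced directly. The sketch below is only an illustration: it builds the two bytes behind %C3%B6 by hand and interprets them first as Latin-1, then as UTF-8. The variable names are made up for the example.

```r
# The two bytes that %C3%B6 decodes to
bytes <- as.raw(c(0xC3, 0xB6))
s <- rawToChar(bytes)

# Reading each byte as its own Latin-1 character gives two characters
as_latin1 <- iconv(s, from = "latin1", to = "UTF-8")
nchar(as_latin1)   # 2

# Declaring the same bytes to be UTF-8 gives a single character
as_utf8 <- s
Encoding(as_utf8) <- "UTF-8"
nchar(as_utf8)     # 1
```

So the decoded bytes are the same in both cases; the difference is only in how they are interpreted afterwards.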

So the question is: how do I tell `URLdecode()` to use the UTF-8 table?

Del*_*eet 5

I finally found a solution to this problem. Here is my use case and what I tried.

These strings were scraped from Wikipedia with rvest, so the data itself should be fine. All of them contain %, but not all of them cause problems.

    # problem strings
    problem_strs = c("Roscoe_%22Fatty%22_Arbuckle", "Michael_%22Atters%22_Attree",
      "J%C3%BCrgen_Becker", "Vicco_von_B%C3%BClow", "B%C3%BClent_Ceylan",
      "Se%C3%A1n_Cullen", "Chris_D%27Elia", "U%C4%9Fur_R%C4%B1fat_Karlova",
      "Mike_Kr%C3%BCger", "Andr%C3%A9s_L%C3%B3pez_Forero", "Mo%27Nique",
      "Jos%C3%A9_S%C3%A1nchez_Mota", "Dara_%C3%93_Briain", "Conan_O%27Brien",
      "Mike_O%27Brien_(actor)", "Carroll_O%27Connor", "Donald_O%27Connor",
      "Rosie_O%27Donnell", "Michael_O%27Donoghue", "Chris_O%27Dowd",
      "Ardal_O%27Hanlon", "Catherine_O%27Hara", "Patrice_O%27Neal",
      "Barunka_O%27Shaughnessy", "Raven-Symon%C3%A9", "Charles_%22Chic%22_Sale",
      "No%C3%ABl_Wells", "%22Weird_Al%22_Yankovic", "Cem_Y%C4%B1lmaz"
    )

First, try the base R solution. For some reason it is not vectorized, so we use purrr:

    # utils::URLdecode
    library(magrittr)  # provides %>%
    problem_strs %>% purrr::map_chr(utils::URLdecode)

     [1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "JÃ¼rgen_Becker"           "Vicco_von_BÃ¼low"
     [5] "BÃ¼lent_Ceylan"            "SeÃ¡n_Cullen"              "Chris_D'Elia"             "UÄŸur_RÄ±fat_Karlova"
     [9] "Mike_KrÃ¼ger"              "AndrÃ©s_LÃ³pez_Forero"     "Mo'Nique"                 "JosÃ©_SÃ¡nchez_Mota"
    [13] "Dara_Ã“_Briain"            "Conan_O'Brien"             "Mike_O'Brien_(actor)"     "Carroll_O'Connor"
    [17] "Donald_O'Connor"           "Rosie_O'Donnell"           "Michael_O'Donoghue"       "Chris_O'Dowd"
    [21] "Ardal_O'Hanlon"            "Catherine_O'Hara"          "Patrice_O'Neal"           "Barunka_O'Shaughnessy"
    [25] "Raven-SymonÃ©"             "Charles_\"Chic\"_Sale"     "NoÃ«l_Wells"              "\"Weird_Al\"_Yankovic"
    [29] "Cem_YÄ±lmaz"
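
If you would rather not pull in purrr just for this, base R's `vapply()` does the same job. This is a sketch of an alternative, not what was run above; `decode_each` is a name invented for the example.

```r
# Apply utils::URLdecode() to each element and return a plain,
# unnamed character vector
decode_each <- function(x) {
  vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

decode_each(c("Mo%27Nique", "Chris_D%27Elia"))
```

It calls the same decoder, so strings with multi-byte escapes come out with the same problem; it only replaces the purrr dependency.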

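If we compare these with the originals, a pattern emerges: the strings with two % escapes in a row (that is, a multi-byte UTF-8 character) are the ones that break, while single escapes such as %22 and %27 decode fine. So I read all the questions related to URL decoding in R and found these suggested solutions:

    # urltools::url_decode
    urltools::url_decode(problem_strs)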

Same result as before.
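
The pattern can be stated a little more precisely: the escapes that go wrong are the ones at %80 and above, which in UTF-8 only ever occur as part of a multi-byte sequence. A quick way to flag such strings, assuming the input is UTF-8 percent-encoded (the helper name `has_high_escape` is made up for this sketch):

```r
# TRUE where a string contains at least one escape in the range %80-%FF
has_high_escape <- function(x) {
  grepl("%[89A-Fa-f][0-9A-Fa-f]", x)
}

has_high_escape(c("Mo%27Nique", "J%C3%BCrgen_Becker", "%22Weird_Al%22_Yankovic"))
```

Only the second string is flagged; the other two contain nothing but ASCII escapes.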

What is the encoding? Try setting it to UTF-8:

    > Encoding(problem_strs)
     [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    [13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    [25] "unknown" "unknown" "unknown" "unknown" "unknown"
    > # try to set
    > Encoding(problem_strs) = "UTF-8"
    > Encoding(problem_strs)
     [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    [13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    [25] "unknown" "unknown" "unknown" "unknown" "unknown"
    > Encoding(problem_strs) = "utf8"
    > Encoding(problem_strs)
     [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    [13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    [25] "unknown" "unknown" "unknown" "unknown" "unknown"
    > urltools::url_decode(problem_strs)

The output is the same as before.
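
The mark does not stick because, before decoding, every string is pure ASCII, and R never marks ASCII-only strings with a declared encoding. A small illustration (the variables are made up for the example):

```r
# Still percent-encoded: ASCII only, so the mark is silently dropped
x <- "J%C3%BCrgen"
Encoding(x) <- "UTF-8"
Encoding(x)   # "unknown"

# Same prefix with the two raw bytes in place: now the mark is kept
y <- rawToChar(as.raw(c(0x4A, 0xC3, 0xBC, 0x72)))
Encoding(y) <- "UTF-8"
Encoding(y)   # "UTF-8"
```

This suggests the mark has to be set after decoding, once the non-ASCII bytes actually exist in the string.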

Someone suggested another way to check and set it:

    > problem_strs = iconv(problem_strs, from = "ASCII", to = "UTF-8")
    > Encoding(problem_strs)
     [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    [13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
    [25] "unknown" "unknown" "unknown" "unknown" "unknown"

I found another package in a list:

    > # Ruchardet to detect?
    > Ruchardet::detectEncoding(problem_strs)
     [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

    # Is it simpler than we thought?
    urltools::url_decode(problem_strs) %>% urltools::url_decode()

Same output.

So I googled the specific pattern causing the problem, e.g. `%C3%BC`. That turned up half an answer, written for PHP:

> First you need to urldecode it, which gives you Ã¼, the UTF-8 encoded representation of ü, so you should be all fine.

OK, let's try that in R:

    # url decode, then set utf
    halfway = urltools::url_decode(problem_strs)
    Encoding(halfway) = "UTF-8"
    halfway
     [1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker"            "Vicco_von_Bülow"
     [5] "Bülent_Ceylan"             "Seán_Cullen"               "Chris_D'Elia"             "Uğur_Rıfat_Karlova"
     [9] "Mike_Krüger"               "Andrés_López_Forero"       "Mo'Nique"                 "José_Sánchez_Mota"
    [13] "Dara_Ó_Briain"             "Conan_O'Brien"             "Mike_O'Brien_(actor)"     "Carroll_O'Connor"
    [17] "Donald_O'Connor"           "Rosie_O'Donnell"           "Michael_O'Donoghue"       "Chris_O'Dowd"
    [21] "Ardal_O'Hanlon"            "Catherine_O'Hara"          "Patrice_O'Neal"           "Barunka_O'Shaughnessy"
    [25] "Raven-Symoné"              "Charles_\"Chic\"_Sale"     "Noël_Wells"               "\"Weird_Al\"_Yankovic"
    [29] "Cem_Yılmaz"

Here it is as a reusable function:

    url_decode_utf = function(x) {
      # decode the percent escapes into raw bytes
      y = urltools::url_decode(x)
      # declare those bytes to be UTF-8
      Encoding(y) = "UTF-8"
      y
    }
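
For anyone without urltools installed, the same idea seems to carry over to base R, since the fix is the encoding mark rather than the decoder. The variant below is a sketch under that assumption; `url_decode_utf8_base` is a hypothetical name, and it expects the input to be UTF-8 percent-encoded.

```r
url_decode_utf8_base <- function(x) {
  # Apply URLdecode() element by element, in case it is not vectorized
  y <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # Declare the decoded bytes to be UTF-8 (this marks, it does not convert)
  Encoding(y) <- "UTF-8"
  y
}

url_decode_utf8_base(c("J%C3%BCrgen_Becker", "Mo%27Nique"))
```

Because `Encoding<-` only declares an encoding, input that was percent-encoded from something other than UTF-8 would be mislabeled rather than fixed.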