This should be an easy one.

Suppose I have this string in R:

```r
a <- "%C3%B6sterlich"
```
This means:

österlich (German for "relating to Easter")
But if I run `URLdecode(a)`, I get:

```
[1] "Ã¶sterlich"
```
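This makes sense in a way: read one byte at a time in a single-byte encoding such as Latin-1 (ISO-8859-1), %C3 is Ã and %B6 is ¶. But as you can see at http://www.backbone.se/urlencodingUTF8.htm, the pair %C3%B6 represents the single character ö in UTF-8.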
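To make the two readings concrete, here is a minimal base R sketch. The variable names are illustrative only; it simply builds the two bytes that `%C3%B6` stands for and interprets them both ways.

```r
# The two percent-escapes decode to two raw bytes.
bytes <- as.raw(c(0xC3, 0xB6))
s <- rawToChar(bytes)

# Read as UTF-8, the byte pair is one character: "ö" (U+00F6, decimal 246).
as_utf8 <- s
Encoding(as_utf8) <- "UTF-8"
utf8ToInt(as_utf8)

# Read as Latin-1, each byte is its own character:
# "Ã" (U+00C3, decimal 195) followed by "¶" (U+00B6, decimal 182).
as_latin1 <- iconv(s, from = "latin1", to = "UTF-8")
utf8ToInt(as_latin1)
```

The first call should report a single code point and the second one two, which is the same doubling visible in the `Ã¶sterlich` output.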

Now the question is: how do I tell `URLdecode()` to use the UTF-8 table?

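I finally found a solution to this problem. Here is my use case and what I tried.

These strings were scraped from Wikipedia using rvest, so the data itself should be fine. All of them contain % but not all of them cause problems.
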
```r
# problem strings
problem_strs = c("Roscoe_%22Fatty%22_Arbuckle", "Michael_%22Atters%22_Attree",
  "J%C3%BCrgen_Becker", "Vicco_von_B%C3%BClow", "B%C3%BClent_Ceylan",
  "Se%C3%A1n_Cullen", "Chris_D%27Elia", "U%C4%9Fur_R%C4%B1fat_Karlova",
  "Mike_Kr%C3%BCger", "Andr%C3%A9s_L%C3%B3pez_Forero", "Mo%27Nique",
  "Jos%C3%A9_S%C3%A1nchez_Mota", "Dara_%C3%93_Briain", "Conan_O%27Brien",
  "Mike_O%27Brien_(actor)", "Carroll_O%27Connor", "Donald_O%27Connor",
  "Rosie_O%27Donnell", "Michael_O%27Donoghue", "Chris_O%27Dowd",
  "Ardal_O%27Hanlon", "Catherine_O%27Hara", "Patrice_O%27Neal",
  "Barunka_O%27Shaughnessy", "Raven-Symon%C3%A9", "Charles_%22Chic%22_Sale",
  "No%C3%ABl_Wells", "%22Weird_Al%22_Yankovic", "Cem_Y%C4%B1lmaz"
)
```

First, try the base R solution. For some reason it is not vectorized, so we use purrr:

```r
# utils::URLdecode
library(magrittr)  # provides %>%
problem_strs %>% purrr::map_chr(utils::URLdecode)

 [1] "Roscoe_\"Fatty\"_Arbuckle"  "Michael_\"Atters\"_Attree"  "JÃ¼rgen_Becker"        "Vicco_von_BÃ¼low"
 [5] "BÃ¼lent_Ceylan"             "SeÃ¡n_Cullen"               "Chris_D'Elia"           "UÄŸur_RÄ±fat_Karlova"
 [9] "Mike_KrÃ¼ger"               "AndrÃ©s_LÃ³pez_Forero"      "Mo'Nique"               "JosÃ©_SÃ¡nchez_Mota"
[13] "Dara_Ã“_Briain"             "Conan_O'Brien"              "Mike_O'Brien_(actor)"   "Carroll_O'Connor"
[17] "Donald_O'Connor"            "Rosie_O'Donnell"            "Michael_O'Donoghue"     "Chris_O'Dowd"
[21] "Ardal_O'Hanlon"             "Catherine_O'Hara"           "Patrice_O'Neal"         "Barunka_O'Shaughnessy"
[25] "Raven-SymonÃ©"              "Charles_\"Chic\"_Sale"      "NoÃ«l_Wells"            "\"Weird_Al\"_Yankovic"
[29] "Cem_YÄ±lmaz"
```

Comparing these with the inputs, a pattern shows up: the strings that break are the ones where a single character is written as two consecutive percent-escapes (for example %C3%BC), that is, a multi-byte UTF-8 character. Single escapes such as %22 and %27 decode fine. So I read all the questions related to URL decoding in R and found these suggested solutions:
```r
# urltools::url_decode
urltools::url_decode(problem_strs)
```

Same result as before.
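The pattern noted above can be checked mechanically. The sketch below is not part of the original answer: `has_high_byte_escape` is a hypothetical helper that flags strings containing a percent-escape for a byte of 0x80 or above, which is what a multi-byte UTF-8 character produces.

```r
# Hypothetical helper: TRUE when a string contains a percent-escape whose
# first hex digit is 8-F, i.e. a byte outside the ASCII range.
has_high_byte_escape <- function(x) {
  grepl("%[89A-Fa-f][0-9A-Fa-f]", x)
}

samples <- c("Roscoe_%22Fatty%22_Arbuckle", "Chris_D%27Elia",
             "J%C3%BCrgen_Becker", "Cem_Y%C4%B1lmaz")
has_high_byte_escape(samples)
# [1] FALSE FALSE  TRUE  TRUE
```

Only the last two samples are flagged, and those are the ones that come out garbled.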

What is the encoding? Try setting it to UTF-8:

```
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> #try to set
> Encoding(problem_strs) = "UTF-8"
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(problem_strs) = "utf8"
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> urltools::url_decode(problem_strs)
```

The output is the same as before.
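A likely reason the mark never sticks: before decoding, the strings are pure ASCII, and R does not attach an encoding mark to ASCII strings. The following sketch (variable names are illustrative) contrasts an ASCII string with one that really contains non-ASCII bytes.

```r
# Still percent-encoded, so every byte is ASCII: the mark is silently dropped.
encoded <- "J%C3%BCrgen_Becker"
Encoding(encoded) <- "UTF-8"
Encoding(encoded)
# [1] "unknown"

# The bytes 0x4A 0xC3 0xBC are "Jü" in UTF-8; here the mark is kept.
raw_bytes <- rawToChar(as.raw(c(0x4A, 0xC3, 0xBC)))
Encoding(raw_bytes) <- "UTF-8"
Encoding(raw_bytes)
# [1] "UTF-8"
```

This suggests that the encoding has to be set after decoding, once the non-ASCII bytes actually exist.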

Someone suggested another way to check and set it:

```
> problem_strs = iconv(problem_strs, from = "ASCII", to = "UTF-8")
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
```

I found another package in the list of suggestions:

```
> #Ruchardet to detect?
> Ruchardet::detectEncoding(problem_strs)
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

#Is it simpler than we thought?
urltools::url_decode(problem_strs) %>% urltools::url_decode()
```

Same output.
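
So I googled the specific pattern that causes the problem, e.g. %C3%BC. That turned up half an answer, written for PHP:

> First you need to urldecode it, which will give you Ã¼, which is the UTF-8 encoded representation of ü, so you should be all good.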

OK, let's try that in R:

```r
# URL-decode first, then mark the result as UTF-8
halfway = urltools::url_decode(problem_strs)
Encoding(halfway) = "UTF-8"
halfway
 [1] "Roscoe_\"Fatty\"_Arbuckle"  "Michael_\"Atters\"_Attree"  "Jürgen_Becker"          "Vicco_von_Bülow"
 [5] "Bülent_Ceylan"              "Seán_Cullen"                "Chris_D'Elia"           "Uğur_Rıfat_Karlova"
 [9] "Mike_Krüger"                "Andrés_López_Forero"        "Mo'Nique"               "José_Sánchez_Mota"
[13] "Dara_Ó_Briain"              "Conan_O'Brien"              "Mike_O'Brien_(actor)"   "Carroll_O'Connor"
[17] "Donald_O'Connor"            "Rosie_O'Donnell"            "Michael_O'Donoghue"     "Chris_O'Dowd"
[21] "Ardal_O'Hanlon"             "Catherine_O'Hara"           "Patrice_O'Neal"         "Barunka_O'Shaughnessy"
[25] "Raven-Symoné"               "Charles_\"Chic\"_Sale"      "Noël_Wells"             "\"Weird_Al\"_Yankovic"
[29] "Cem_Yılmaz"
```

Here it is as a reusable function:

```r
url_decode_utf = function(x) {
  # Percent-decode; the result holds the raw UTF-8 bytes but is unmarked
  y = urltools::url_decode(x)
  # Declare those bytes to be UTF-8 so R displays them correctly
  Encoding(y) = "UTF-8"
  y
}
```
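
For anyone who would rather avoid the urltools dependency, the same two steps can be sketched with base R only. This is an untested-in-the-wild variant rather than part of the original answer: the name `url_decode_utf8_base` is made up for illustration, and it assumes the percent-escapes in the input encode UTF-8 bytes.

```r
# Hypothetical base-R variant: percent-decode each element, then mark as UTF-8.
url_decode_utf8_base <- function(x) {
  # vapply does the element-wise loop, so this works whether or not
  # utils::URLdecode is vectorised in the installed R version.
  y <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # Declare the decoded bytes to be UTF-8 (ASCII-only elements stay unmarked).
  Encoding(y) <- "UTF-8"
  y
}

decoded <- url_decode_utf8_base(c("J%C3%BCrgen_Becker", "Mo%27Nique"))
decoded
```

With UTF-8 input as assumed, the two elements should print as Jürgen_Becker and Mo'Nique.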