How to fix encoding in a data frame regardless of row or column in R (using dplyr)?

use*_*667 1 r dplyr

The original title, "How to replace strings", was updated to "fix encoding", because that is the question answered here.

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] countrycode_0.17 dplyr_0.2       

loaded via a namespace (and not attached):
[1] assertthat_0.1 magrittr_1.0.1 parallel_3.1.1 Rcpp_0.11.3    tools_3.1.1 

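While importing from a web source, I ended up with a data frame that contains some errors. I want to replace those strings with what I know to be correct. I am learning R and dplyr, so knowing how to do this will probably help me with larger data-cleaning problems.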

The errors are in rows 20 and 31: we see "UniversitÃ¤t" instead of "Universität" and "LinkÃ¶ping" instead of "Linköping".

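I know I could look up the row and column and replace the values there, but with a larger data frame or dataset I would not be able to find every instance.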

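Since I know what the difference is, I want to search for the word itself and replace it. Just that word. I know it is part of a longer string, but I still want to touch only that part of the string. Can I do that?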

Any help is appreciated. I would also be grateful for all possible approaches to this problem, both simple and complex, as that will help me learn faster. Thanks.

                                     name     country
1                             TU Dortmund     Germany
2                             TU Dortmund     Germany
3                   Maastricht University Netherlands
4         University of the Fraser Valley      Canada
5                      Queen's University      Canada
6                       Aarhus University     Denmark
7                   University Of Alberta      Canada
8                       Deakin University   Australia
9                    Macquarie University   Australia
10 National University Of Ireland, Galway     Ireland
11                      Vienna University     Austria
12       National University of Singapore   Singapore
13                     Erasmus University Netherlands
14          Radboud Universiteit Nijmegen Netherlands
15           Vrije Universiteit Amsterdam Netherlands
16                    University of Otago New Zealand
17            National College Of Ireland     Ireland
18                University College Cork     Ireland
19             Irish Management Institute     Ireland
20                  UniversitÃ¤t Konstanz     Germany
21 Otto Von Guericke University Magdeburg     Germany
22        University of Technology Sydney   Australia
23                 Dublin City University     Ireland
24 Institute Of Technology Blanchardstown     Ireland
25      Kth Royal Institute Of Technology      Sweden
26                       Aalto University     Finland
27                     Dalarna University      Sweden
28                 University Of Helsinki     Finland
29                      Aarhus University     Denmark
30              University College Dublin     Ireland
31                  LinkÃ¶ping University      Sweden
32                     Aalborg University     Denmark
33         Dublin Institute Of Technology     Ireland
34                        York University      Canada
35                  Maastricht University Netherlands
36                     Utrecht University Netherlands

akr*_*run 5

You can correct this in a couple of ways.

  1. Read the file with the correct encoding (UTF-8)

    # read.csv2() assumes ';' separators and ',' decimals; use read.csv() for comma-separated files
    read.csv2(file("filename.csv", encoding="UTF-8"))
    
  2. After reading the file, apply a function that converts to UTF-8 encoding

    library(stringi)
    # "" as the source encoding means the session's default (native) encoding
    df[] <- lapply(df, function(x) stri_encode(x, "", "UTF-8"))
    
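If the file cannot be re-read, a base R variant of the second approach may also work. The sketch below is not from the answer above: it assumes the "Ã¤" and "Ã¶" sequences come from UTF-8 bytes that were decoded as a single-byte encoding (Latin-1 is used here as a stand-in for Windows-1252, which agrees with it for these characters). The helper name `fix_mojibake` and the demo data are hypothetical, and `validUTF8()` needs R 3.3.0 or later.

```r
# Hypothetical demo data: correct names written with \u escapes so the
# script itself stays ASCII.
good <- c("Universit\u00e4t Konstanz", "Link\u00f6ping University", "TU Dortmund")

# Reproduce the damage: treat the UTF-8 bytes as Latin-1 and re-encode them.
# (Assumption: this is how the broken names in the question arose.)
bad <- iconv(good, from = "latin1", to = "UTF-8")
df <- data.frame(name = bad,
                 country = c("Germany", "Sweden", "Germany"),
                 rank = 1:3,
                 stringsAsFactors = FALSE)

# fix_mojibake() is a hypothetical helper, not part of any package.
# It undoes the wrong decoding step and then marks the bytes as UTF-8.
fix_mojibake <- function(x) {
  fixed <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(fixed) <- "UTF-8"
  # Keep the original value where the round trip failed or did not
  # produce valid UTF-8 (for example, a string that was already correct).
  failed <- is.na(fixed) | !validUTF8(fixed)
  fixed[failed] <- x[failed]
  fixed
}

# Apply the helper to every character column, whatever its position;
# other column types are returned untouched.
df[] <- lapply(df, function(col) if (is.character(col)) fix_mojibake(col) else col)

print(df)
```

Because the helper is applied column by column, no row or column index is needed. With dplyr 1.0 or later, the same helper could presumably be applied as `mutate(df, across(where(is.character), fix_mojibake))`. Characters that exist in Windows-1252 but not in Latin-1 (such as the euro sign) would not survive this round trip and are left as they were.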