R:使用plyr在两个数据源的匹配子集之间执行模糊字符串匹配

mcj*_*udd 5 r fuzzy-comparison plyr dplyr

假设我有一系列具有不同拼写错误或其他问题的县,这些问题与2010 FIPS数据集(fips下面创建数据框的代码)区别开来,但正确输入拼写错误的县所在的州.这是sample我的完整数据集中的21个随机观察结果:

tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
                                     "Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline", 
                                     "Richmond City", "Saint Mary", "Saint Louis City", "Mclean", 
                                     "Union", "Bienville", "Covington City", "Martinsville City", 
                                     "Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA", 
             "LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN", 
             "MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
             ), row.names = c(NA, -21L))

              county state
1         Beauregard    LA
2            De Soto    LA
3             Dekalb    GA
4            Webster    LA
5       Saint Joseph    IN
6     West Feliciana    LA
7  Ketchikan Gateway    AK
8         Evangeline    LA
9      Richmond City    VA
10        Saint Mary    LA
11  Saint Louis City    MO
12            Mclean    KY
13             Union    LA
14         Bienville    LA
15    Covington City    VA
16 Martinsville City    VA
17         Claiborne    LA
18    King And Queen    VA
19            Mclean    ND
20            Mcminn    TN
21    Prince Georges    MD
Run Code Online (Sandbox Code Playgroud)

我曾经adist创建了一个模糊的字符串匹配算法,它将我县中80%的县与县名相匹配fips.然而,有时它会匹配两个拼写相似的县,但来自不同的州(例如,"Webster,LA"与"Webster,GA"相匹配,而不是"Webster Parrish,LA").

distance <- adist(tomatch$county, 
                  fips$countyname, 
                  partial = TRUE)


min.name <- apply(distance, 1, min)

matchedcounties <- NULL  

for(i in 1:nrow(distance)) {

  s2.i <- match(min.name[i], distance[i, ])
  s1.i <- i

  matchedcounties <- rbind(data.frame(s2.i = s2.i,
                                      s1.i = s1.i,
                                      s1name = tomatch[s1.i, ]$county, 
                                      s2name = fips[s2.i, ]$countyname, 
                                      adist = min.name[i]),
                           matchedcounties)

}
Run Code Online (Sandbox Code Playgroud)

因此,我想将县的模糊字符串匹配限制为具有匹配状态的正确拼写版本.

我当前的算法制作了一个大矩阵,用于计算两个源之间的标准Levenshtein距离,然后选择具有最小距离的值.

为了解决我的问题,我猜我需要创建一个可以应用于每个'状态'组的函数ddply,但是我很困惑我应该如何指示ddply函数中的组值应该与另一个相匹配数据帧.甲dplyr使用任何其他包溶液或溶液将以及理解.

用于创建FIPS数据集的代码:

download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
              './nationalfips.txt')

fips <- read.csv('./nationalfips.txt', 
                 stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')

# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
Run Code Online (Sandbox Code Playgroud)

Mat*_*ttV 3

这是 dplyr 的一种方法。我首先按tomatch州将 data.frame 与 FIPS 名称连接起来(仅允许州内匹配):

require(dplyr)
df <- tomatch %>% 
  left_join(fips, by="state")
Run Code Online (Sandbox Code Playgroud)

接下来,我注意到很多县没有“圣”,而是“圣”。在 FIPS 数据集中。首先清理干净应该可以改善获得的结果。

df <- df %>%
    mutate(county_clean = gsub("Saint", "St.", county))
Run Code Online (Sandbox Code Playgroud)

然后,按县对这个 data.frame 进行分组,并使用 adist 计算距离:

df <- df %>%
  group_by(county_clean) %>%                # Calculate the distance per county
  mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
  arrange(county, dist) # Used this for visual inspection.
Run Code Online (Sandbox Code Playgroud)

请注意,我从结果矩阵中取出对角线,因为 adist 返回一个 nxm 矩阵,其中 n 代表 x 向量,m 代表 y 向量(它计算所有组合)。或者,您可以添加 agrep 结果:

df <- df %>%
  rowwise() %>% # 'group_by' a single row. 
  mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup()   # Always a good idea to remove 'groups' after you're done.
Run Code Online (Sandbox Code Playgroud)

然后像之前一样进行过滤,取最小距离:

df <- df %>%
  group_by(county_clean) %>%   # Causes it to calculate the 'min' per group
  filter(dist == min(dist)) %>%
  ungroup()
Run Code Online (Sandbox Code Playgroud)

请注意,这可能会导致为 中的每一输入行返回多行tomatch
或者,在一次运行中完成所有操作(一旦我确信它正在执行它应该执行的操作,我通常会将代码更改为这种格式):

df <- tomatch %>% 
  # Join on all names in the relevant state and clean 'St.'
  left_join(fips, by="state") %>%
  mutate(county_clean = gsub("Saint", "St.", county)) %>% 

  # Calculate the distances, per original county name.
  group_by(county_clean) %>%                
  mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%

  # Append the agrepl result
  rowwise() %>%
  mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup() %>%  

  # Only retain minimum distances
  group_by(county_clean) %>%   
  filter(dist == min(dist))
Run Code Online (Sandbox Code Playgroud)

两种情况的结果:

              county      county_clean state                countyname dist string_agrep
1         Beauregard        Beauregard    LA         Beauregard Parish    0         TRUE
2            De Soto           De Soto    LA            De Soto Parish    0         TRUE
3             Dekalb            Dekalb    GA                    DeKalb    1         TRUE
4            Webster           Webster    LA            Webster Parish    0         TRUE
5       Saint Joseph        St. Joseph    IN                St. Joseph    0         TRUE
6     West Feliciana    West Feliciana    LA     West Feliciana Parish    0         TRUE
7  Ketchikan Gateway Ketchikan Gateway    AK Ketchikan Gateway Borough    0         TRUE
8         Evangeline        Evangeline    LA         Evangeline Parish    0         TRUE
9      Richmond City     Richmond City    VA             Richmond city    1         TRUE
10        Saint Mary          St. Mary    LA           St. Mary Parish    0         TRUE
11  Saint Louis City    St. Louis City    MO            St. Louis city    1         TRUE
12            Mclean            Mclean    KY                    McLean    1         TRUE
13             Union             Union    LA              Union Parish    0         TRUE
14         Bienville         Bienville    LA          Bienville Parish    0         TRUE
15    Covington City    Covington City    VA            Covington city    1         TRUE
16 Martinsville City Martinsville City    VA         Martinsville city    1         TRUE
17         Claiborne         Claiborne    LA          Claiborne Parish    0         TRUE
18    King And Queen    King And Queen    VA            King and Queen    1         TRUE
19            Mclean            Mclean    ND                    McLean    1         TRUE
20            Mcminn            Mcminn    TN                    McMinn    1         TRUE
21    Prince Georges    Prince Georges    MD           Prince George's    1         TRU  
Run Code Online (Sandbox Code Playgroud)