mcj*_*udd 5 r fuzzy-comparison plyr dplyr
假设我有一系列具有不同拼写错误或其他问题的县,这些问题与2010 FIPS数据集(fips下面创建数据框的代码)区别开来,但正确输入拼写错误的县所在的州.这是sample我的完整数据集中的21个随机观察结果:
tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
"Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline",
"Richmond City", "Saint Mary", "Saint Louis City", "Mclean",
"Union", "Bienville", "Covington City", "Martinsville City",
"Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA",
"LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN",
"MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -21L))
county state
1 Beauregard LA
2 De Soto LA
3 Dekalb GA
4 Webster LA
5 Saint Joseph IN
6 West Feliciana LA
7 Ketchikan Gateway AK
8 Evangeline LA
9 Richmond City VA
10 Saint Mary LA
11 Saint Louis City MO
12 Mclean KY
13 Union LA
14 Bienville LA
15 Covington City VA
16 Martinsville City VA
17 Claiborne LA
18 King And Queen VA
19 Mclean ND
20 Mcminn TN
21 Prince Georges MD
Run Code Online (Sandbox Code Playgroud)
我曾经adist创建了一个模糊的字符串匹配算法,它将我县中80%的县与县名相匹配fips.然而,有时它会匹配两个拼写相似的县,但来自不同的州(例如,"Webster,LA"与"Webster,GA"相匹配,而不是"Webster Parrish,LA").
distance <- adist(tomatch$county,
fips$countyname,
partial = TRUE)
min.name <- apply(distance, 1, min)
matchedcounties <- NULL
for(i in 1:nrow(distance)) {
s2.i <- match(min.name[i], distance[i, ])
s1.i <- i
matchedcounties <- rbind(data.frame(s2.i = s2.i,
s1.i = s1.i,
s1name = tomatch[s1.i, ]$county,
s2name = fips[s2.i, ]$countyname,
adist = min.name[i]),
matchedcounties)
}
Run Code Online (Sandbox Code Playgroud)
因此,我想将县的模糊字符串匹配限制为具有匹配状态的正确拼写版本.
我当前的算法制作了一个大矩阵,用于计算两个源之间的标准Levenshtein距离,然后选择具有最小距离的值.
为了解决我的问题,我猜我需要创建一个可以应用于每个'状态'组的函数ddply,但是我很困惑我应该如何指示ddply函数中的组值应该与另一个相匹配数据帧.甲dplyr使用任何其他包溶液或溶液将以及理解.
用于创建FIPS数据集的代码:
download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
'./nationalfips.txt')
fips <- read.csv('./nationalfips.txt',
stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')
# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
Run Code Online (Sandbox Code Playgroud)
这是 dplyr 的一种方法。我首先按tomatch州将 data.frame 与 FIPS 名称连接起来(仅允许州内匹配):
require(dplyr)
df <- tomatch %>%
left_join(fips, by="state")
Run Code Online (Sandbox Code Playgroud)
接下来,我注意到很多县没有“圣”,而是“圣”。在 FIPS 数据集中。首先清理干净应该可以改善获得的结果。
df <- df %>%
mutate(county_clean = gsub("Saint", "St.", county))
Run Code Online (Sandbox Code Playgroud)
然后,按县对这个 data.frame 进行分组,并使用 adist 计算距离:
df <- df %>%
group_by(county_clean) %>% # Calculate the distance per county
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
arrange(county, dist) # Used this for visual inspection.
Run Code Online (Sandbox Code Playgroud)
请注意,我从结果矩阵中取出对角线,因为 adist 返回一个 nxm 矩阵,其中 n 代表 x 向量,m 代表 y 向量(它计算所有组合)。或者,您可以添加 agrep 结果:
df <- df %>%
rowwise() %>% # 'group_by' a single row.
mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() # Always a good idea to remove 'groups' after you're done.
Run Code Online (Sandbox Code Playgroud)
然后像之前一样进行过滤,取最小距离:
df <- df %>%
group_by(county_clean) %>% # Causes it to calculate the 'min' per group
filter(dist == min(dist)) %>%
ungroup()
Run Code Online (Sandbox Code Playgroud)
请注意,这可能会导致为 中的每一输入行返回多行tomatch。
或者,在一次运行中完成所有操作(一旦我确信它正在执行它应该执行的操作,我通常会将代码更改为这种格式):
df <- tomatch %>%
# Join on all names in the relevant state and clean 'St.'
left_join(fips, by="state") %>%
mutate(county_clean = gsub("Saint", "St.", county)) %>%
# Calculate the distances, per original county name.
group_by(county_clean) %>%
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
# Append the agrepl result
rowwise() %>%
mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() %>%
# Only retain minimum distances
group_by(county_clean) %>%
filter(dist == min(dist))
Run Code Online (Sandbox Code Playgroud)
两种情况的结果:
county county_clean state countyname dist string_agrep
1 Beauregard Beauregard LA Beauregard Parish 0 TRUE
2 De Soto De Soto LA De Soto Parish 0 TRUE
3 Dekalb Dekalb GA DeKalb 1 TRUE
4 Webster Webster LA Webster Parish 0 TRUE
5 Saint Joseph St. Joseph IN St. Joseph 0 TRUE
6 West Feliciana West Feliciana LA West Feliciana Parish 0 TRUE
7 Ketchikan Gateway Ketchikan Gateway AK Ketchikan Gateway Borough 0 TRUE
8 Evangeline Evangeline LA Evangeline Parish 0 TRUE
9 Richmond City Richmond City VA Richmond city 1 TRUE
10 Saint Mary St. Mary LA St. Mary Parish 0 TRUE
11 Saint Louis City St. Louis City MO St. Louis city 1 TRUE
12 Mclean Mclean KY McLean 1 TRUE
13 Union Union LA Union Parish 0 TRUE
14 Bienville Bienville LA Bienville Parish 0 TRUE
15 Covington City Covington City VA Covington city 1 TRUE
16 Martinsville City Martinsville City VA Martinsville city 1 TRUE
17 Claiborne Claiborne LA Claiborne Parish 0 TRUE
18 King And Queen King And Queen VA King and Queen 1 TRUE
19 Mclean Mclean ND McLean 1 TRUE
20 Mcminn Mcminn TN McMinn 1 TRUE
21 Prince Georges Prince Georges MD Prince George's 1 TRU
Run Code Online (Sandbox Code Playgroud)