R: Grouping similar addresses together

rsy*_*ian 8 r qdap dplyr stringdist tidyverse

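I have a file with 400,000 rows of manually entered addresses that need to be geocoded. The file contains many different variations of the same address, so spending multiple API calls on the same address seems wasteful.

To cut this down, I'd like to reduce these five rows: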

    Address
    1 Main Street, Country A, World
    1 Main St, Country A, World
    1 Maine St, Country A, World
    2 Side Street, Country A, World
    2 Side St. Country A, World

down to two:

    Address
    1 Main Street, Country A, World
    2 Side Street, Country A, World

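With the stringdist package you can group the "word" parts of the strings together, but string-matching algorithms do not treat digits any differently from letters. As a result, two different house numbers on the same street get grouped as the same address.

To work around this, I came up with two approaches. The first is to manually split the numbers and the rest of the address into separate columns with regular expressions and then rejoin them. The problem is that, with so many manually entered addresses, there seem to be hundreds of different edge cases, and it gets unwieldy.

Using this answer on grouping and this answer on converting words to numbers, I have a second approach that handles the edge cases but is computationally very expensive. Is there a better third way to do this?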

library(gsubfn)
library(english)
library(qdap)
library(stringdist)
library(tidyverse)


similarGroups <- function(x, thresh = 0.8, method = "lv"){
  grp <- integer(length(x))
  Address <- x
  x <- tolower(x)
  for(i in seq_along(Address)){
    if(!is.na(Address[i])){
      sim <- stringdist::stringsim(x[i], x, method = method)
      k <- which(sim > thresh & !is.na(Address))
      grp[k] <- i
      is.na(Address) <- k
    }
  }
  grp
}

df <- data.frame(Address = c("1 Main Street, Country A, World", 
                             "1 Main St, Country A, World", 
                             "1 Maine St, Country A, World", 
                             "2 Side Street, Country A, World", 
                             "2 Side St. Country A, World"))

df1 <- df %>%
  # Converts Numbers into Letters
  mutate(Address = replace_number(Address),
         # Groups Similar Addresses Together
         Address = Address[similarGroups(Address, thresh = 0.8, method = "lv")],
         # Converts Letters back into Numbers
         Address = gsubfn("\\w+", setNames(as.list(1:1000), as.english(1:1000)), Address)
  ) %>%
  # Removes the Duplicates
  unique()

Wal*_*ldi 5

stringdist::stringsimmatrix lets you compare the similarity between strings:

library(dplyr)
library(stringr)
df <- data.frame(Address = c("1 Main Street, Country A, World", 
                             "1 Main St, Country A, World", 
                             "3 Main St, Country A, World", 
                             "2 Side Street, Country A, World", 
                             "2 Side St. PO 5678 Country A,  World"))
                             
stringdist::stringsimmatrix(df$Address)
          1         2         3         4         5
1 1.0000000 0.8709677 0.8387097 0.8387097 0.5161290
2 0.8518519 1.0000000 0.9629630 0.6666667 0.4444444
3 0.8148148 0.9629630 1.0000000 0.6666667 0.4444444
4 0.8387097 0.7096774 0.7096774 1.0000000 0.6774194
5 0.5833333 0.5833333 0.5833333 0.7222222 1.0000000

As you pointed out, in the example above rows 2 and 3 are very similar by this criterion (96%), even though their house numbers differ.
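The 96% figure can be checked by hand: rows 2 and 3 are both 27 characters long and differ by a single substitution ("1" to "3"), so the normalised similarity is 1 - 1/27. A minimal sketch of that arithmetic, using base R's `utils::adist` rather than stringdist (my substitution, only to keep the snippet dependency-free):

```r
# Rows 2 and 3 from the example: they differ only in the house number
a <- "1 Main St, Country A, World"
b <- "3 Main St, Country A, World"

# Levenshtein distance from base R; here a single substitution
d <- utils::adist(a, b)[1, 1]

# Scale by the longer string's length to get a similarity in [0, 1]
sim <- 1 - d / max(nchar(a), nchar(b))
round(sim, 4)
```

The house number contributes one character out of 27, which is why an edit-distance threshold alone cannot separate the two addresses.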

You can add a second criterion that extracts the numbers from each string and compares how similar they are:

# Extract numbers
nums <- df %>% rowwise %>% mutate(numlist = str_extract_all(Address,"\\(?[0-9]+\\)?"))  

# Create numbers vectors pairs
numpairs <- expand.grid(nums$numlist, nums$numlist)

# Calculate similarity
numsim <- numpairs %>% rowwise %>% mutate(dist = length(intersect(Var1,Var2)) / length(union(Var1,Var2)))

# Return similarity matrix
matrix(numsim$dist,nrow(df))

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    0  0.0  0.0
[2,]    1    1    0  0.0  0.0
[3,]    0    0    1  0.0  0.0
[4,]    0    0    0  1.0  0.5
[5,]    0    0    0  0.5  1.0

By this new criterion, rows 2 and 3 are clearly different.
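The number criterion is a Jaccard index over the sets of digit runs found in each address: shared numbers divided by all distinct numbers. Here is a small standalone sketch; the function name `jaccard` and the `NA` guard for addresses without any digits are my additions, not part of the answer's code:

```r
# Jaccard index of two sets: |intersection| / |union|
jaccard <- function(a, b) {
  u <- union(a, b)
  # Neither address contains a number: 0/0 would be NaN, so return NA instead
  if (length(u) == 0) return(NA_real_)
  length(intersect(a, b)) / length(u)
}

rows_4_5 <- jaccard("2", c("2", "5678"))  # rows 4 and 5: one shared number out of two
rows_2_3 <- jaccard("1", "3")             # rows 2 and 3: no shared number
no_digits <- jaccard(character(0), character(0))
```

Rows 4 and 5 share "2" but only row 5 has "5678", which gives the 0.5 seen in the matrix. How to treat addresses with no digits at all (the `NA` case) is a judgment call that depends on your data.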

You can combine the two criteria to decide whether two addresses are similar enough, for example:

matrix(numsim$dist,nrow(df)) * stringdist::stringsimmatrix(df$Address)

          1         2 3         4         5
1 1.0000000 0.8709677 0 0.0000000 0.0000000
2 0.8518519 1.0000000 0 0.0000000 0.0000000
3 0.0000000 0.0000000 1 0.0000000 0.0000000
4 0.0000000 0.0000000 0 1.0000000 0.3387097
5 0.0000000 0.0000000 0 0.3611111 1.0000000

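With hundreds of thousands of addresses, expand.grid will not work on the whole dataset, but you can split the data by country (and parallelize over the pieces) to avoid an infeasible full Cartesian product.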
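One way to sketch the splitting idea is "blocking": only compare addresses that share a cheap key, then cluster within each block. The example below is a rough illustration under my own assumptions, not the answer's method: the key is the sequence of digit runs (in real data you would probably combine it with country), distances come from base R's `utils::adist`, and single-linkage `hclust` stands in for the greedy loop in the question. The names `block_key` and `cluster_block` are hypothetical.

```r
addresses <- c("1 Main Street, Country A, World",
               "1 Main St, Country A, World",
               "1 Maine St, Country A, World",
               "2 Side Street, Country A, World",
               "2 Side St. Country A, World")

# Blocking key (an assumption): every digit run in the address, in order
block_key <- vapply(regmatches(addresses, gregexpr("[0-9]+", addresses)),
                    paste, character(1), collapse = "-")

# Cluster one block: length-normalised Levenshtein distance,
# then single-linkage clustering cut at distance 1 - thresh
cluster_block <- function(x, thresh = 0.8) {
  if (length(x) < 2) return(seq_along(x))   # hclust needs at least 2 items
  x <- tolower(x)
  d <- utils::adist(x) / outer(nchar(x), nchar(x), pmax)
  cutree(hclust(as.dist(d), method = "single"), h = 1 - thresh)
}

# Cluster ids are only unique within a block, so combine them with the key
within_block <- ave(seq_along(addresses), block_key,
                    FUN = function(i) cluster_block(addresses[i]))
group_id <- paste(block_key, within_block, sep = "/")

# Keep the first spelling seen in each group as its representative
deduped <- addresses[!duplicated(group_id)]
deduped
```

Because pairwise distances are only computed inside a block, addresses with different house numbers never meet, and the quadratic cost applies per block rather than to all 400,000 rows. Blocks can also be processed independently, for example with `parallel::mclapply`.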


cia*_*ovx 4

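It may be worth looking into OpenRefine, or the refinr package, which is much less visual but still good. refinr has two functions, key_collision_merge and n_gram_merge, each with several parameters. If you have a good dictionary of addresses, you can pass it to key_collision_merge.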

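It also helps to note the abbreviations you see most often (St., Blvd., Rd., etc.) and replace all of them. There are good tables of these abbreviations, for example https://www.pb.com/docs/US/pdf/SIS/Mail-Services/USPS-Suffix-Abbreviations.pdf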
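If the list of abbreviations grows, a lookup table is easier to maintain than one regular expression per suffix. A minimal base R sketch; the `suffixes` table and the `expand_suffixes` helper are hypothetical names, and the four entries are only placeholders for a fuller list:

```r
# Hypothetical lookup table: abbreviation -> full word
suffixes <- c(St = "Street", Rd = "Road", Blvd = "Boulevard", Ave = "Avenue")

expand_suffixes <- function(x, dict = suffixes) {
  for (abbr in names(dict)) {
    # \\b stops "St" from matching inside longer words such as "Stone";
    # \\.? also removes an optional trailing period
    pattern <- paste0("\\b", abbr, "\\b\\.?")
    x <- gsub(pattern, dict[[abbr]], x, perl = TRUE)
  }
  x
}

expanded <- expand_suffixes(c("1 Main St, Country A, World",
                              "2 Side St. Country A, World",
                              "3 Stone Rd. Country A, World",
                              "4 Park Street, Country A, World"))
expanded
```

One caveat: "St" can also stand for "Saint" (as in "St. Louis"), so a blanket replacement like this may need exceptions for place names.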

Then:

library(dplyr)    # tibble(), mutate(), %>%
library(stringr)  # str_replace_all()
library(refinr)
df <- tibble(Address = c("1 Main Street, Country A, World", 
                             "1 Main St, Country A, World", 
                             "1 Maine St, Country A, World", 
                             "2 Side Street, Country A, World", 
                             "2 Side St. Country A, World",
                              "3 Side Rd. Country A, World",
                              "3 Side Road Country B World"))
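df2 <- df %>%
  # Expand abbreviations; note that each pattern also consumes the
  # ".", "," or whitespace character that follows the abbreviation
  mutate(address_fix = str_replace_all(Address, "St\\.|St\\,|St\\s", "Street"),
         address_fix = str_replace_all(address_fix, "Rd\\.|Rd\\,|Rd\\s", "Road")) %>%
  # Merge near-duplicate strings using 1-gram fingerprints
  mutate(address_merge = n_gram_merge(address_fix, numgram = 1))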

df2$address_merge
[1] "1 Main Street Country A, World"
[2] "1 Main Street Country A, World"
[3] "1 Main Street Country A, World"
[4] "2 Side Street Country A, World"
[5] "2 Side Street Country A, World"
[6] "3 Side Road Country A, World"  
[7] "3 Side Road Country B World"   

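  • The "postmastr" package includes a dictionary of street suffixes. It is optimized for US addresses, so I'm not sure how well it would work for the format you gave (street, country, world), but it may be worth a look: https://slu-opengis.github.io/postmastr/articles/postmastr.html and https://github.com/slu-openGIS/postmastr `remotes::install_github("slu-openGIS/postmastr") View(dic_us_suffix)`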