将城市名称和地理位置数据添加到数据框

Question

将城市名称和地理位置数据添加到数据框

我有一个包含超过20.000个观测值的数据集,基本上看起来像这样:

df <- data.frame(
    user = c("ABC", "DEF", "GHI"),
    location = c("Chicago, the windy city", "Oxford University", "Paris")
)

Run Code Online (Sandbox Code Playgroud)

我要添加其他三列city,long,lat并填写这些列与该城市名称,而geolocations(经度和纬度).

因此我想使用maps包及其world.cities数据库:

library(maps)
data(world.cities)

Run Code Online (Sandbox Code Playgroud)

如果城市名称以location正确的方式显示,则添加城市名称和地理位置并不困难.然而,他们中的大多数确实有其他字符串(例如"芝加哥,多风的城市").

如何根据world.cities数据库提取城市名称,并将真实的城市名称写入列city和地理定位到long和lat？

Answer 1

Ste*_*pré 11

按照在评论中提到的@Heroka,如果城市的名称总是第一个字符串location,你可以使用提取的第一个字符串stringi,left_join该world.cities数据,并在比赛中人口最多的过滤器.

library(stringi)
library(dplyr)

df %>%
  mutate(city = stri_extract_first_words(location)) %>%
  left_join(world.cities, by = c("city" = "name")) %>%
  group_by(city) %>%
  filter(row_number(desc(pop)) == 1)

Run Code Online (Sandbox Code Playgroud)

这使:

#Source: local data frame [3 x 8]
#Groups: city [3]
#
#    user                location    city country.etc     pop   lat   long capital
#  (fctr)                  (fctr)   (chr)       (chr)   (int) (dbl)  (dbl)   (int)
#1    ABC Chicago, the windy city Chicago         USA 2830144 41.84 -87.68       0
#2    DEF       Oxford University  Oxford          UK  157568 51.76  -1.26       0
#3    GHI                   Paris   Paris      France 2141839 48.86   2.34       1

Run Code Online (Sandbox Code Playgroud)

更新

如果城市的名称并不总是第一个字符串location,您可以首先尝试将字词location与字典(此处为nameworld.cities中的列)匹配,然后使用返回的匹配TRUE作为您的位置名称.这是一个快速实现(我向您添加了"伦敦大学学院"案例数据框)

> df
#  user                  location
#1  ABC   Chicago, the windy city
#2  DEF         Oxford University
#3  GHI                     Paris
#4  JKL University College London

Run Code Online (Sandbox Code Playgroud)

对于每一行,我们提取所有的单词location,并将它们存储在列表中lst,遍历它来寻找匹配的位置name在world.cities,它在存储p,最后提取对应于位置的元素p中lst,并存储在city

df %>%
  mutate(lst = stri_extract_all_words(location),
         p = sapply(lst, function (x) which(x %in% world.cities$name), simplify=TRUE)) %>%
  mutate(city = sapply(1:length(lst), function(x) .$lst[[x]][.$p[x]])) %>%
  left_join(world.cities, by = c("city" = "name")) %>%
  group_by(city) %>%
  filter(row_number(desc(pop)) == 1)

Run Code Online (Sandbox Code Playgroud)

您还可以删除临时列p并lst添加... %>% select(-lst, -p)

更新2

这不应该破坏格式错误的单词,但不适用于"纽约"案例:

df %>%
  mutate(
    city = lapply(stri_extract_all_words(location), 
                  function (x) { world.cities$name[match(x, world.cities$name)] })) %>%
  tidyr::unnest(city) %>%
  filter(!is.na(city)) %>%
  left_join(world.cities, by = c("city" = "name")) %>%
  group_by(city) %>%
  filter(row_number(desc(pop)) == 1)

Run Code Online (Sandbox Code Playgroud)

更新3

这适用于所有情况:

> df
#  user                  location
#1  ABC   Chicago, the windy city
#2  DEF         Oxford University
#3  GHI                     Paris
#4  JKL                  New York
#5  MNO                  m0ntr3al
#6  PQR University College London

df$l <- gsub("[^[:alnum:]]+", " ", df$location)
lst  <- lapply(world.cities$name, function (x) { grep(x, df$l, value = TRUE) })
m    <- data.table::melt(lst)

df %>% 
  left_join(m, by = c("l" = "value")) %>%
  left_join(world.cities %>% 
              add_rownames %>% 
              mutate(rowname = as.numeric(rowname)), 
            by = c("L1" = "rowname")) %>% 
  tidyr::replace_na(list(pop = 0)) %>%
  group_by(location) %>%
  filter(row_number(desc(pop)) == 1) %>%
  select(-(l:L1))

Run Code Online (Sandbox Code Playgroud)

这使:

#Source: local data frame [6 x 8]
#Groups: location [6]
#
#    user                  location     name country.etc     pop   lat   long capital
#  (fctr)                    (fctr)    (chr)       (chr)   (dbl) (dbl)  (dbl)   (int)
#1    ABC   Chicago, the windy city  Chicago         USA 2830144 41.84 -87.68       0
#2    DEF         Oxford University   Oxford          UK  157568 51.76  -1.26       0
#3    GHI                     Paris    Paris      France 2141839 48.86   2.34       1
#4    JKL                  New York New York         USA 8124427 40.67 -73.94       0
#5    MNO                  m0ntr3al       NA          NA       0    NA     NA      NA
#6    PQR Univeristy College London   London          UK 7489022 51.52  -0.10       1

Run Code Online (Sandbox Code Playgroud)

归档时间：	10 年，4 月前
查看次数：	231 次
最近记录：	10 年，4 月前