fed*_*r80 2 regex maps r geolocation
我有一个包含超过20.000个观测值的数据集,基本上看起来像这样:
df <- data.frame(
user = c("ABC", "DEF", "GHI"),
location = c("Chicago, the windy city", "Oxford University", "Paris")
)
Run Code Online (Sandbox Code Playgroud)
我要添加其他三列city,long,lat并填写这些列与该城市名称,而geolocations(经度和纬度).
因此我想使用maps包及其world.cities数据库:
library(maps)
data(world.cities)
Run Code Online (Sandbox Code Playgroud)
如果城市名称以location正确的方式显示,则添加城市名称和地理位置并不困难.然而,他们中的大多数确实有其他字符串(例如"芝加哥,多风的城市").
如何根据world.cities数据库提取城市名称,并将真实的城市名称写入列city和地理定位到long和lat?
Ste*_*pré 11
按照在评论中提到的@Heroka,如果城市的名称总是第一个字符串location,你可以使用提取的第一个字符串stringi,left_join该world.cities数据,并在比赛中人口最多的过滤器.
library(stringi)
library(dplyr)
df %>%
mutate(city = stri_extract_first_words(location)) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
Run Code Online (Sandbox Code Playgroud)
这使:
#Source: local data frame [3 x 8]
#Groups: city [3]
#
# user location city country.etc pop lat long capital
# (fctr) (fctr) (chr) (chr) (int) (dbl) (dbl) (int)
#1 ABC Chicago, the windy city Chicago USA 2830144 41.84 -87.68 0
#2 DEF Oxford University Oxford UK 157568 51.76 -1.26 0
#3 GHI Paris Paris France 2141839 48.86 2.34 1
Run Code Online (Sandbox Code Playgroud)
更新
如果城市的名称并不总是第一个字符串location,您可以首先尝试将字词location与字典(此处为nameworld.cities中的列)匹配,然后使用返回的匹配TRUE作为您的位置名称.这是一个快速实现(我向您添加了"伦敦大学学院"案例数据框)
> df
# user location
#1 ABC Chicago, the windy city
#2 DEF Oxford University
#3 GHI Paris
#4 JKL University College London
Run Code Online (Sandbox Code Playgroud)
对于每一行,我们提取所有的单词location,并将它们存储在列表中lst,遍历它来寻找匹配的位置name在world.cities,它在存储p,最后提取对应于位置的元素p中lst,并存储在city
df %>%
mutate(lst = stri_extract_all_words(location),
p = sapply(lst, function (x) which(x %in% world.cities$name), simplify=TRUE)) %>%
mutate(city = sapply(1:length(lst), function(x) .$lst[[x]][.$p[x]])) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
Run Code Online (Sandbox Code Playgroud)
您还可以删除临时列p并lst添加... %>% select(-lst, -p)
更新2
这不应该破坏格式错误的单词,但不适用于"纽约"案例:
df %>%
mutate(
city = lapply(stri_extract_all_words(location),
function (x) { world.cities$name[match(x, world.cities$name)] })) %>%
tidyr::unnest(city) %>%
filter(!is.na(city)) %>%
left_join(world.cities, by = c("city" = "name")) %>%
group_by(city) %>%
filter(row_number(desc(pop)) == 1)
Run Code Online (Sandbox Code Playgroud)
更新3
这适用于所有情况:
> df
# user location
#1 ABC Chicago, the windy city
#2 DEF Oxford University
#3 GHI Paris
#4 JKL New York
#5 MNO m0ntr3al
#6 PQR University College London
df$l <- gsub("[^[:alnum:]]+", " ", df$location)
lst <- lapply(world.cities$name, function (x) { grep(x, df$l, value = TRUE) })
m <- data.table::melt(lst)
df %>%
left_join(m, by = c("l" = "value")) %>%
left_join(world.cities %>%
add_rownames %>%
mutate(rowname = as.numeric(rowname)),
by = c("L1" = "rowname")) %>%
tidyr::replace_na(list(pop = 0)) %>%
group_by(location) %>%
filter(row_number(desc(pop)) == 1) %>%
select(-(l:L1))
Run Code Online (Sandbox Code Playgroud)
这使:
#Source: local data frame [6 x 8]
#Groups: location [6]
#
# user location name country.etc pop lat long capital
# (fctr) (fctr) (chr) (chr) (dbl) (dbl) (dbl) (int)
#1 ABC Chicago, the windy city Chicago USA 2830144 41.84 -87.68 0
#2 DEF Oxford University Oxford UK 157568 51.76 -1.26 0
#3 GHI Paris Paris France 2141839 48.86 2.34 1
#4 JKL New York New York USA 8124427 40.67 -73.94 0
#5 MNO m0ntr3al NA NA 0 NA NA NA
#6 PQR Univeristy College London London UK 7489022 51.52 -0.10 1
Run Code Online (Sandbox Code Playgroud)