Dav*_*ave 15 python geopy pandas
I have a df:
import pandas as pd
import numpy as np
import datetime as DT
import hmac
from geopy.geocoders import Nominatim
from geopy.distance import vincenty
df
     city_name  state_name  county_name
0    WASHINGTON  DC  DIST OF COLUMBIA
1    WASHINGTON  DC  DIST OF COLUMBIA
2    WASHINGTON  DC  DIST OF COLUMBIA
3    WASHINGTON  DC  DIST OF COLUMBIA
4    WASHINGTON  DC  DIST OF COLUMBIA
5    WASHINGTON  DC  DIST OF COLUMBIA
6    WASHINGTON  DC  DIST OF COLUMBIA
7    WASHINGTON  DC  DIST OF COLUMBIA
8    WASHINGTON  DC  DIST OF COLUMBIA
9    WASHINGTON  DC  DIST OF COLUMBIA
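I would like to get the latitude and longitude coordinates for any one of the columns in this data frame. The documentation (http://geopy.readthedocs.org/en/latest/#data) is straightforward when working with individual locations.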
>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim()
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York,     ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}
But I want to apply the function to every row in the df and create a new column. I tried the following:
df['city_coord'] = geolocator.geocode(lambda row: 'state_name' (row))
But I think something is missing in my code, because I get the following:
    city_name   state_name  county_name coordinates
0    WASHINGTON  DC  DIST OF COLUMBIA    None
1    WASHINGTON  DC  DIST OF COLUMBIA    None
2    WASHINGTON  DC  DIST OF COLUMBIA    None
3    WASHINGTON  DC  DIST OF COLUMBIA    None
4    WASHINGTON  DC  DIST OF COLUMBIA    None
5    WASHINGTON  DC  DIST OF COLUMBIA    None
6    WASHINGTON  DC  DIST OF COLUMBIA    None
7    WASHINGTON  DC  DIST OF COLUMBIA    None
8    WASHINGTON  DC  DIST OF COLUMBIA    None
9    WASHINGTON  DC  DIST OF COLUMBIA    None
Using a lambda function, I was hoping for something like this:
     city_name  state_name  county_name  city_coord
0    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
1    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
2    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
3    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
4    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
5    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
6    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
7    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
8    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
9    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456
10   GLYNCO      GA  GLYNN               31.2224512, -81.5101023
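I would appreciate any help. Once I have the coordinates, I would like to map them. Any recommended resources for mapping coordinates are also very welcome. Thanks.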
EdC*_*ica 14
You can call apply and pass the function you want to run on each row, like this:
In [9]:
geolocator = Nominatim()
df['city_coord'] = df['state_name'].apply(geolocator.geocode)
df
Out[9]:
    city_name state_name       county_name  \
0  WASHINGTON         DC  DIST OF COLUMBIA   
1  WASHINGTON         DC  DIST OF COLUMBIA   
                                          city_coord  
0  (District of Columbia, United States of Americ...  
1  (District of Columbia, United States of Americ...  
You can then access the latitude and longitude attributes:
In [16]:
df['city_coord'] = df['city_coord'].apply(lambda x: (x.latitude, x.longitude))
df
Out[16]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
Or do it in a one-liner by calling apply twice:
In [17]:
df['city_coord'] = df['state_name'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df
Out[17]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
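Also, your attempt geolocator.geocode(lambda row: 'state_name' (row)) did not do anything useful: it passes the lambda object itself to geocode as the query instead of calling geocode on each row's value, which is why you ended up with a column full of None values.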
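One caveat with the lambda x: (x.latitude, x.longitude) step: geopy's geocode returns None when it finds no match, so x.latitude would raise AttributeError on such rows. A minimal sketch of a None-safe extractor follows. The name to_coords is a hypothetical helper, not part of geopy or pandas, and the SimpleNamespace object only stands in for a geopy Location so the snippet runs without network access.

```python
from types import SimpleNamespace


def to_coords(location):
    """Return (latitude, longitude), or None when geocoding found no match."""
    if location is None:
        return None
    return (location.latitude, location.longitude)


# Stand-in for a geopy Location, used here only to avoid a network call.
fake_location = SimpleNamespace(latitude=38.8937154, longitude=-76.9877934586326)

print(to_coords(fake_location))
print(to_coords(None))
```

With the real geocoder this would slot into the one-liner as df['state_name'].apply(geolocator.geocode).apply(to_coords).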
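Edit
@leb makes an interesting point here: if you have many duplicate values, it is more efficient to geocode each unique value once and then add the results back: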
In [38]:
states = df['state_name'].unique()
d = dict(zip(states, pd.Series(states).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
d
Out[38]:
{'DC': (38.8937154, -76.9877934586326)}
In [40]:    
df['city_coord'] = df['state_name'].map(d)
df
Out[40]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
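So the above gets all the unique values using unique, builds a dict from them, and then calls map to perform the lookup and add the coords. This is more efficient than geocoding row by row, because the geocoder is called once per distinct value rather than once per row.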
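The saving can be made visible with a small pandas-free sketch of the same idea. The names geocode_unique and fake_geocode are hypothetical; fake_geocode stands in for geolocator.geocode and simply records how often it is called, so no network access is needed.

```python
from types import SimpleNamespace


def geocode_unique(values, geocode):
    """Geocode each distinct value once; return {value: (lat, lon) or None}."""
    coords = {}
    for value in values:
        if value in coords:
            continue  # already looked up, no second request
        location = geocode(value)
        coords[value] = (
            None if location is None else (location.latitude, location.longitude)
        )
    return coords


calls = []


def fake_geocode(query):
    """Hypothetical stand-in for geolocator.geocode that records each call."""
    calls.append(query)
    known = {"DC": SimpleNamespace(latitude=38.8937154, longitude=-76.9877934586326)}
    return known.get(query)  # None for anything it does not know


state_names = ["DC"] * 10 + ["GA"]
lookup = geocode_unique(state_names, fake_geocode)

print(lookup)
print(len(calls))  # one request per distinct value, not one per row
```

The resulting dict can be passed to map in the same way as d above, for example df['state_name'].map(geocode_unique(df['state_name'], geolocator.geocode)).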
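Upvote and accept @EdChum's answer; I just wanted to add to it. His approach works well, but from personal experience I would like to share a few things:
When geocoding, if you have many repeated city/state combinations, it is much faster to send only one of them to be geocoded and then copy the result to the other rows:
This is very helpful for large data and can be done in two ways: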
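1. Use drop_duplicates to drop the repeated rows and geocode only the ones that remain.
2. If you want to keep all your rows, groupby the city/state combination, apply geocoding to the first row of each group by calling head(1), then copy the result to the remaining rows.
The reason is that every call to Nominatim carries a small latency, even when you query the same city/state several times in a row. As your data grows, this small latency adds up, leading to very slow responses and possible time-outs.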
Again, this all comes from personal experience. If it does not benefit you now, keep it in mind for future use.
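For the time-outs mentioned above, one possible complement to deduplication is to retry a failed lookup after a short pause. The sketch below is an assumption-laden illustration, not something from the answers: with_retries and flaky_geocode are hypothetical names, and the built-in TimeoutError stands in for whatever exception the geocoder raises (with geopy that would likely be geopy.exc.GeocoderTimedOut). geopy also ships a RateLimiter helper in geopy.extra.rate_limiter that may be a better fit for real use.

```python
import time


def with_retries(func, retry_on, attempts=3, delay_seconds=1.0):
    """Wrap func so it is retried when it raises one of the retry_on exceptions."""

    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return func(*args, **kwargs)
            except retry_on:
                if attempt == attempts:
                    raise  # out of attempts, surface the error
                time.sleep(delay_seconds * attempt)  # wait a little longer each time

    return wrapper


attempt_log = []


def flaky_geocode(query):
    """Hypothetical stand-in that times out twice before answering."""
    attempt_log.append(query)
    if len(attempt_log) < 3:
        raise TimeoutError("simulated time-out")
    return (38.8937154, -76.9877934586326)


safe_geocode = with_retries(flaky_geocode, retry_on=TimeoutError, delay_seconds=0.01)
result = safe_geocode("DC")
print(result, len(attempt_log))
```

Wrapping the geocoder this way keeps the apply and map calls unchanged; only the function passed to them differs.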