Job*_*obs 1 python algorithm twitter python-twitter
我正在尝试返回一个字典,汇总最近的州中心的推文.我正在迭代所有推文,并且对于每条推文,我正在检查所有状态以查看哪个状态最接近.
什么是更好的方法来做到这一点?
def group_tweets_by_state(tweets):
"""
The keys of the returned dictionary are state names, and the values are
lists of tweets that appear closer to that state center than any other.
tweets -- a sequence of tweet abstract data types """
tweets_by_state = {}
for tweet in tweets:
position = tweet_location(tweet)
min, result_state = 100000, 'CA'
for state in us_states:
if geo_distance(position, find_state_center(us_states[state]))< min:
min = geo_distance(position, find_state_center(us_states[state]))
result_state = state
if result_state not in tweets_by_state:
tweets_by_state[result_state]= []
tweets_by_state[result_state].append(tweet)
else:
tweets_by_state[result_state].append(tweet)
return tweets_by_state
Run Code Online (Sandbox Code Playgroud)
当推文数量非常大时,那个庞大的for循环中的每一个小增强都会导致时间复杂度带来巨大的性能提升,我能想到的事情很少:
geo_distance()一次,特别是在费用昂贵时distance = geo_distance(position, find_state_center(us_states[state]))
if distance < min:
min = distance
Run Code Online (Sandbox Code Playgroud)
而不是
if geo_distance(position, find_state_center(us_states[state]))< min:
min = geo_distance(position, find_state_center(us_states[state]))
Run Code Online (Sandbox Code Playgroud)
position_closest_state = {} # save known result
tweets_by_state = {}
for tweet in tweets:
position = tweet_location(tweet)
min, result_state = 100000, 'CA'
if position in position_closest_state:
result_state = position_closest_state[position]
else:
for state in us_states:
distance = geo_distance(position, find_state_center(us_states[state]))
if distance < min:
min = distance
result_state = state
position_closest_state[position] = result_state
Run Code Online (Sandbox Code Playgroud)
所以,假设你有来自200个不同位置的1000条推文,并且us_states是50,你的原始算法会调用geo_distance()1000*50*2次,现在它可以减少到调用的200*50*1倍.
find_state_center()与#2类似,现在每个状态的每个推文都会冗余调用它.
state_center_dict = {}
for state in us_states:
state_center_dict[state] = find_state_center(us_states[state])
position_closest_state = {} # save known result
tweets_by_state = {}
for tweet in tweets:
position = tweet_location(tweet)
min, result_state = 100000, 'CA'
if position in position_closest_state:
result_state = position_closest_state[position]
else:
for state in us_states:
distance = geo_distance(position, state_center_dict[state])
if distance < min:
min = distance
result_state = state
position_closest_state[position] = result_state
Run Code Online (Sandbox Code Playgroud)
现在find_state_center()只调用50次(状态数)而不是50*1000(推文数),我们又做了一次巨大的改进!
通过#1,我们将性能提高了一倍.#2我们通过(推文数量/位置数)次数来增强它.#3是三者中最大的,与原始代码相比,我们将时间复杂度降低到仅1 /(推文数).
| 归档时间: |
|
| 查看次数: |
148 次 |
| 最近记录: |