Is there a faster way? (python twitter location)

Job*_*obs 1 python algorithm twitter python-twitter

I'm trying to return a dictionary that groups tweets by the nearest state center. I iterate over all the tweets, and for each tweet I check every state to see which one is closest.

What is a better way to do this?

def group_tweets_by_state(tweets):
    """

    The keys of the returned dictionary are state names, and the values are
    lists of tweets that appear closer to that state center than any other.

    tweets -- a sequence of tweet abstract data types """


    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        min, result_state = 100000, 'CA'
        for state in us_states:
            if geo_distance(position, find_state_center(us_states[state])) < min:
                min = geo_distance(position, find_state_center(us_states[state]))
                result_state = state
        if result_state not in tweets_by_state:
            tweets_by_state[result_state] = []
            tweets_by_state[result_state].append(tweet)
        else:
            tweets_by_state[result_state].append(tweet)
    return tweets_by_state

Pau*_* Lo 5

When the number of tweets is very large, every small improvement inside that big for loop brings a significant performance gain. A few things come to mind:

1. Call geo_distance() only once, especially if it is expensive

# compute the distance once and reuse it
distance = geo_distance(position, find_state_center(us_states[state]))
if distance < min:
    min = distance
    result_state = state

instead of

if geo_distance(position, find_state_center(us_states[state])) < min:
    min = geo_distance(position, find_state_center(us_states[state]))
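As an aside (not part of the original answer), the same "compute each distance once" idea can also be written with Python's built-in min() and a key function; this is only a sketch that reuses the us_states, find_state_center and geo_distance helpers from the question:

# Sketch: pick the closest state with min() and a key function,
# so each distance is computed exactly once per state.
def closest_state(position):
    return min(us_states,
               key=lambda state: geo_distance(position, find_state_center(us_states[state])))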

2. If positions repeat often, cache the result per position:

position_closest_state = {}  # cache: position -> closest state
tweets_by_state = {}
for tweet in tweets:
    position = tweet_location(tweet)
    min, result_state = 100000, 'CA'

    if position in position_closest_state:
        # reuse the previously computed answer for this position
        result_state = position_closest_state[position]
    else:
        for state in us_states:
            distance = geo_distance(position, find_state_center(us_states[state]))
            if distance < min:
                min = distance
                result_state = state
        position_closest_state[position] = result_state  # remember it

So, suppose you have 1000 tweets coming from 200 distinct positions and us_states has 50 entries: your original algorithm calls geo_distance() 1000 * 50 * 2 times, while this version reduces that to 200 * 50 * 1 calls.
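A quick sanity check of those counts (1000 tweets, 200 distinct positions and 50 states are just the example figures above):

# Call-count comparison for the example figures above.
n_tweets, n_positions, n_states = 1000, 200, 50
original_calls = n_tweets * n_states * 2   # geo_distance() called twice per tweet/state pair
cached_calls = n_positions * n_states      # once per distinct position/state pair
print(original_calls, cached_calls)        # 100000 10000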

3. Reduce the number of calls to find_state_center()

Similar to #2: it is currently called redundantly for every state on every tweet, even though the state centers never change, so precompute them once.

# precompute every state center once
state_center_dict = {}
for state in us_states:
    state_center_dict[state] = find_state_center(us_states[state])

position_closest_state = {}  # cache: position -> closest state
tweets_by_state = {}
for tweet in tweets:
    position = tweet_location(tweet)
    min, result_state = 100000, 'CA'

    if position in position_closest_state:
        result_state = position_closest_state[position]
    else:
        for state in us_states:
            distance = geo_distance(position, state_center_dict[state])
            if distance < min:
                min = distance
                result_state = state
        position_closest_state[position] = result_state  # remember it

Now find_state_center() is called only 50 times (the number of states) instead of 50 * 1000 (states times tweets), another big improvement!

Summary of the performance gains

With #1 we roughly double the performance. With #2 we improve it by a factor of (number of tweets / number of distinct positions). #3 is the biggest of the three: compared with the original code, it cuts that part of the work down to 1 / (number of tweets) of what it was.
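For reference, here is one way the three optimizations could be combined back into the original function. It is only a sketch: the name group_tweets_by_state_fast is made up, and it assumes the same us_states, tweet_location, find_state_center and geo_distance helpers from the question.

def group_tweets_by_state_fast(tweets):
    """Group tweets by the nearest state center, with all three optimizations applied."""
    # #3: compute each state center exactly once
    state_centers = {state: find_state_center(us_states[state]) for state in us_states}

    position_closest_state = {}  # #2: cache the closest state per distinct position
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        if position not in position_closest_state:
            min_distance, result_state = float('inf'), None
            for state in us_states:
                # #1: call geo_distance() once and reuse the result
                distance = geo_distance(position, state_centers[state])
                if distance < min_distance:
                    min_distance, result_state = distance, state
            position_closest_state[position] = result_state
        tweets_by_state.setdefault(position_closest_state[position], []).append(tweet)
    return tweets_by_state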