Efficiently finding consecutive streaks in a Pandas DataFrame column?

the*_*man 8 python numpy dataframe pandas

I have a DataFrame similar to the one below, and I want to add a Streak column to it (see the example below):

Date         Home_Team    Away_Team    Winner      Streak

2005-08-06       A            G           A           0
2005-08-06       B            H           H           0
2005-08-06       C            I           C           0
2005-08-06       D            J           J           0
2005-08-06       E            K           K           0
2005-08-06       F            L           F           0
2005-08-13       A            B           A           1           
2005-08-13       C            D           D           1           
2005-08-13       E            F           F           0        
2005-08-13       G            H           H           0
2005-08-13       I            J           J           0
2005-08-13       K            L           K           1
2005-08-20       B            C           B           0
2005-08-20       A            D           A           2
2005-08-20       G            K           K           0
2005-08-20       I            E           E           0
2005-08-20       F            H           F           2
2005-08-20       J            L           J           2
2005-08-27       A            H           A           3
2005-08-27       B            F           B           1
2005-08-27       J            C           C           3           
2005-08-27       D            E           D           0
2005-08-27       I            K           K           0
2005-08-27       L            G           G           0
2005-09-05       B            A           A           2
2005-09-05       D            C           D           1
2005-09-05       F            E           F           0
2005-09-05       H            G           H           0
2005-09-05       J            I           I           0
2005-09-05       K            L           K           4

The DataFrame has roughly 200,000 rows, covering 2005 to 2020.

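What I want is, for each row, the number of consecutive games the home team had won before the date in that row's Date column. I have a solution, but it is far too slow (see below):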

df["Streak"] = 0
def home_streak(x): # x is a row of the DataFrame
    """Keep track of a team's winstreak"""
    home_team = x["Home_Team"]
    date = x["Date"]
    
    # all previous matches for the home team 
    home_df = df[(df["Home_Team"] == home_team) | (df["Away_Team"] == home_team)]
    home_df = home_df[home_df["Date"] <  date].sort_values(by="Date", ascending=False).reset_index()
    if len(home_df.index) == 0: # no previous matches for that team, so start streak at 0
        return 0
    elif home_df.iloc[0]["Winner"] != home_team: # lost the last match
        return 0
    else: # they won the last game
        winners = home_df["Winner"]
        streak = 0
        for i in winners.index:
            if home_df.iloc[i]["Winner"] == home_team:
                streak += 1
            else: # they lost, return the streak
                return streak
        return streak # won every previous match, so the loop never returned

df["Streak"] = df.apply(lambda x: home_streak(x), axis = 1)

How can I speed this up?

Mad*_*ist 4

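I'm going to propose a numpy-based solution here. First, because I'm not very familiar with pandas and don't feel like doing the research, and second, because a numpy solution should work just fine regardless.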

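Let's start by looking at what happens for one given team. Your goal is to find the number of consecutive wins for a team, based on the sequence of games it took part in. For starters, I'll drop the date column and turn your data into a numpy array: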

import numpy as np

x = np.array([
    ['A', 'G', 'A'],
    ['B', 'H', 'H'],
    ['C', 'I', 'C'],
    ['D', 'J', 'J'],
    ['E', 'K', 'K'],
    ['F', 'L', 'F'],
    ['A', 'B', 'A'],
    ['C', 'D', 'D'],
    ['E', 'F', 'F'],
    ['G', 'H', 'H'],
    ['I', 'J', 'J'],
    ['K', 'L', 'K'],
    ['B', 'C', 'B'],
    ['A', 'D', 'A'],
    ['G', 'K', 'K'],
    ['I', 'E', 'E'],
    ['F', 'H', 'F'],
    ['J', 'L', 'J']])

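You don't need the dates, because you only care about who played, even if a team played several times in one day. This does rely on the rows already being in chronological order. So let's take a look at team A: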

A_played = np.flatnonzero((x[:, :2] == 'A').any(axis=1))
A_won = x[A_played, -1] == 'A'

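A_played is an index array holding the row numbers of x in which A played. A_won is a boolean mask with one element per entry of A_played, so len(A_played) elements; i.e., the number of games A took part in.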
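As a quick check of what these two arrays look like, here is a minimal sketch on the same 18 games. The compact `games` string is only a shorthand I'm introducing here to rebuild x; it is not part of the original data handling.

```python
import numpy as np

# Same 18 games as the array x above, each written as "home, away, winner".
games = "AGA BHH CIC DJJ EKK FLF ABA CDD EFF GHH IJJ KLK BCB ADA GKK IEE FHF JLJ".split()
x = np.array([list(g) for g in games])

# Row numbers of the games in which A was home or away.
A_played = np.flatnonzero((x[:, :2] == 'A').any(axis=1))
# For each of those games, whether A was the winner.
A_won = x[A_played, -1] == 'A'

print(A_played)  # [ 0  6 13]
print(A_won)     # [ True  True  True]
```

Because A_played holds row indices, its length is what gives the number of games A played.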

Finding the sizes of the streaks is a well-hashed-out problem:

streaks = np.diff(np.flatnonzero(np.diff(np.r_[False, A_won, False])))[::2]

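You compute the difference between each pair of indices where the mask value switches. The extra padding with False ensures that you know which way the mask is switching: even-numbered switches start a run of wins and odd-numbered ones end it. What you are looking for is based on this computation but needs a little more detail, since you want a cumulative sum that resets after each run. You can get that by setting the data value immediately after each run to the negated run length: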

wins = np.r_[0, A_won, 0]  # Notice the int dtype here
switch_indices = np.flatnonzero(np.diff(wins)) + 1
streaks = np.diff(switch_indices)[::2]
wins[switch_indices[1::2]] = -streaks

Now you have a trimmable array whose cumulative sum can be assigned directly to the output columns:

streak_counts = np.cumsum(wins[:-2])
output = np.zeros((x.shape[0], 2), dtype=int)

# Home streak
home_mask = x[A_played, 0] == 'A'
output[A_played[home_mask], 0] = streak_counts[home_mask]

# Away streak
away_mask = ~home_mask
output[A_played[away_mask], 1] = streak_counts[away_mask]
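Team A wins every game in the sample, so the reset never fires for it. To see the padding and the negated run length at work, here is a small trace on a made-up record; the `won` mask below is hypothetical and not taken from the data.

```python
import numpy as np

# Hypothetical results for one team, in playing order (True = win).
won = np.array([True, False, True, True, False, True])

# Length of each run of wins: pad with False, find the switches, pair them up.
run_lengths = np.diff(np.flatnonzero(np.diff(np.r_[False, won, False])))[::2]

# Reset trick: write the negated run length right after each run ends.
wins = np.r_[0, won, 0]  # int dtype because of the 0 padding
switch_indices = np.flatnonzero(np.diff(wins)) + 1
wins[switch_indices[1::2]] = -np.diff(switch_indices)[::2]

# Dropping the last two entries leaves the streak going INTO each game.
streak_before = np.cumsum(wins[:-2])

print(run_lengths)    # [1 2 1]
print(wins)           # [ 0  1 -1  1  1 -2  1 -1]
print(streak_before)  # [0 1 0 1 2 0]
```

The leading 0 shifts everything by one game, which is why the cumulative sum describes the streak before each game rather than after it.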

Now you can loop over all the teams (which should be a fairly small number compared to the total number of games):

def process_team(data, team, output):
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])

    home_mask = data[played, 0] == team
    away_mask = ~home_mask

    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[away_mask], 1] = streak_counts[away_mask]

output = np.zeros((x.shape[0], 2), dtype=int)

# Loop over every team that appears as home or away. np.unique(x[:, :2])
# copies the data, but it does not assume that every team has hosted a game
# (in the sample above, H and L never do).
for team in np.unique(x[:, :2]):
    process_team(x, team, output)
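One way to convince yourself the per-team loop is right is to compare it with a plain one-pass loop that keeps each team's current streak in a dict. The sketch below does that on the sample; `reference_streaks` and the compact `games` string are names I'm introducing only for this check, and the dict loop is a cross-check rather than part of the approach above.

```python
import numpy as np

# Same 18 games as the array x above, each written as "home, away, winner".
games = "AGA BHH CIC DJJ EKK FLF ABA CDD EFF GHH IJJ KLK BCB ADA GKK IEE FHF JLJ".split()
x = np.array([list(g) for g in games])

def process_team(data, team, output):
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])
    home_mask = data[played, 0] == team
    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[~home_mask], 1] = streak_counts[~home_mask]

output = np.zeros((x.shape[0], 2), dtype=int)
for team in np.unique(x[:, :2]):  # every team, home or away
    process_team(x, team, output)

def reference_streaks(data):
    """Hypothetical cross-check: one pass, dict of each team's current streak."""
    current = {}
    result = np.zeros((len(data), 2), dtype=int)
    for i, (home, away, winner) in enumerate(data):
        # Record the streaks going into this game, then update them.
        result[i] = current.get(home, 0), current.get(away, 0)
        for team in (home, away):
            current[team] = current.get(team, 0) + 1 if team == winner else 0
    return result

assert np.array_equal(output, reference_streaks(x))
print(output[:, 0])  # [0 0 0 0 0 0 1 1 0 0 0 1 0 2 0 0 2 2]
```

The home column matches the Streak values in the first 18 rows of the question's table. Assuming the DataFrame rows are sorted chronologically and x was built from them with something like df[["Home_Team", "Away_Team", "Winner"]].to_numpy(), the column could then be assigned with df["Streak"] = output[:, 0].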