the*_*man 8 python numpy dataframe pandas
I have a DataFrame similar to the one below, and I want to add a Streak column to it (see the example below):
Date Home_Team Away_Team Winner Streak
2005-08-06 A G A 0
2005-08-06 B H H 0
2005-08-06 C I C 0
2005-08-06 D J J 0
2005-08-06 E K K 0
2005-08-06 F L F 0
2005-08-13 A B A 1
2005-08-13 C D D 1
2005-08-13 E F F 0
2005-08-13 G H H 0
2005-08-13 I J J 0
2005-08-13 K L K 1
2005-08-20 B C B 0
2005-08-20 A D A 2
2005-08-20 G K K 0
2005-08-20 I E E 0
2005-08-20 F H F 2
2005-08-20 J L J 2
2005-08-27 A H A 3
2005-08-27 B F B 1
2005-08-27 J C C 3
2005-08-27 D E D 0
2005-08-27 I K K 0
2005-08-27 L G G 0
2005-09-05 B A A 2
2005-09-05 D C D 1
2005-09-05 F E F 0
2005-09-05 H G H 0
2005-09-05 J I I 0
2005-09-05 K L K 4
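The DataFrame covers 2005 to 2020 and has roughly 200k rows.
What I want is, for each row, the number of consecutive matches the home team had won before the date in that row's Date column. I have a solution, but it is too slow, see below: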
df["Streak"] = 0

def home_streak(x):  # x is a row of the DataFrame
    """Keep track of a team's winstreak"""
    home_team = x["Home_Team"]
    date = x["Date"]

    # all previous matches for the home team
    home_df = df[(df["Home_Team"] == home_team) | (df["Away_Team"] == home_team)]
    home_df = home_df[home_df["Date"] < date].sort_values(by="Date", ascending=False).reset_index()
    if len(home_df.index) == 0:  # no previous matches for that team, so start streak at 0
        return 0
    elif home_df.iloc[0]["Winner"] != home_team:  # lost the last match
        return 0
    else:  # they won the last game
        winners = home_df["Winner"]
        streak = 0
        for i in winners.index:
            if home_df.iloc[i]["Winner"] == home_team:
                streak += 1
            else:  # they lost, return the streak
                return streak
        return streak  # never lost: every previous match counts toward the streak

df["Streak"] = df.apply(lambda x: home_streak(x), axis = 1)
How can I speed this up?
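I'm going to present a numpy-based solution here. Firstly because I'm not very familiar with pandas and don't feel like doing the research, and secondly because a numpy solution should work just fine regardless.
Let's start by looking at what happens for one given team. Your goal is to find the number of consecutive wins for a team, based on the sequence of games it took part in. For starters, I'll drop the date column and turn your data into a numpy array: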
import numpy as np

# Columns: home team, away team, winner
x = np.array([
    ['A', 'G', 'A'],
    ['B', 'H', 'H'],
    ['C', 'I', 'C'],
    ['D', 'J', 'J'],
    ['E', 'K', 'K'],
    ['F', 'L', 'F'],
    ['A', 'B', 'A'],
    ['C', 'D', 'D'],
    ['E', 'F', 'F'],
    ['G', 'H', 'H'],
    ['I', 'J', 'J'],
    ['K', 'L', 'K'],
    ['B', 'C', 'B'],
    ['A', 'D', 'A'],
    ['G', 'K', 'K'],
    ['I', 'E', 'E'],
    ['F', 'H', 'F'],
    ['J', 'L', 'J']])
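You don't need the date, because all you care about is who played and in what order, even if a team played multiple times in one day. This does assume the rows are already in chronological order. So let's take a look at team A: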
A_played = np.flatnonzero((x[:, :2] == 'A').any(axis=1))
A_won = x[A_played, -1] == 'A'
A_played is an index array holding the row numbers in x of the games A took part in. A_won is a boolean mask with len(A_played) elements, i.e. one per game A participated in.
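As a quick sanity check, here is a small sketch of what those two arrays look like for the 18 sample rows. The compact `games` string is just my shorthand for rebuilding the same `x` array; it is not part of the original data format.

```python
import numpy as np

# The 18 sample matches as (home, away, winner) triples, one per token.
games = "AGA BHH CIC DJJ EKK FLF ABA CDD EFF GHH IJJ KLK BCB ADA GKK IEE FHF JLJ"
x = np.array([list(g) for g in games.split()])

# Row numbers of every game in which A was the home or the away team.
A_played = np.flatnonzero((x[:, :2] == 'A').any(axis=1))
# For each of those games, whether A was the winner.
A_won = x[A_played, -1] == 'A'

print(A_played.tolist())  # [0, 6, 13]
print(A_won.tolist())     # [True, True, True]
```

Since A_played contains the row number 0, use its length rather than a count of its nonzero entries when you want the number of games.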
Finding the sizes of the streaks is a fairly well-hashed-out problem:
streaks = np.diff(np.flatnonzero(np.diff(np.r_[False, A_won, False])))[::2]
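To see what that one-liner produces, here is the same computation split into steps on a made-up record (win, loss, win, win, loss). The `won` mask and the intermediate names are mine, purely for illustration.

```python
import numpy as np

# Hypothetical record for one team: W, L, W, W, L
won = np.array([True, False, True, True, False])

# Pad with False on both ends so every run of wins has a start and an end.
padded = np.r_[False, won, False]
# On a boolean array, np.diff flags the positions where the value flips.
flips = np.flatnonzero(np.diff(padded))
# Flips alternate start, end, start, end, so every other gap is a run of wins.
streaks = np.diff(flips)[::2]

print(flips.tolist())    # [0, 1, 2, 4]
print(streaks.tolist())  # [1, 2]
```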
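You compute the difference between each pair of indices where the value of the mask switches. The extra padding with False ensures that you know which way the mask is switching. What you're looking for is based on this computation, but needs a little more detail: you want the cumulative sum, but reset after each run. You can get that by setting the element right after each run to the negated run length: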
wins = np.r_[0, A_won, 0] # Notice the int dtype here
switch_indices = np.flatnonzero(np.diff(wins)) + 1
streaks = np.diff(switch_indices)[::2]
wins[switch_indices[1::2]] = -streaks
Now you have a trimmable array whose cumulative sum can be assigned to the output columns directly:
streak_counts = np.cumsum(wins[:-2])
output = np.zeros((x.shape[0], 2), dtype=int)
# Home streak
home_mask = x[A_played, 0] == 'A'
output[A_played[home_mask], 0] = streak_counts[home_mask]
# Away streak
away_mask = ~home_mask
output[A_played[away_mask], 1] = streak_counts[away_mask]
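If you want to convince yourself that the trimmed cumulative sum really is the streak going into each game, a sketch like the following compares it against a plain loop. Both function names are made up for this check, and the test record is hypothetical.

```python
import numpy as np

def streak_before_each_game(won):
    """Vectorized: current win streak going into each game."""
    wins = np.r_[0, won, 0]                        # int array, padded on both ends
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]         # length of each run of wins
    wins[switch_indices[1::2]] = -streaks          # cancel each run right after it ends
    return np.cumsum(wins[:-2])                    # drop the last game and the end pad

def streak_before_each_game_loop(won):
    """Reference implementation with an ordinary Python loop."""
    out, streak = [], 0
    for w in won:
        out.append(streak)                         # streak before this game
        streak = streak + 1 if w else 0            # update after the result
    return out

won = np.array([True, False, True, True, False])
print(streak_before_each_game(won).tolist())  # [0, 1, 0, 1, 2]
print(streak_before_each_game_loop(won))      # [0, 1, 0, 1, 2]
```

The leading zero in the padding is what shifts everything by one game, so each entry counts only the wins strictly before that game.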
Now you can loop over all the teams (which should be a fairly small number compared to the total number of games):
def process_team(data, team, output):
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])

    home_mask = data[played, 0] == team
    away_mask = ~home_mask

    # Column 0 holds the home team's streak, column 1 the away team's
    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[away_mask], 1] = streak_counts[away_mask]

output = np.zeros((x.shape[0], 2), dtype=int)

# np.unique(x[:, :2]) copies the data, but it also covers teams that have only
# played away (H and L in the sample above), which set(x[:, 0]) would miss.
for team in np.unique(x[:, :2]):
    process_team(x, team, output)
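To get this back into your DataFrame, something along these lines should work. It is a sketch, not a tested drop-in: it assumes the column names from the question, rebuilds only the first three match days as sample data, and adds an `Expected` column (my name) holding the Streak values shown in the question so the result can be checked.

```python
import io
import numpy as np
import pandas as pd

def process_team(data, team, output):
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])
    home_mask = data[played, 0] == team
    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[~home_mask], 1] = streak_counts[~home_mask]

# First three match days from the question; Expected is the Streak shown there.
raw = """Date Home_Team Away_Team Winner Expected
2005-08-06 A G A 0
2005-08-06 B H H 0
2005-08-06 C I C 0
2005-08-06 D J J 0
2005-08-06 E K K 0
2005-08-06 F L F 0
2005-08-13 A B A 1
2005-08-13 C D D 1
2005-08-13 E F F 0
2005-08-13 G H H 0
2005-08-13 I J J 0
2005-08-13 K L K 1
2005-08-20 B C B 0
2005-08-20 A D A 2
2005-08-20 G K K 0
2005-08-20 I E E 0
2005-08-20 F H F 2
2005-08-20 J L J 2
"""
df = pd.read_csv(io.StringIO(raw), sep=" ", parse_dates=["Date"])

# The streak logic depends on row order, so sort chronologically first.
# A stable sort keeps the existing order of same-day matches.
df = df.sort_values("Date", kind="mergesort").reset_index(drop=True)

data = df[["Home_Team", "Away_Team", "Winner"]].to_numpy()
output = np.zeros((len(df), 2), dtype=int)
for team in np.unique(data[:, :2]):
    process_team(data, team, output)

df["Streak"] = output[:, 0]  # column 0 is the home team's streak
print((df["Streak"] == df["Expected"]).all())  # True
```

The away team's streak is sitting in output[:, 1] if you ever want that as a column too.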