Let*_*t4U 6 python pandas pandas-groupby
I have a dataframe with 100K+ rows that looks like this:
user document
0 john book
1 jane article
2 jane book
3 jane book
4 jim article
5 john book
6 jim blogpost
7 jane blogpost
8 jane blogpost
9 jane blogpost
I need a dataframe like this:
blogpost article book
john 1 3 0
jane 0 0 1
jim 4 0 2
That is, I need the number of downloads for each (user, document) combination.
I am calling .groupby(['user', 'document']) and then using df.loc to set the download counts:
df = pd.DataFrame(index=users, columns=documents)
df.fillna(0, inplace=True)
grouped = records.groupby(['user', 'document'])
for elem in grouped:
    user, document = elem[0]
    downloads = len(elem[1])
    df.loc[user, document] = downloads
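However, df.loc is very slow when used this way. I commented out the df.loc line and the loop finished quickly, so I am almost certain that the df.loc access is the slow part.
How can I get this result faster?
Minimal working example: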
import pandas as pd

records = pd.DataFrame([
    ('john', 'book'),
    ('jane', 'article'),
    ('jane', 'book'),
    ('jane', 'book'),
    ('jim', 'article'),
    ('john', 'book'),
    ('jim', 'blogpost'),
    ('jane', 'blogpost'),
    ('jane', 'blogpost'),
    ('jane', 'blogpost')
], columns=['user', 'document'])
print(records)
users = list(set(records['user']))
users.sort()
documents = list(set(records['document']))
documents.sort()
print(users)
print(documents)
df = pd.DataFrame(index=users, columns=documents)
df.fillna(0, inplace=True)
print(df)
grouped = records.groupby(['user', 'document'])
for elem in grouped:
    user, document = elem[0]
    downloads = len(elem[1])
    df.loc[user, document] = downloads
There are many ways to do this without a loop: pivot, pivot_table, crosstab, groupby count. For example, with crosstab:
pd.crosstab(records['user'], records['document'])
Out[1283]:
document article blogpost book
user
jane 1 3 2
jim 1 1 0
john 0 0 2
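The answer names pivot_table and groupby as alternatives but only shows crosstab. Below is a minimal sketch of those two routes, assuming the raw rows live in the question's `records` frame; the variable names `by_groupby` and `by_pivot` are made up for illustration.

```python
import pandas as pd

# Same sample data as the question's minimal working example.
records = pd.DataFrame(
    [
        ("john", "book"), ("jane", "article"), ("jane", "book"),
        ("jane", "book"), ("jim", "article"), ("john", "book"),
        ("jim", "blogpost"), ("jane", "blogpost"), ("jane", "blogpost"),
        ("jane", "blogpost"),
    ],
    columns=["user", "document"],
)

# groupby + size counts the rows for each (user, document) pair;
# unstack moves 'document' into the columns and fills absent pairs with 0.
by_groupby = records.groupby(["user", "document"]).size().unstack(fill_value=0)

# pivot_table with aggfunc="size" counts rows per cell in a single call.
by_pivot = records.pivot_table(
    index="user", columns="document", aggfunc="size", fill_value=0
)

print(by_groupby)
print(by_pivot)
```

Both should give the same counts as the crosstab call, with users as the index and document types as the columns. Which one is fastest on 100K+ rows is worth timing on the real data rather than assuming.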