How to count values across two columns of a DataFrame

han*_*bop 0 python pandas

I have a DataFrame and want to count how many times each name appears across two columns:

import pandas as pd

data = pd.DataFrame({'TEAM 1': ['Mark', 'Peter', 'Andy', 'Tony'],
                     'Team 2': ['Andy', 'Tony', 'Jhon', 'Peter']})

So the name Andy would count as 2, while Jhon would count as 1.
Expected output:

Mark 1
Andy 2
Tony 2
Jhon 1
Peter 2

I tried this code, but it doesn't work:

data.groupby('TEAM 1')['Team 2'].count()


Cyt*_*rak 8

Use stack and value_counts:

>>> data.stack().value_counts()
Andy     2
Tony     2
Peter    2
Jhon     1
Mark     1
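For anyone who wants to run this outside an interactive session, here is a minimal self-contained sketch of the same approach. The intermediate name stacked is only there for illustration; it is not part of the original answer.

```python
import pandas as pd

data = pd.DataFrame({
    'TEAM 1': ['Mark', 'Peter', 'Andy', 'Tony'],
    'Team 2': ['Andy', 'Tony', 'Jhon', 'Peter'],
})

# stack() moves the column labels into the index, producing one long
# Series (indexed by row and column) that holds all 8 names
stacked = data.stack()

# value_counts() tallies each distinct name, most frequent first
counts = stacked.value_counts()
print(counts)
```

The column names do not matter here: stack simply pools every cell into one Series, so the same two calls work for any number of columns.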

As pointed out in Ch3steR's comment, converting the df to a numpy.array and flattening it with ravel before calling pd.value_counts gives a result roughly 2x faster:

>>> pd.value_counts(data.to_numpy().ravel())
Andy     2
Tony     2
Peter    2
Jhon     1
Mark     1
dtype: int64
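One caveat, as far as I am aware: the top-level pd.value_counts function is deprecated in recent pandas releases (2.1 and later) in favour of the Series method. A sketch of the equivalent call, assuming the same data as in the question (the name flat is just illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'TEAM 1': ['Mark', 'Peter', 'Andy', 'Tony'],
    'Team 2': ['Andy', 'Tony', 'Jhon', 'Peter'],
})

# to_numpy() gives a 4x2 object array; ravel() flattens it to 8 names
flat = data.to_numpy().ravel()

# Wrap the flat array in a Series and call the value_counts method
counts = pd.Series(flat).value_counts()
print(counts)
```

I have not re-run the timings with this spelling, so the benchmark figures below apply to the calls exactly as written there.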

Benchmark:

>>> data = pd.concat([data] * 1000000)   # 4_000_000 rows

>>> %timeit data.stack().value_counts()
1.21 s ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.value_counts(data.to_numpy().ravel())
667 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Update: as anky's comment showed, collections.Counter is faster still:

>>> %timeit pd.Series(Counter(np.ravel(data)))
501 ms ± 4.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
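The timing line above leaves out its imports, so here is a self-contained sketch of the Counter variant on the original four-row frame. The final sort_values call is my addition, not part of the timed expression: Counter keeps names in first-seen order, so sorting is only needed if you want the most frequent names first.

```python
from collections import Counter

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'TEAM 1': ['Mark', 'Peter', 'Andy', 'Tony'],
    'Team 2': ['Andy', 'Tony', 'Jhon', 'Peter'],
})

# np.ravel flattens the frame's values into a 1-D array of names;
# Counter then builds a {name: count} mapping in a single pass
counts = pd.Series(Counter(np.ravel(data)))

# Optional: order by frequency, as value_counts would
counts = counts.sort_values(ascending=False)
print(counts)
```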