Kin*_*ngz 10 python group-by dataframe pandas pandas-groupby
How can I get all existing sets of duplicate records (based on a column) from a dataframe?
I have a dataframe as follows:
flight_id | from_location | to_location | schedule |
1 | Vancouver | Toronto | 3-Jan |
2 | Amsterdam | Tokyo | 15-Feb |
4 | Fairbanks | Glasgow | 12-Jan |
9 | Halmstad | Athens | 21-Jan |
3 | Brisbane | Lisbon | 4-Feb |
4 | Johannesburg | Venice | 12-Jan |
9 | LosAngeles | Perth | 3-Mar |
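Here flight_id is the column I need to check for duplicates, and there are two sets of duplicates.
The output for this specific example should be [(2,5),(3,6)]: a list of tuples of record index values.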
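To try the answers below, the example frame can be rebuilt roughly like this. It is a minimal sketch that assumes a default RangeIndex (0 through 6), which is what the expected output [(2,5),(3,6)] implies; the names `dup_mask` and `dup_rows` are only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question. The default RangeIndex (0..6)
# is an assumption inferred from the expected output.
df = pd.DataFrame({
    "flight_id": [1, 2, 4, 9, 3, 4, 9],
    "from_location": ["Vancouver", "Amsterdam", "Fairbanks", "Halmstad",
                      "Brisbane", "Johannesburg", "LosAngeles"],
    "to_location": ["Toronto", "Tokyo", "Glasgow", "Athens",
                    "Lisbon", "Venice", "Perth"],
    "schedule": ["3-Jan", "15-Feb", "12-Jan", "21-Jan",
                 "4-Feb", "12-Jan", "3-Mar"],
})

# Boolean mask marking every row whose flight_id occurs more than once.
dup_mask = df["flight_id"].duplicated(keep=False)

# Index labels of all rows that belong to some duplicate set.
dup_rows = df.index[dup_mask].tolist()
print(dup_rows)
```

With keep=False every member of a duplicate set is flagged, not just the later occurrences, which is why the answers below start from that mask.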
Is this what you need? duplicated + groupby
(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple)
Out[510]:
flight_id
4 (2, 5)
9 (3, 6)
Name: index, dtype: object
Add tolist at the end
(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple).tolist()
Out[511]: [(2, 5), (3, 6)]
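The same idea can be wrapped in a small helper. This is only a sketch: the function name `duplicate_index_groups` is hypothetical, and it groups the index directly instead of calling reset_index, so it does not depend on the reset column being named 'index'.

```python
import pandas as pd

def duplicate_index_groups(frame, column):
    """Return a list of index tuples, one per duplicated value in `column`."""
    # Keep only rows whose value in `column` occurs more than once.
    dupes = frame.loc[frame[column].duplicated(keep=False)]
    # Group the remaining index labels by the column's values.
    grouped = dupes.index.to_series().groupby(dupes[column]).apply(tuple)
    return grouped.tolist()

df = pd.DataFrame({"flight_id": [1, 2, 4, 9, 3, 4, 9]})
result = duplicate_index_groups(df, "flight_id")
print(result)
```

reset_index only names the new column 'index' when the index has no name of its own, so grouping the index as a Series is a little more robust for frames with a named index.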
And another solution... just for fun
s=df['flight_id'].value_counts()
list(map(lambda x : tuple(df[df['flight_id']==x].index.tolist()), s[s.gt(1)].index))
Out[519]: [(2, 5), (3, 6)]
Using apply and a lambda
df.groupby('flight_id').apply(
    lambda d: tuple(d.index) if len(d.index) > 1 else None
).dropna()
flight_id
4 (2, 5)
9 (3, 6)
dtype: object
Or better, by iterating over the groupby object
{k: tuple(d.index) for k, d in df.groupby('flight_id') if len(d) > 1}
{4: (2, 5), 9: (3, 6)}
Just the tuples
[tuple(d.index) for k, d in df.groupby('flight_id') if len(d) > 1]
[(2, 5), (3, 6)]
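A related sketch, not part of the answer above: the groupby object's `groups` attribute maps each key to the index labels of that group, so the same filter can be written without building each sub-frame. The variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"flight_id": [1, 2, 4, 9, 3, 4, 9]})

# .groups maps each flight_id to the Index of row labels in that group.
groups = df.groupby("flight_id").groups

# Keep only the keys that have more than one row label.
dupes = {key: tuple(labels) for key, labels in groups.items() if len(labels) > 1}
print(dupes)
```

Note that `groups` holds index labels, whereas the sibling attribute `indices` holds integer positions; the two coincide only for a default RangeIndex like the one in this example.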
Left for posterity
But I really don't like this approach anymore. It's just too ugly.
I was messing around with itertools.groupby
Others might find this interesting
from itertools import groupby
key = df.flight_id.get
s = sorted(df.index, key=key)
dict(filter(
    lambda t: len(t[1]) > 1,
    ((k, tuple(g)) for k, g in groupby(s, key))
))
{4: (2, 5), 9: (3, 6)}
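The sort in that snippet is not optional: itertools.groupby only merges adjacent items that share a key. Here is a small sketch on plain lists (no pandas), with values copied from the flight_id column and illustrative variable names:

```python
from itertools import groupby

flight_ids = [1, 2, 4, 9, 3, 4, 9]   # same values as the flight_id column
positions = range(len(flight_ids))   # stands in for df.index
key = flight_ids.__getitem__         # position -> flight_id

# Unsorted: equal keys are never adjacent, so every group has one member.
unsorted_groups = [(k, tuple(g)) for k, g in groupby(positions, key)]

# Sorted by key first: equal keys become adjacent and are merged.
sorted_groups = [(k, tuple(g)) for k, g in groupby(sorted(positions, key=key), key)]

dupes = {k: g for k, g in sorted_groups if len(g) > 1}
print(dupes)
```

Because Python's sort is stable, positions within each group stay in their original order.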
Performing a groupby on df.index can take you places.
v = df.index.to_series().groupby(df.flight_id).apply(pd.Series.tolist)
v[v.str.len().gt(1)]
flight_id
4 [2, 5]
9 [3, 6]
dtype: object
You can also get cute with a groupby on df.index directly.
v = pd.Series(df.index.groupby(df.flight_id))
v[v.str.len().gt(1)].to_dict()
{
"4": [
2,
5
],
"9": [
3,
6
]
}