How do I remove values that occur rarely (that is, with low frequency) from a column of a pandas.DataFrame? Example:
In [4]: df[col_1].value_counts()
Out[4]: 0 189096
1 110500
2 77218
3 61372
...
2065 1
2067 1
1569 1
dtype: int64
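So my question is: how do I remove values like 2065, 2067, 1569 and others like them? And how can I do this for every column whose .value_counts() looks like this?
UPDATE: By "low" I mean values like 2065. This value occurs in col_1 1 (one) time, and I want to remove such values.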
the*_*cus 24
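I see two ways you might want to do this.
For the entire DataFrame
This method removes values that occur infrequently across the entire DataFrame. It needs no loops, only built-in functions, which speeds things up.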
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
                  columns = ['A', 'B'])
threshold = 10 # Anything that occurs this many times or fewer will be removed.
value_counts = df.stack().value_counts() # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
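Because the random data above changes on every run, a small fixed example may make the effect easier to see. The following is only a sketch of the same idea, assuming that masking with `DataFrame.where` is acceptable in place of `replace`; the helper name `mask_rare_values` and the sample frame `small` are made up for illustration.

```python
import pandas as pd

def mask_rare_values(df, threshold):
    """Return a copy of df in which values occurring `threshold` times
    or fewer across the whole frame are set to NaN (hypothetical helper)."""
    # Count every value over all columns at once.
    counts = df.stack().value_counts()
    # Values that occur more than `threshold` times are kept.
    frequent = counts[counts > threshold].index.tolist()
    # Cells whose value is not in `frequent` become NaN.
    return df.where(df.isin(frequent))

small = pd.DataFrame({"A": [0, 0, 1, 2], "B": [0, 1, 1, 3]})
masked = mask_rare_values(small, threshold=1)
print(masked)
```

Here 0 and 1 each occur three times across both columns and survive, while 2 and 3 occur once each and are masked. Note that counts are pooled over all columns, so a value that is rare in one column can still be kept if it is common elsewhere.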
Column by column
This method removes entries that occur infrequently within each column.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
                  columns = ['A', 'B'])
threshold = 10 # Anything that occurs this many times or fewer will be removed.
for col in df.columns:
    value_counts = df[col].value_counts() # Specific column
    to_remove = value_counts[value_counts <= threshold].index
    # Assign the result back: replace(..., inplace=True) on df[col] may act on a temporary copy in recent pandas.
    df[col] = df[col].replace(to_remove, np.nan)
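The per-column version can also be written without an explicit loop. This is a rough sketch rather than the answer's own method: it assumes that mapping each cell to its within-column count via `Series.map` and then masking with `DataFrame.where` suits your data. The helper name `mask_rare_per_column` and the frame `example` are hypothetical.

```python
import pandas as pd

def mask_rare_per_column(df, threshold):
    """Return a copy of df in which, column by column, values occurring
    `threshold` times or fewer are set to NaN (hypothetical helper)."""
    # For each column, look up how often each cell's value occurs in that column.
    counts = df.apply(lambda s: s.map(s.value_counts()))
    # Keep cells whose value occurs more than `threshold` times.
    return df.where(counts > threshold)

example = pd.DataFrame({"A": [1, 1, 2, 3], "B": [5, 6, 6, 6]})
result = mask_rare_per_column(example, threshold=1)
print(result)
```

In column A the values 2 and 3 occur once and are masked; in column B only 5 is masked. Each column is counted independently, unlike the whole-DataFrame variant.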
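If only one column has a value below your threshold, you probably don't want to drop the whole row from the DataFrame, so I just remove those data points and replace them with None.
I loop over the columns and run value_counts on each one. Then I get the index values of every item that occurs at or below the target threshold. Finally, I use .loc to locate those values in the column and replace them with None.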
df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'],
                   'B': ['a', 'a', 'b', 'c', 'c'],
                   'C': ['a', 'a', 'b', 'b', 'c']})
>>> df
A B C
0 a a a
1 b a a
2 b b b
3 c c b
4 c c c
threshold = 1 # Remove items that occur threshold times or fewer
for col in df:
    vc = df[col].value_counts()
    vals_to_remove = vc[vc <= threshold].index.values
    # Index through df.loc directly so the assignment modifies df rather than a temporary copy
    df.loc[df[col].isin(vals_to_remove), col] = None
>>> df
A B C
0 None a a
1 b a a
2 b None b
3 c c b
4 c c None
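If dropping whole rows is acceptable after all, for instance when only a single column matters, one possible approach is to filter on the per-row count of that column. This is a minimal sketch under that assumption; the names `counts_a` and `filtered` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'],
                   'B': ['a', 'a', 'b', 'c', 'c']})
threshold = 1

# For every row, how many times its value in column A occurs in that column.
counts_a = df['A'].map(df['A'].value_counts())

# Keep only rows whose A value occurs more than `threshold` times.
filtered = df[counts_a > threshold]
print(filtered)
```

The row holding the single 'a' in column A is dropped together with its value in column B, which is exactly the side effect the None-replacement approach avoids.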