Emi*_*yev 5 python dataframe pandas
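I'm trying to convert pseudocode into pandas code. Any help or guidance would be appreciated.
The overall idea is to write a function, f, that selects rows from a toy sample dataset with 100 rows and 5 columns, ["X", "Y", "Z", "F", "V"], filled with random integers in [0, 500). Besides the data, the function takes a second input, cols_to_use: the columns that should be used in the selection. By default, all columns are used.
Description. The goal is to select 10 rows from the sample dataset. There are 5 probabilities for the second argument of the function -> selection based on [1, 2, 3, 4, 5] columns.
If all columns must be used, we select 2 rows per column: the rows corresponding to the top 2 values of each column. Rows may overlap during this initial selection. Call this an overlap1 event. If an overlap1 event occurs, we randomly choose one column that keeps the overlapping row, and for the other columns we add their third row. During this step, newly selected rows may again overlap with rows that are already selected -> call this an overlap2 event. If overlap2 occurs, move on to that column's 4th row, 5th row, and so on. On average there is a 0.25 probability of at least one overlap during the initial selection, so it is important to account for it. The final selection must contain 10 unique rows.
If 4 columns are the basis for selection, we select the rows corresponding to the top 2 values of each column and resolve overlap1 events. But we still need to select 2 more rows. So we randomly draw 2 of these 4 columns and select for each the additional row corresponding to its third-highest value -> or its fourth when overlap2 occurs, and so on.
If there are 3 columns, select 3 rows per column following the rules above + resolve overlap1 (if any), then randomly choose one column that receives the remaining 1 selection + resolve overlap2 events.
When 2 columns must be used, select 5 rows per column + handle overlap1 and overlap2 events.
When only 1 column must be used, select the 10 rows corresponding to that column's 10 highest values.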
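The per-column counts in the cases above all follow from dividing the 10 target rows by the number of columns: the quotient is the number of rows every column contributes, and the remainder is how many randomly chosen columns contribute one extra row. A minimal sketch, where `rows_per_column` is a hypothetical helper name that is not part of the original code:

```python
def rows_per_column(n_cols, n_rows=10):
    """Return (rows taken from every column, number of columns that give one extra row)."""
    base, extra = divmod(n_rows, n_cols)
    return base, extra


for n_cols in range(5, 0, -1):
    base, extra = rows_per_column(n_cols)
    print(f"{n_cols} column(s): {base} rows each, {extra} extra row(s)")
```

For 4 columns this gives 2 rows each plus 2 extra rows, and for 3 columns 3 rows each plus 1 extra row, matching the description.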
import random

import numpy as np
import pandas as pd

# sample dataset to work with
sample = pd.DataFrame(np.random.randint(0, 500, size=(100, 5)))
sample.columns = "X Y Z F V".split()

# the function I have written so far
def f(df, cols_to_use=["X", "Y", "Z", "F", "V"]):
    # number of rows initially taken from each column
    how_many_per_feature = {
        5: 2,
        4: 2,
        3: 3,
        2: 5,
        1: 10,
    }
    n_per_group = how_many_per_feature[len(cols_to_use)]

    # columns to randomly choose when adding extra options
    # could not find a proper way to implement this
    if len(cols_to_use) == 4:
        randomly_selected_columns = random.sample(cols_to_use, 2)
    elif len(cols_to_use) == 3:
        randomly_selected_columns = random.sample(cols_to_use, 1)

    # first I filter the dataframe on columns I need
    filtered_df = df[cols_to_use]

    # using pandas melt to select top n_per_group values per column
    result = (
        filtered_df.melt(var_name="feature", ignore_index=False)
        .groupby("feature")["value"]
        .nlargest(n_per_group)
    )

    # here supposed to handle overlap1 events
    # here overlap2 events

    # level_1 holds the original row labels after reset_index
    index = result.reset_index().level_1.values
    return df.loc[index, :]
I could not implement the dynamic selection based on overlap event handling.
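The overlap rate quoted in the description can be checked empirically. The sketch below regenerates the toy dataset many times and counts how often the top-2 rows of the five columns share at least one row. The names `has_overlap` and `estimate_overlap_rate`, the trial count, and the seed are assumptions made for illustration. Assuming independent columns and no ties, a direct count gives roughly 0.34, so the simulated value may come out somewhat above 0.25.

```python
import numpy as np
import pandas as pd


def has_overlap(df, n_top=2):
    """Return True if any row is among the top n_top values of more than one column."""
    picked = [label for col in df.columns for label in df[col].nlargest(n_top).index]
    return len(set(picked)) < len(picked)


def estimate_overlap_rate(n_trials=1000, seed=0):
    """Estimate the fraction of random 100x5 datasets with at least one overlap."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        df = pd.DataFrame(rng.integers(0, 500, size=(100, 5)), columns=list("XYZFV"))
        hits += has_overlap(df)
    return hits / n_trials


rate = estimate_overlap_rate()
print(f"estimated probability of at least one overlap: {rate:.2f}")
```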
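Maybe something like this? It's not the cleanest, but I think it gets the job done.
I'm not clear on what you mean by "There are 5 probabilities for the second argument of the function -> selection based on [1, 2, 3, 4, 5] columns", though.
The idea here is to separate the columns and sort each one individually. We can then select and drop indices for each column as needed. Overlaps are detected by checking for common indices between every possible pair of columns, and we keep iterating until all overlaps are resolved (i.e. this also resolves overlap3, overlap4, etc. events).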
from itertools import combinations
from functools import reduce

import numpy as np


def f(df, cols_to_use=None, n_rows=10):
    df = df.copy()  # Ensure we don't modify the original dataframe
    cols_to_use = cols_to_use or ["X", "Y", "Z", "F", "V"]
    n_rows_per_col = n_rows // len(cols_to_use)
    additional_rows = n_rows % len(cols_to_use)
    top = dict()      # remaining candidates per column, best first
    indices = dict()  # row labels currently selected per column

    # Sort and create initial indices set
    for col in cols_to_use:
        top[col] = df[col].sort_values(ascending=False)
        indices[col] = top[col].iloc[:n_rows_per_col].index
        top[col] = top[col].drop(index=indices[col])  # Remove added indices

    # Ensure we have exactly n_rows indices
    if additional_rows:
        # Select additional_rows distinct columns from our cols_to_use
        more_cols = np.random.choice(cols_to_use, size=additional_rows, replace=False)
        for c in more_cols:
            indices[c] = indices[c].union(top[c].iloc[[0]].index)
            top[c] = top[c].drop(index=indices[c], errors="ignore")

    # Resolve overlap events as needed
    combs = list(combinations(cols_to_use, r=2))
    overlap = True
    while overlap:
        overlap = False
        for col1, col2 in combs:
            intersect = indices[col1].intersection(indices[col2])
            if intersect.shape[0]:  # Overlap between col1 and col2
                overlap = True
                # Consider all columns that contain any of the overlapping rows
                cols = [c for c in cols_to_use if intersect.isin(indices[c]).any()]
                # Choose which column gives up the overlap, then refill it
                # with the next-best entries from that column
                c = np.random.choice(cols)
                shared = indices[c].intersection(intersect)  # rows c actually holds
                indices[c] = indices[c].difference(shared).union(top[c].iloc[:len(shared)].index)
                top[c] = top[c].drop(index=indices[c], errors="ignore")

    return df[cols_to_use].loc[reduce(lambda x, y: x.union(y), indices.values())]
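EDIT: made sure that all relevant columns in each overlap event are available for the random choice, regardless of whether they are part of the combs pair that identified the overlap event.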
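To illustrate the pairwise overlap check and the point of the edit, here is a small standalone sketch. The `indices` values are made-up row labels and `overlapping_columns` is a hypothetical helper, not part of the answer's function. A row shared by a pair of columns is reported together with every column that selected it, not just the pair that detected it.

```python
from itertools import combinations

import pandas as pd

# Hypothetical per-column selections (row labels), made up for illustration.
indices = {
    "X": pd.Index([3, 17]),
    "Y": pd.Index([17, 42]),
    "Z": pd.Index([8, 17]),
}


def overlapping_columns(indices):
    """Map each shared row label to the list of all columns that selected it."""
    shared = {}
    for col1, col2 in combinations(indices, 2):
        for label in indices[col1].intersection(indices[col2]):
            # Collect every column holding this label, not only col1 and col2.
            shared[int(label)] = [c for c in indices if label in indices[c]]
    return shared


print(overlapping_columns(indices))  # {17: ['X', 'Y', 'Z']}
```

Any of the listed columns can then be chosen at random to give up the shared row and take its next-best candidate instead.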