Converting a pseudo-algorithm into Python -> pandas code

Emi*_*yev 5 python dataframe pandas

I am trying to convert pseudocode into pandas code. Any help or guidance would be appreciated.

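The general idea is to come up with a function, f, that selects rows from a toy example dataset with 100 rows and 5 columns, ["X", "Y", "Z", "F", "V"], randomly filled with integers in [0, 500). Besides the data, the function takes a second input, cols_to_use, the columns the selection should be based on; by default all columns are used.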

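Description. The goal is to select 10 rows from the example dataset. There are 5 probabilities, i.e. possible cases, for the second argument of the function -> selection based on [1, 2, 3, 4, 5] columns.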

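If all columns have to be used, we select 2 rows per column: the rows corresponding to the top 2 values of each column. The initial selection may contain overlapping rows; call this an overlap1 event. If an overlap1 event occurs, we randomly pick one column that keeps the overlapping row, and for each of the other columns involved we add its third row instead. During this step the newly selected rows may again overlap with already selected ones -> call this an overlap2 event. If overlap2 occurs, use the 4th row of that column, then the 5th, and so on. On average there is a probability of about 0.25 that the initial selection contains at least one overlap, so it is important to account for it. The final selection must contain 10 unique rows.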
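
The overlap1 check described above can be sketched as follows. This is only an illustration under assumptions: the helper names top_rows, find_overlap1 and estimate_overlap_rate are hypothetical and not part of the original post, and the simulation simply regenerates random toy datasets to see how often the initial top-2 selection overlaps.

```python
import numpy as np
import pandas as pd


def top_rows(df, cols, n):
    """Map each column to the index labels of its n largest values."""
    return {col: df[col].nlargest(n).index for col in cols}


def find_overlap1(picks):
    """Return the row labels that were picked by more than one column."""
    all_labels = pd.Index(np.concatenate([idx.to_numpy() for idx in picks.values()]))
    return all_labels[all_labels.duplicated()].unique()


def estimate_overlap_rate(n_trials=2000, n_rows=100, n_cols=5, n_top=2, seed=0):
    """Fraction of random datasets whose initial selection has at least one overlap."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        data = rng.integers(0, 500, size=(n_rows, n_cols))
        # Row positions of the n_top largest values in every column
        top = np.argsort(-data, axis=0, kind="stable")[:n_top]
        # Fewer unique rows than picks means at least two columns share a row
        hits += int(np.unique(top).size < n_top * n_cols)
    return hits / n_trials


# Tiny hand-made example: row 0 is a top-2 row for both X and Y
toy = pd.DataFrame({"X": [9, 8, 1, 0], "Y": [9, 0, 8, 1]})
overlap = find_overlap1(top_rows(toy, ["X", "Y"], 2))
```

Calling estimate_overlap_rate() gives an empirical figure to compare against the 0.25 mentioned above.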

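If 4 columns form the basis of the selection, we select the rows corresponding to the top 2 values of each column and resolve overlap1 events. But we still need 2 more rows. So we randomly draw 2 of these 4 columns and, for each of them, select an additional row corresponding to its third value -> or its fourth when overlap2 occurs, and so on.

If there are 3 columns, select 3 rows per column following the rules above + overlap1 resolution (if any), then randomly pick one column that contributes the remaining 1 row + resolve overlap2 events.

When 2 columns have to be used, select 5 rows per column + handle overlap1 and overlap2 events.

When only 1 column has to be used, select the 10 rows corresponding to that column's 10 highest values.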
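
The per-column counts in the five cases above follow one pattern: every column gets 10 // k rows, and the remaining 10 % k rows go to randomly drawn columns. A minimal sketch of that bookkeeping, before any overlap handling, assuming a hypothetical helper column_quotas that is not part of the original post:

```python
import random


def column_quotas(cols_to_use, n_rows=10, rng=random):
    """Return how many rows each column should contribute before overlap handling."""
    base, extra = divmod(n_rows, len(cols_to_use))
    quotas = {col: base for col in cols_to_use}
    # Hand the leftover rows to distinct, randomly drawn columns
    for col in rng.sample(list(cols_to_use), extra):
        quotas[col] += 1
    return quotas


quotas_for_three = column_quotas(["X", "Y", "Z"])
```

With 5 columns this gives 2 rows each, with 4 columns two of them get a third row, with 3 columns one of them gets a fourth row, and so on, always summing to 10.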

# imports the snippets below rely on
import random

import numpy as np
import pandas as pd

# sample dataset to work with
sample = pd.DataFrame(np.random.randint(0, 500, size = (100, 5)))
sample.columns = "X Y Z F V".split()

# the function I have written so far
def f(df, cols_to_use = ["X", "Y", "Z", "F", "V"]):

    how_many_per_feature = {
        5:2,
        4:2,
        3:3,
        2:5,
        1:10
    }
    n_per_group = how_many_per_feature[len(cols_to_use)]

    # columns to randomly choose when adding extra options
    # could not find a proper way to implement this
    
    if len(cols_to_use) == 4:
        randomly_selected_columns = random.sample(cols_to_use, 2)
    elif len(cols_to_use) == 3:
        randomly_selected_columns = random.sample(cols_to_use, 1)
    
    
    # first I filter the dataframe on columns I need
    filtered_df = df[cols_to_use]
    
    # using pandas melt to select top n_per_group
    
    
    # melt all selected columns into long form, keeping the original row labels
    result = filtered_df.melt(var_name = "feature",
                        ignore_index = False,
                        ).groupby("feature").value.nlargest(n_per_group)

    # here supposed to handle overlap1 events
    
    # here overlap2 events
                        
    # level_1 holds the original row labels, so select by label rather than by position
    index = result.reset_index().level_1.values
    
    return df.loc[index, :]
 

I could not implement the dynamic selection that handles the overlap events.

hyi*_*yit 1

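Maybe something like this? It is not the cleanest, but I think it gets the job done.
I am not clear, though, on what you mean by "There are 5 probabilities for the second argument of the function -> selection based on [1, 2, 3, 4, 5] columns."
The idea here is to separate the columns and sort each one individually. We can then select and drop indices for each column as needed. Overlaps are detected by checking for common indices between every possible pair of columns, and we keep iterating until all overlaps are resolved (i.e. overlap3, overlap4, ... events are handled as well).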
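
As a small illustration of the two building blocks just described, here is a sketch of the pairwise overlap check and of falling back to a column's next-best unused row. The names overlapping_pairs and next_free_rows are hypothetical helpers introduced only for this example.

```python
from itertools import combinations

import pandas as pd


def overlapping_pairs(indices):
    """Yield (col1, col2, shared_labels) for every pair of columns sharing rows."""
    for col1, col2 in combinations(indices, 2):
        shared = indices[col1].intersection(indices[col2])
        if len(shared):
            yield col1, col2, shared


def next_free_rows(column, taken, n=1):
    """Return the labels of the n largest values of `column` not already taken."""
    ranked = column.sort_values(ascending=False, kind="mergesort").index
    free = ranked[~ranked.isin(list(taken))]
    return free[:n]


picked = {"X": pd.Index([0, 1]), "Y": pd.Index([0, 2]), "Z": pd.Index([3, 4])}
pairs = list(overlapping_pairs(picked))

values = pd.Series([5, 9, 7, 1], index=[10, 11, 12, 13])
replacement = next_free_rows(values, taken={11}, n=2)
```

Here only X and Y share a row, and the replacement skips label 11 because another column already holds it.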

from itertools import combinations
from functools import reduce
import numpy as np

def f(df, cols_to_use=None, n_rows=10):
    df = df.copy()  # Ensure we don't modify the original dataframe
    cols_to_use = cols_to_use or ["X", "Y", "Z", "F", "V"]
    n_rows_per_col = n_rows // len(cols_to_use)
    additional_rows = n_rows % len(cols_to_use)
    top = dict()
    indices = dict()

    # Sort and create initial indices set
    for col in cols_to_use:
        top[col] = df[col].sort_values(ascending=False)
        indices[col] = top[col].iloc[:n_rows_per_col].index
        top[col] = top[col].drop(index=indices[col])  # Remove added indices

    # Ensure we have exactly n_rows indices
    if additional_rows:
        # select additional_rows distinct columns from our cols_to_use
        more_cols = np.random.choice(cols_to_use, size=additional_rows, replace=False)
        for c in more_cols:
            # add the next-best row of the chosen column c (not the stale loop variable col)
            indices[c] = indices[c].union(top[c].iloc[[0]].index)
            top[c] = top[c].drop(index=indices[c], errors='ignore')

    # Resolve overlap events as needed
    combs = list(combinations(cols_to_use, r=2))
    overlap = True
    while overlap:
        overlap = False
        for col1, col2 in combs:
            intersect = indices[col1].intersection(indices[col2])
            if intersect.shape[0]:  # Overlap between col1 and col2
                overlap = True
                # Consider all columns that contain this intersect
                cols = [c for c in cols_to_use if intersect.isin(indices[c]).any()]
                # Choose which column to remove the overlap from and use the following entries from its top value
                c = np.random.choice(cols)
                indices[c] = indices[c].drop(intersect).union(top[c].iloc[:intersect.shape[0]].index)
                top[c] = top[c].drop(index=indices[c], errors='ignore')  # update column c's own ranking
    return df[cols_to_use].loc[reduce(lambda x, y: x.union(y), indices.values())]

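Edit: made sure that all columns involved in an overlap event are available for the random choice, whether or not they are part of the pair in combs that identified the overlap event.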
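
Since the selection involves random choices, it can help to check the result against the requirement from the question (exactly 10 unique rows taken from the data). A minimal sketch, assuming a hypothetical helper check_selection that is not part of either post; it is demonstrated on a plain single-column top-10 selection rather than on f itself:

```python
import numpy as np
import pandas as pd


def check_selection(selected, source, n_rows=10):
    """Raise AssertionError unless `selected` is n_rows unique rows of `source`."""
    assert len(selected) == n_rows, "wrong number of rows"
    assert selected.index.is_unique, "duplicate rows selected"
    assert selected.index.isin(source.index).all(), "unknown row labels"
    return True


rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.integers(0, 500, size=(100, 5)), columns=list("XYZFV"))

# Single-column case: the 10 rows with the highest X values
ok = check_selection(sample.nlargest(10, "X"), sample)
```

The same check can be run on the output of f for several values of cols_to_use.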