J-H*_*J-H 5 python numpy apply pandas
I want to expand the list entries of a dataframe using the information in column i:
i   s_1         s_1        s_3
2   [1, 2, 3]   [3, 4, 5]  NaN
1   NaN         [0, 0, 0]  [2]
The i value just indicates how often the last value of each list should be copied:
i   s_1               s_1              s_3
2   [1, 2, 3, 3, 3]   [3, 4, 5, 5, 5]  NaN
1   NaN               [0, 0, 0, 0]     [2, 2]
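The rule for a single cell can be sketched in plain Python. This is only an illustration of the expected output above: `pad_with_last` is a hypothetical helper name, and leaving non-list cells (such as NaN) unchanged is an assumption taken from the expected table rather than from the code in the question.

```python
def pad_with_last(cell, n):
    """Return a new list with the last element repeated n more times.

    Cells that are not lists (for example NaN) or that are empty
    are returned unchanged.
    """
    if not isinstance(cell, list) or not cell:
        return cell
    return cell + [cell[-1]] * int(n)


print(pad_with_last([1, 2, 3], 2))  # [1, 2, 3, 3, 3]
print(pad_with_last([2], 1))        # [2, 2]
```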
I am currently using a nested apply loop:
test.apply(lambda x: x.apply(
     lambda y: np.pad(y, (0, x.i), 'constant', constant_values=y[-1]) if type(y)==list else 0), axis=1)
However, this is very slow, and with a lot of rows (>10,000) the code breaks. The solution also seems a bit messy, so I'm wondering what the best approach would be for something like this.
You can try extending the lists in place:
for col in df.loc[:, "s_1":]:
    m = df[col].notna()
    for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
        v.extend([v[-1]] * i)
    df.loc[~m, col] = 0
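Note that the loop above mutates the list objects stored in the frame, so running it twice pads twice. If that is a concern, a non-mutating variant might look like the sketch below. It is not the answer's method: `expand_lists` is a hypothetical helper, the sample frame is rebuilt by hand, and the middle column is named `s_2` on the assumption that the duplicated `s_1` header is a typo. Unlike the answer, it leaves NaN cells as NaN instead of setting them to 0.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; the column names are assumptions.
df = pd.DataFrame(
    {
        "i": [2, 1],
        "s_1": [[1, 2, 3], np.nan],
        "s_2": [[3, 4, 5], [0, 0, 0]],
        "s_3": [np.nan, [2]],
    }
)


def expand_lists(frame, count_col="i"):
    """Return a copy of frame whose list cells are padded with their last value."""
    out = frame.copy()
    counts = [int(n) for n in frame[count_col]]
    for col in frame.columns.drop(count_col):
        # Build new lists instead of extending the existing ones in place.
        out[col] = [
            v + [v[-1]] * n if isinstance(v, list) and v else v
            for n, v in zip(counts, frame[col])
        ]
    return out


result = expand_lists(df)
print(result)
```

Because each padded list is a new object, `df` still holds the original lists after the call.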
Benchmark (the script prints both timings, f1 first):
from ast import literal_eval
from io import StringIO
from timeit import timeit

import numpy as np
import pandas as pd

def get_df():
    dfs = []
    # create some big dataframe
    for i in range(5000):
        txt = """
        i   s_1         s_1        s_3
        2   [1, 2, 3]   [3, 4, 5]  NaN
        1   NaN         [0, 0, 0]  [2]  """
        df = pd.read_csv(StringIO(txt), sep=r"\s{2,}", engine="python")
        df.loc[:, "s_1":] = df.loc[:, "s_1":].apply(
            lambda x: [v if pd.isna(v) else literal_eval(v) for v in x]
        )
        dfs.append(df)
    return pd.concat(dfs)
def f1(df):
    for col in df.loc[:, "s_1":]:
        m = df[col].notna()
        for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
            v.extend([v[-1]] * i)
        df.loc[~m, col] = 0
    return df
def f2(df):
    df = df.apply(
        lambda x: x.apply(
            lambda y: np.pad(y, (0, x.i), "constant", constant_values=y[-1])
            if type(y) == list
            else 0
        ),
        axis=1,
    )
    return df
df1 = get_df()
df2 = get_df()
t1 = timeit(lambda: f1(df1), number=1)
t2 = timeit(lambda: f2(df2), number=1)
print(t1)
print(t2)
So the improvement is roughly 200x.