pandas apply function that returns multiple values to rows in a pandas dataframe

Question

I have a dataframe with a time index and 3 columns containing the coordinates of a 3D vector:

                         x             y             z
ts
2014-05-15 10:38         0.120117      0.987305      0.116211
2014-05-15 10:39         0.117188      0.984375      0.122070
2014-05-15 10:40         0.119141      0.987305      0.119141
2014-05-15 10:41         0.116211      0.984375      0.120117
2014-05-15 10:42         0.119141      0.983398      0.118164

I would like to apply a transformation to each row that also returns a vector:

def myfunc(a, b, c):
    # do something
    return e, f, g

But if I do:

df.apply(myfunc, axis=1)

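I end up with a Pandas Series whose elements are tuples. This is because apply takes the result of myfunc without unpacking it. How can I change myfunc so that I obtain a new df with 3 columns?

Edit:

All of the solutions below work. The Series solution allows for column names, and the list solution seems to execute faster.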

def myfunc1(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return pd.Series([e,f,g], index=['a', 'b', 'c'])

def myfunc2(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return [e,f,g]

%timeit df.apply(myfunc1 ,axis=1)

100 loops, best of 3: 4.51 ms per loop

%timeit df.apply(myfunc2 ,axis=1)

100 loops, best of 3: 2.75 ms per loop

Answer 1

U2E*_*EF1 45

Return a Series and the results will be put in a DataFrame.

def myfunc(a, b, c):
    # do something
    return pd.Series([e, f, g])

This lets you give a label to each resulting column. If you return a DataFrame instead, it just inserts multiple rows for the group.
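
A minimal sketch of the labelled-Series approach, assuming a placeholder transformation borrowed from the question's edit. The output labels 'e', 'f', 'g' and the two-row sample frame are illustrative assumptions. Because apply passes each row as a single Series, the sketch unpacks the row into the three arguments with a lambda.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [0.120117, 0.117188], "y": [0.987305, 0.984375], "z": [0.116211, 0.122070]},
    index=pd.to_datetime(["2014-05-15 10:38", "2014-05-15 10:39"]),
)

def myfunc(a, b, c):
    # Placeholder transformation (assumption: taken from the question's edit).
    e = a + 2 * b
    f = b * c + 1
    g = c + a * b
    # The Series index becomes the column labels of the resulting DataFrame.
    return pd.Series([e, f, g], index=["e", "f", "g"])

# apply hands over each row as one Series; unpack its values into a, b, c.
result = df.apply(lambda row: myfunc(*row), axis=1)
print(result)
```

The result keeps the original time index and has one column per Series label.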

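The Series answer seems to be the canonical one. However, on version 0.18.1 the Series solution takes about 4x longer than running apply multiple times. (4 upvotes)
To add a bit to this nice answer, one can further do `new_vars = ['e', 'f', 'g']` and `df[new_vars] = df.apply(my_func, axis=1)` (4 upvotes)
Isn't it extremely inefficient to create a whole `pd.Series` on every iteration? (3 upvotes)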

Answer 2

Den*_*zov 18

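Based on the excellent answer by @U2EF1, I've created a handy function that applies a given tuple-returning function to a dataframe field and expands the result back into the dataframe.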

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)

Usage:

df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['A'])
print(df)
   A
a  1
b  2
c  3

def func(x):
    return x*x, x*x*x

print(apply_and_concat(df, 'A', func, ['x^2', 'x^3']))

   A  x^2  x^3
a  1    1    1
b  2    4    8
c  3    9   27

Hope it helps somebody.

Answer 3

Gen*_*ito 10

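I tried returning a tuple (I was using functions like scipy.stats.pearsonr, which return that kind of structure), but apply returned a 1D Series instead of the DataFrame I expected. Creating a Series manually performed worse, so I fixed it with result_type, as explained in the official API documentation:

Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.

So you could edit your code this way: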

def myfunc(a, b, c):
    # do something
    return (e, f, g)

# apply passes each row as a single Series, so unpack it into a, b, c
df.apply(lambda row: myfunc(*row), axis=1, result_type='expand')

If you want to create two or three (or n) new columns in the dataframe, you can use: `df['e'], df['f'], df['g'] = df.apply(myfunc, axis=1, result_type='expand').T.values` (3 upvotes)
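
As a rough illustration of the result_type='expand' behaviour described in this answer, the sketch below contrasts the default result with the expanded one. The sample frame, the transformation inside myfunc, and the column names 'e', 'f', 'g' are assumptions made for the example. It assumes a pandas version that supports result_type (0.23 or later).

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})

def myfunc(row):
    # Hypothetical transformation that returns a plain tuple.
    return (row["x"] + row["y"], row["y"] * row["z"], row["z"] - row["x"])

# Default behaviour: a Series whose elements are tuples.
as_series = df.apply(myfunc, axis=1)

# result_type='expand': each tuple element becomes its own column (0, 1, 2).
expanded = df.apply(myfunc, axis=1, result_type="expand")
expanded.columns = ["e", "f", "g"]

# Assign all new columns back to the original frame in one step.
df[["e", "f", "g"]] = expanded
print(df)
```

Naming the expanded columns before assigning them avoids transposing the values by hand.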

Answer 4

Fra*_*Fra 9

Found a possible solution: change myfunc to return an np.array, like this:

import numpy as np

def myfunc(a, b, c):
    # do something
    return np.array((e, f, g))

Is there a better solution?

Answer 5

Hap*_*001 9

Just return a list instead of a tuple.

In [81]: df
Out[81]: 
                            x         y         z
ts                                               
2014-05-15 10:38:00  0.120117  0.987305  0.116211
2014-05-15 10:39:00  0.117188  0.984375  0.122070
2014-05-15 10:40:00  0.119141  0.987305  0.119141
2014-05-15 10:41:00  0.116211  0.984375  0.120117
2014-05-15 10:42:00  0.119141  0.983398  0.118164

[5 rows x 3 columns]

In [82]: def myfunc(args):
   ....:        e=args[0] + 2*args[1]
   ....:        f=args[1]*args[2] +1
   ....:        g=args[2] + args[0] * args[1]
   ....:        return [e,f,g]
   ....: 

In [83]: df.apply(myfunc ,axis=1)
Out[83]: 
                            x         y         z
ts                                               
2014-05-15 10:38:00  2.094727  1.114736  0.234803
2014-05-15 10:39:00  2.085938  1.120163  0.237427
2014-05-15 10:40:00  2.093751  1.117629  0.236770
2014-05-15 10:41:00  2.084961  1.118240  0.234512
2014-05-15 10:42:00  2.085937  1.116202  0.235327

This doesn't work. It returns a Series whose elements are lists. I'm on pandas 0.18.1. (11 upvotes)
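
When apply does hand back a Series of lists, as the comment above reports, one possible workaround is to build a DataFrame from those lists afterwards. This is only a sketch: it assumes a pandas version where a list return value yields a Series of lists, and the column names 'e', 'f', 'g' and the sample data are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})

def myfunc(row):
    e = row["x"] + 2 * row["y"]
    f = row["y"] * row["z"] + 1
    g = row["z"] + row["x"] * row["y"]
    return [e, f, g]

# Assumption: on this pandas version the result is a Series of lists.
lists = df.apply(myfunc, axis=1)

# Rebuild a three-column DataFrame from the lists, keeping the original index.
expanded = pd.DataFrame(lists.tolist(), index=lists.index, columns=["e", "f", "g"])
print(expanded)
```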

Answer 6

Kei*_*iku 6

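Some of the other answers contain mistakes, so I have summarized them below. The perfect answer is below.

Prepare the dataset. The pandas version used is 1.1.5.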

import numpy as np
import pandas as pd
import timeit

# check pandas version
print(pd.__version__)
# 1.1.5

# prepare DataFrame
df = pd.DataFrame({
    'x': [0.120117, 0.117188, 0.119141, 0.116211, 0.119141],
    'y': [0.987305, 0.984375, 0.987305, 0.984375, 0.983398],
    'z': [0.116211, 0.122070, 0.119141, 0.120117, 0.118164]},
    index=[
        '2014-05-15 10:38',
        '2014-05-15 10:39',
        '2014-05-15 10:40',
        '2014-05-15 10:41',
        '2014-05-15 10:42'],
    columns=['x', 'y', 'z'])
df.index.name = 'ts'
#                          x         y         z
# ts                                            
# 2014-05-15 10:38  0.120117  0.987305  0.116211
# 2014-05-15 10:39  0.117188  0.984375  0.122070
# 2014-05-15 10:40  0.119141  0.987305  0.119141
# 2014-05-15 10:41  0.116211  0.984375  0.120117
# 2014-05-15 10:42  0.119141  0.983398  0.118164

Solution 01.

Return a pd.Series from the applied function.

def myfunc1(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return pd.Series([e, f, g])

df[['e', 'f', 'g']] = df.apply(myfunc1, axis=1)
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t1 = timeit.timeit(
    'df.apply(myfunc1, axis=1)',
    globals=dict(df=df, myfunc1=myfunc1), number=10000)
print(round(t1, 3), 'seconds')
# 14.571 seconds

Solution 02.

Pass result_type='expand' when calling apply.

def myfunc2(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return [e, f, g]

df[['e', 'f', 'g']] = df.apply(myfunc2, axis=1, result_type='expand')
#                          x         y         z         e         f         g
# ts                                                                          
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t2 = timeit.timeit(
    "df.apply(myfunc2, axis=1, result_type='expand')",
    globals=dict(df=df, myfunc2=myfunc2), number=10000)
print(round(t2, 3), 'seconds')
# 9.907 seconds

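Solution 03.

If you want to make it faster, use np.vectorize. Note that with np.vectorize the function cannot take a single args argument; each value has to be passed as a separate argument.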

def myfunc3(args0, args1, args2):
    e = args0 + 2*args1
    f = args1*args2 + 1
    g = args2 + args0 * args1
    return [e, f, g]

df[['e', 'f', 'g']] = pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)
#                          x         y         z         e         f         g
# ts                                                                          
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t3 = timeit.timeit(
    "pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)",
    globals=dict(pd=pd, np=np, df=df, myfunc3=myfunc3), number=10000)
print(round(t3, 3), 'seconds')
# 1.598 seconds
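
For formulas as simple as the ones in myfunc3, a different technique is also possible: plain column arithmetic, with no apply or np.vectorize at all. The sketch below is not part of the answer above and was not timed against it. It reuses the same three formulas on a shortened copy of the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [0.120117, 0.117188, 0.119141],
    "y": [0.987305, 0.984375, 0.987305],
    "z": [0.116211, 0.122070, 0.119141],
})

# Same formulas as myfunc3, evaluated on whole columns at once.
new_cols = pd.DataFrame({
    "e": df["x"] + 2 * df["y"],
    "f": df["y"] * df["z"] + 1,
    "g": df["z"] + df["x"] * df["y"],
}, index=df.index)

# Attach the new columns next to the originals.
df = df.join(new_cols)
print(df)
```

This only works when the transformation can be written as operations on whole columns. A function that needs arbitrary Python logic per row still requires one of the three solutions above.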

Answer 7

Rac*_*lom 5

Pandas 1.0.5 has DataFrame.apply with a result_type parameter, which can help here. From the docs:

These only act when axis=1 (columns):

'expand' : list-like results will be turned into columns.

'reduce' : returns a Series if possible rather than expanding list-like results. This is the opposite of 'expand'.

'broadcast' : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
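
The three options quoted above can be compared side by side. The following is a small sketch under assumed sample data, with a hypothetical helper as_list that returns a list as long as the number of columns (which 'broadcast' requires).

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

def as_list(row):
    # Hypothetical row function: add 1 to every coordinate, return a list.
    return [row["x"] + 1, row["y"] + 1, row["z"] + 1]

# 'expand': list elements become columns labelled 0, 1, 2.
expanded = df.apply(as_list, axis=1, result_type="expand")

# 'reduce': the lists are kept intact as the elements of a Series.
reduced = df.apply(as_list, axis=1, result_type="reduce")

# 'broadcast': values are written back into the original shape,
# keeping the original index and the column names x, y, z.
broadcast = df.apply(as_list, axis=1, result_type="broadcast")

print(expanded, reduced, broadcast, sep="\n\n")
```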

Asked: 11 years, 8 months ago
Viewed: 45,135 times
Last active: 8 years, 7 months ago