Compare 2 Pandas DataFrames row by row, cell by cell

Question


Zub*_*ubo 3 python iteration iterator dataframe pandas

I have 2 DataFrames, df1 and df2, and want to do the following, storing the results in df3:

for each row in df1:

    for each row in df2:

        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results 

        for each cell(column) in df1: 

            for the cell in df2 whose column name is the same as for the cell in df1:

                compare the cells (using some comparing function func(a,b) ) and, 
                depending on the result of the comparison, write result into the 
                appropriate column of the "df1-1, df2-1" row of df3


For example, something like:

df1
A   B    C      D
foo bar  foobar 7
gee whiz herp   10

df2
A   B   C      D
zoo car foobar 8

df3
df1-df2 A             B              C                   D
foo-zoo func(foo,zoo) func(bar,car)  func(foobar,foobar) func(7,8)
gee-zoo func(gee,zoo) func(whiz,car) func(herp,foobar)   func(10,8)


I started with this:

for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:


But I don't know how to go on from here, and would appreciate some help.

Answer 1

Sta*_*Fox 5

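So, to continue the discussion from the comments: you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you should not need to call iterrows() at all. To be a bit more explicit about my suggestion: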

# with df1 and df2 provided as above, an example
df3 = df1['A'] * 3 + df2['A']

# recall that df2 has only one row, so index alignment leaves a NaN at index 1
df3
0    foofoofoozoo
1             NaN
Name: A, dtype: object

# more generally

# we know that df1 and df2 share column names, so we can initialize df3 with those names
df3 = pd.DataFrame(columns=df1.columns) 
for colName in df1:
    df3[colName] = func(df1[colName], df2[colName])

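Because the column-wise loop above aligns df1 and df2 on their index, it does not by itself pair every row of df1 with every row of df2. A minimal sketch of one way to get that pairing is a cross join on a constant key, followed by the same per-column call. The data is taken from the question; the equality-based `func`, the helper column `_key`, and the `_1`/`_2` suffixes are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Example frames from the question.
df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})


def func(left, right):
    # Hypothetical comparison: element-wise equality of two Series.
    return left == right


# Pair every df1 row with every df2 row by joining on a constant key.
pairs = df1.assign(_key=1).merge(df2.assign(_key=1), on="_key",
                                 suffixes=("_1", "_2"))

# Apply func column by column on the paired rows (no iterrows needed).
df3 = pd.DataFrame({col: func(pairs[f"{col}_1"], pairs[f"{col}_2"])
                    for col in df1.columns})

# Label each result row with the pair it came from, e.g. "foo-zoo".
df3.index = (pairs["A_1"] + "-" + pairs["A_2"]).tolist()
print(df3)
```

With this layout each row of df3 corresponds to one (df1 row, df2 row) pair, which matches the shape of the df3 table sketched in the question.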

Now, you can even apply different functions to different columns by creating lambda functions and then zipping them with the column names:

# some example functions
colAFunc = lambda x, y: x + y
colBFunc = lambda x, y: x - y
....
columnFunctions = [colAFunc, colBFunc, ...]

# initialize df3 as above
df3 = pd.DataFrame(columns=df1.columns)
for func, colName in zip(columnFunctions, df1.columns):
    df3[colName] = func(df1[colName], df2[colName])

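The zip-based pairing above depends on the functions being listed in the same order as the columns. As a hedged alternative, the sketch below keys the functions by column name in a dict. The particular functions and the use of df2's single row as a scalar operand are assumptions chosen to fit the question's example data.

```python
import pandas as pd

# Example frames from the question.
df1 = pd.DataFrame({"A": ["foo", "gee"], "B": ["bar", "whiz"],
                    "C": ["foobar", "herp"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})

# Hypothetical per-column functions, keyed by column name.
column_functions = {
    "A": lambda x, y: x + "-" + y,  # join the two strings
    "B": lambda x, y: x == y,       # equality check
    "C": lambda x, y: x == y,       # equality check
    "D": lambda x, y: x - y,        # numeric difference
}

# df2 has a single row here, so compare each df1 column against that row's value.
row = df2.iloc[0]
df3 = pd.DataFrame({col: f(df1[col], row[col])
                    for col, f in column_functions.items()})
print(df3)
```

Keying by name means reordering the columns of df1 cannot silently attach a function to the wrong column.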

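The only "gotcha" that comes to mind is that you need to make sure your function is valid for the data in the column. For example, if you did something like df1['A'] - df2['A'] (with df1 and df2 as you provided), it would raise a TypeError, because subtraction of two strings is undefined. Just something to be aware of.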
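To make that concrete, here is a small sketch that triggers the error on string data and then guards against it by checking the dtype first. `safe_func` is a hypothetical helper, and falling back to an equality check for non-numeric columns is just one possible choice. The string Series are created with `dtype=object` explicitly, which is an assumption about how the question's string columns are stored.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

left = pd.Series(["foo", "gee"], dtype=object)
right = pd.Series(["zoo", "zoo"], dtype=object)

# Subtracting two string columns fails, because str - str is undefined.
try:
    left - right
    outcome = "ok"
except TypeError:
    outcome = "TypeError"
print(outcome)


def safe_func(a, b):
    # Hypothetical guard: subtract numeric columns, otherwise compare for equality.
    if is_numeric_dtype(a) and is_numeric_dtype(b):
        return a - b
    return a == b


print(safe_func(left, right))
print(safe_func(pd.Series([7, 10]), pd.Series([8, 8])))
```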

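Edit, re: your comment: that is doable too. Iterate over the columns of whichever dfX has more of them, so that you don't run into a KeyError, and throw an if statement in there: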

# all the other jazz
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
# so iterate over df2 columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])

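A self-contained version of that loop might look like the sketch below. The numeric sample data and the equality-based `func` are made up for illustration. df3 is created with df2's index up front, so that assigning the scalar NaN fills every row even if a missing column happens to come first.

```python
import numpy as np
import pandas as pd

# Made-up frames: df1 lacks column "D", df2 has it.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
df2 = pd.DataFrame({"A": [1, 0], "B": [3, 0], "C": [0, 6], "D": [9, 9]})


def func(left, right):
    # Hypothetical comparison: element-wise equality.
    return left == right


# Start from df2's index so scalar assignments broadcast to all rows.
df3 = pd.DataFrame(index=df2.index)
for col_name in df2.columns:
    if col_name not in df1.columns:
        df3[col_name] = np.nan  # no counterpart in df1
    else:
        df3[col_name] = func(df1[col_name], df2[col_name])
print(df3)
```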
