Pandas: df_left.merge(df_right) summary statistics

san*_*tle 5 python pandas

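For the Pandas df.merge() method, is there a convenient way to get summary statistics about the merge (for example, the number of matches, the number of non-matches, and so on)? I know these statistics depend on the how='inner' flag, but it would be handy to know how much was "discarded" when using an inner join, etc. I could simply use: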

df = df_left.merge(df_right, on='common_column', how='inner')
set1 = set(df_left['common_column'].unique())
set2 = set(df_right['common_column'].unique())
set1.issubset(set2)   # True: no further analysis required
set2.issubset(set1)   # False
num_shared = len(set2.intersection(set1))
num_diff = len(set2.difference(set1))
# And so on ...

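But I assume this may already be implemented. Did I miss it (i.e. something like report=True that would return the merged new_dataframe together with a report Series or DataFrame)?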

小智 1

Try this function... you can then pass your arguments to it like this:

df = merge_like_stata(df1, df2, mergevars)

Function definition:

import pandas as pd

def merge_like_stata(master, using, mergevars):
    # Work on copies so the caller's DataFrames are not modified
    master = master.copy()
    using = using.copy()
    # Tag each side so the origin of every row is known after the outer merge
    master['_master_merge_'] = 'master'
    using['_using_merge_'] = 'using'
    df = pd.merge(master, using, on=mergevars, how='outer')
    in_master = df['_master_merge_'].notna()
    in_using = df['_using_merge_'].notna()
    # Stata convention: 1 = master only, 2 = using only, 3 = matched
    df['_merge'] = None
    df.loc[in_master & ~in_using, '_merge'] = '1 - Master Only'
    df.loc[~in_master & in_using, '_merge'] = '2 - Using Only'
    df.loc[in_master & in_using, '_merge'] = '3 - Matched'
    # Print the summary table of merge outcomes, then drop the helper columns
    df['column'] = 'Count'
    print(pd.crosstab(df._merge, df.column, margins=True))
    df = df.drop(['_master_merge_', '_using_merge_', 'column'], axis=1)
    return df
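As a point of comparison with the hand-rolled function above: pandas' own merge accepts an indicator parameter, which covers much of what the question asks for. The following is only a minimal sketch. The sample frames and the extra columns a and b are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df_left = pd.DataFrame({'common_column': [1, 2, 3, 4], 'a': list('wxyz')})
df_right = pd.DataFrame({'common_column': [3, 4, 5], 'b': [30, 40, 50]})

# An outer merge keeps every row; indicator=True adds a categorical
# '_merge' column whose values are 'left_only', 'right_only' or 'both'
merged = df_left.merge(df_right, on='common_column', how='outer', indicator=True)

# Number of rows that fell into each category
report = merged['_merge'].value_counts()
print(report)

# The rows an inner join would have kept
inner = merged[merged['_merge'] == 'both'].drop(columns='_merge')
```

Note that the counts are per row of the merged result, not per unique key, so duplicated keys are counted more than once. Also, filtering an outer merge down to 'both' may leave different dtypes than a direct inner join would (for example, integer columns upcast to float because of the NaNs introduced by the outer merge).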