How to generate all pairs of values from the result of a groupby in a pandas DataFrame

BKS*_*BKS 8 python combinations python-2.7 pandas

I have a pandas DataFrame df:

ID     words
1      word1
1      word2
1      word3
2      word4
2      word5
3      word6
3      word7
3      word8
3      word9

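For anyone who wants to reproduce this, the frame above can be built roughly as follows. This is only a sketch: the column names come from the table, and the integer dtype for ID is an assumption.

```python
import pandas as pd

# Rebuild the sample frame shown above (ID assumed to be an integer column).
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "words": ["word%d" % i for i in range(1, 10)],  # word1 ... word9
})
```
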
I want to generate another DataFrame that contains all pairs of words within each group. So the result for the above would be:

ID     wordA    wordB
1      word1    word2
1      word1    word3
1      word2    word3
2      word4    word5
3      word6    word7
3      word6    word8
3      word6    word9
3      word7    word8
3      word7    word9
3      word8    word9

I know I can use df.groupby('ID')['words'] to get the words for each ID.

I also know that I can use

import itertools

iterable = ['word1', 'word2', 'word3']
list(itertools.combinations(iterable, 2))

to get all possible pairwise combinations. However, I am a bit lost as to the best way to generate the resulting DataFrame shown above.

jez*_*ael 6

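You can use groupby with apply and return a DataFrame for each group. Then call reset_index(level=1, drop=True) to remove the second index level, and reset_index() once more to create the ID column from the index: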

from itertools import combinations
import pandas as pd

# build a two-column frame of all word pairs for one group
f = lambda x : pd.DataFrame(list(combinations(x.values,2)), 
                            columns=['wordA','wordB'])
df = (df.groupby('ID')['words'].apply(f)
                               .reset_index(level=1, drop=True)
                               .reset_index())
print(df)
   ID  wordA  wordB
0   1  word1  word2
1   1  word1  word3
2   1  word2  word3
3   2  word4  word5
4   3  word6  word7
5   3  word6  word8
6   3  word6  word9
7   3  word7  word8
8   3  word7  word9
9   3  word8  word9

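The same result can also be sketched without apply, by looping over the groups and collecting rows in a plain list. This is an alternative illustration, not part of the answer above; the helper name pairs_per_group and its parameters are hypothetical.

```python
from itertools import combinations

import pandas as pd


def pairs_per_group(frame, key="ID", value="words"):
    """Return one row per within-group pair (hypothetical helper)."""
    rows = [
        (group_id, a, b)
        # iterating a SeriesGroupBy yields (group key, Series of values)
        for group_id, values in frame.groupby(key)[value]
        for a, b in combinations(values, 2)
    ]
    return pd.DataFrame(rows, columns=[key, "wordA", "wordB"])


sample = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "words": ["word%d" % i for i in range(1, 10)],
})
pairs = pairs_per_group(sample)
```

Groups with a single row contribute no pairs, so they simply do not appear in the output.
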

Flo*_*oor 5

It is easy to use itertools combinations inside apply and then stack the result, i.e.

from itertools import combinations
import pandas as pd

# parentheses let the method chain continue across lines
ndf = (df.groupby('ID')['words'].apply(lambda x : list(combinations(x.values,2)))
                          .apply(pd.Series).stack().reset_index(level=0,name='words'))

   ID           words
0   1  (word1, word2)
1   1  (word1, word3)
2   1  (word2, word3)
0   2  (word4, word5)
0   3  (word6, word7)
1   3  (word6, word8)
2   3  (word6, word9)
3   3  (word7, word8)
4   3  (word7, word9)
5   3  (word8, word9)

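To see what each step of the apply/stack chain does, here is a sketch that splits it into named intermediates on a trimmed two-group sample. The variable names are illustrative only. The explicit dropna() is a precaution, on the assumption that some pandas versions keep the NaN padding when stacking.

```python
from itertools import combinations

import pandas as pd

small = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "words": ["word1", "word2", "word3", "word4", "word5"],
})

# Step 1: one list of pairs per ID.
pair_lists = small.groupby("ID")["words"].apply(
    lambda x: list(combinations(x.values, 2))
)

# Step 2: spread each list across columns; shorter lists are padded with NaN.
wide = pair_lists.apply(pd.Series)

# Step 3: stack the columns into a second index level and discard the padding.
long = wide.stack().dropna()

# Step 4: move ID out of the index; the remaining index is the pair position.
tidy = long.reset_index(level=0, name="words")
```

Here wide has one row per ID and as many columns as the largest group has pairs, which is why the stack step is needed to get back to one pair per row.
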
To match your exact output, we additionally have to do

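# axis is passed by keyword; recent pandas releases reject the positional form and set_axis(..., inplace=...)
sdf = pd.concat([ndf['ID'],ndf['words'].apply(pd.Series)],axis=1).set_axis(['ID','WordsA','WordsB'],axis=1)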

   ID WordsA WordsB
0   1  word1  word2
1   1  word1  word3
2   1  word2  word3
0   2  word4  word5
0   3  word6  word7
1   3  word6  word8
2   3  word6  word9
3   3  word7  word8
4   3  word7  word9
5   3  word8  word9

To turn this into a one-liner, we can do the following:

# axis is passed by keyword; recent pandas releases reject set_axis(..., 1, inplace=False)
combo = df.groupby('ID')['words'].apply(combinations,2)\
                     .apply(list).apply(pd.Series)\
                     .stack().apply(pd.Series)\
                     .set_axis(['WordsA','WordsB'],axis=1)\
                     .reset_index(level=0)