How can we convert pairwise-mapped strings into groups of strings based on a score?

Yos*_*nti 2 python grouping pandas

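If a pair of words has a score above 0.5, group the two words together. If any other keyword has a score above 0.5 with a word already in a group, add that keyword to the same group.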

Example:

Input:

word1              word2       score
hello           hello world    0.75
hello world     hi world       0.555
hello           hi world       0
good morning    hello          0
good morning    morning        0.75
morning         hello          0
morning         hello world    0
morning         hi world       0
good morning    hello world    0
good morning    hi world       0   

Output:

word                 group
hello                 1
hello world           1
hi world              1
good morning          2
morning               2

jez*_*ael 5

First, filter the rows with boolean indexing and Series.gt, keeping only the pairs whose score is greater than 0.5:

df1 = df[df['score'].gt(0.5)]
print (df1)
          word1        word2  score
0         hello  hello world  0.750
1   hello world     hi world  0.555
4  good morning      morning  0.750
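
The filter above assumes a DataFrame `df` that already holds the question's table. A minimal sketch of building it, with the rows copied from the question's input (the variable name `rows` is only illustrative):

```python
import pandas as pd

# Rows copied from the question's input table
rows = [
    ("hello", "hello world", 0.75),
    ("hello world", "hi world", 0.555),
    ("hello", "hi world", 0),
    ("good morning", "hello", 0),
    ("good morning", "morning", 0.75),
    ("morning", "hello", 0),
    ("morning", "hello world", 0),
    ("morning", "hi world", 0),
    ("good morning", "hello world", 0),
    ("good morning", "hi world", 0),
]
df = pd.DataFrame(rows, columns=["word1", "word2", "score"])

# Keep only the pairs whose score is strictly greater than 0.5
df1 = df[df["score"].gt(0.5)]
print(df1)
```

Because `Series.gt` is a strict comparison, a pair scoring exactly 0.5 would be dropped.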

Then use networkx with connected_components to build a dictionary that maps each word to its group id:

import networkx as nx

# Create the graph from the dataframe
g = nx.Graph()
g.add_edges_from(df1[['word1','word2']].itertuples(index=False))

connected_components = nx.connected_components(g)

# Find the component id of the nodes
node2id = {}
for cid, component in enumerate(connected_components):
    for node in component:
        node2id[node] = cid + 1
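
For intuition about what the connected-components step computes, here is a rough pure-Python sketch based on a union-find structure. It is not how networkx implements it, and the function name `group_pairs` is hypothetical; it only assumes the input is an iterable of `(word1, word2)` pairs that already passed the score filter.

```python
def group_pairs(pairs):
    """Assign a 1-based group id to every word that appears in `pairs`."""
    parent = {}

    def find(x):
        # Follow parent links up to the root, shortening the path on the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    # Number the groups in order of first appearance
    root2id = {}
    node2id = {}
    for node in parent:
        root = find(node)
        node2id[node] = root2id.setdefault(root, len(root2id) + 1)
    return node2id


# The three pairs that survive the score filter in the question
pairs = [
    ("hello", "hello world"),
    ("hello world", "hi world"),
    ("good morning", "morning"),
]
node2id = group_pairs(pairs)
print(node2id)
```

Words linked through a chain of pairs (here `hello` and `hi world`, via `hello world`) land in the same group even though their direct score is 0.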

Finally, reshape with DataFrame.stack, remove duplicates with Series.drop_duplicates, and create the new column with Series.map:

df2 = df1[['word1','word2']].stack().drop_duplicates().reset_index(drop=True).to_frame('word')
df2['group'] = df2['word'].map(node2id)
print (df2)
           word  group
0         hello      1
1   hello world      1
2      hi world      1
3  good morning      2
4       morning      2
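
Note that a word which never appears in a pair scoring above 0.5 is absent from `df1`, so it does not show up in `df2` at all. The question does not say how such words should be treated. If each of them should get a group of its own, one possible variation (an assumption, not part of the answer above) is to register every word as a graph node before adding the edges. The `good night` row below is hypothetical data added only to show the effect.

```python
import networkx as nx
import pandas as pd

# Hypothetical data: "good night" never scores above 0.5 with anything
df = pd.DataFrame(
    [
        ("hello", "hello world", 0.75),
        ("hello world", "hi world", 0.555),
        ("good morning", "morning", 0.75),
        ("good night", "hello", 0.1),
    ],
    columns=["word1", "word2", "score"],
)

g = nx.Graph()
# Register every word as a node, so isolated words form their own component
g.add_nodes_from(df[["word1", "word2"]].stack().unique())
# Connect only the pairs scoring above the threshold
strong = df.loc[df["score"].gt(0.5), ["word1", "word2"]]
g.add_edges_from(strong.itertuples(index=False))

# Map each word to the 1-based id of its connected component
node2id = {
    node: cid
    for cid, component in enumerate(nx.connected_components(g), start=1)
    for node in component
}

df2 = pd.DataFrame({"word": list(g.nodes)})
df2["group"] = df2["word"].map(node2id)
print(df2)
```

In this sketch `good night` ends up alone in a third group instead of being dropped.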