I have a list of sentences on a few topics (two of them), like the following:
Sentences
Trump says that it is useful to win the next presidential election.
The Prime Minister suggests the name of the winner of the next presidential election.
In yesterday's conference, the Prime Minister said that it is very important to win the next presidential election.
The Chinese Minister is in London to discuss about climate change.
The president Donald Trump states that he wants to win the presidential election. This will require a strong media engagement.
The president Donald Trump states that he wants to win the presidential election. The UK has proposed collaboration.
The president Donald Trump states that he wants to win the presidential election. He has the support of his electors.
As you can see, there are similarities between the sentences.
I am trying to relate multiple sentences and visualize their characteristics using a (directed) graph. The graph is built from a similarity matrix, following the row order of the sentences shown above. I created a new column, Time, to record the order of the sentences, so the first row (Trump says...) is at time 1, the second row (The Prime Minister suggests...) is at time 2, and so on. Something like this:
Time Sentences
1 Trump said that it is useful to win the next presidential election.
2 The Prime Minister suggests the name of the winner of the next presidential election.
3 In today's conference, the Prime Minister said that it is very important to win the next presidential election.
...
I would then like to find the relations in order to get a clear overview of each topic. Multiple paths from a sentence would indicate that there are multiple pieces of information associated with it. To determine the similarity between two sentences, I tried to extract nouns and verbs as follows:
from nltk import pos_tag, word_tokenize

nouns = []
verbs = []
for index, row in df.iterrows():
    tagged = pos_tag(word_tokenize(row[0]))
    nouns.append([word for word, pos in tagged if pos == 'NN'])
    verbs.append([word for word, pos in tagged if pos == 'VB'])
since these are the keywords of any sentence. So when a keyword (noun or verb) appears in sentence x but not in the others, it represents a difference between those two sentences. However, I think a better option might be to use word2vec or gensim (Word Mover's Distance, WMD).
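Before moving to word2vec/WMD, the noun/verb overlap itself can already serve as a baseline similarity score, e.g. a Jaccard index over the extracted keyword sets. A minimal sketch (the keyword sets below are hypothetical stand-ins for the pos_tag output above):

```python
def jaccard(a, b):
    """Overlap between two keyword sets: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# hypothetical keyword sets extracted from two of the sentences above
kw1 = {"Trump", "win", "election"}
kw2 = {"Minister", "winner", "election"}
print(jaccard(kw1, kw2))  # 0.2  (1 shared word out of 5 distinct words)
```

A score of 0 means no shared keywords; 1 means identical keyword sets.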
This similarity has to be computed for every sentence. I would like to build a graph that shows the content of the sentences in the example above. Since there are two topics (Trump and the Chinese Minister), I need to look for sub-topics within each of them; Trump, for example, has the sub-topic presidential election. A node in my graph should represent a sentence. The words in each node would represent the differences between the sentences, i.e. the new information introduced by a sentence. For instance, the word states in the sentence at time 5 is also present in the adjacent sentences at times 6 and 7. I would just like to find a way to get a result similar to the one shown in the picture below. I have tried mainly noun and verb extraction, but that is probably not the right way to do it. What I tried to do is take the sentence at time 1 and compare it with the other sentences, assigning a similarity score (with noun and verb extraction, but also with word2vec), and repeat this for all the other sentences. My problem now is how to extract the differences in order to create a meaningful graph.
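One simple way to surface the "new information" per sentence is a plain set difference: the keywords of a sentence minus the keywords of all earlier sentences (following the Time order). A minimal sketch, with hypothetical keyword sets standing in for the extraction step:

```python
# hypothetical keyword sets, ordered by the Time column
keywords = [
    {"Trump", "win", "election"},                   # time 1
    {"Minister", "winner", "election"},             # time 2
    {"conference", "Minister", "win", "election"},  # time 3
]

seen = set()
new_info = []
for kws in keywords:
    new_info.append(kws - seen)  # words not seen in any earlier sentence
    seen |= kws

for t, diff in enumerate(new_info, start=1):
    print(t, sorted(diff))
# 1 ['Trump', 'election', 'win']
# 2 ['Minister', 'winner']
# 3 ['conference']
```

The per-sentence differences could then be used as node labels in the graph.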
For the graph part, I would consider using networkx (DiGraph):
import networkx as nx
from pyvis.network import Network

G = nx.DiGraph()
N = Network(directed=True)
which shows the direction of the relations.
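To keep the direction consistent with the Time column, every edge can point from the earlier sentence to the later one. A minimal networkx sketch, assuming a shared-keyword criterion for linking (the keyword sets are hypothetical):

```python
import networkx as nx

# hypothetical keyword sets, keyed by the Time column
keywords = {
    1: {"Trump", "election"},
    2: {"Minister", "election"},
    3: {"Minister", "conference"},
}

G = nx.DiGraph()
G.add_nodes_from(keywords)
times = sorted(keywords)
for idx, i in enumerate(times):
    for j in times[idx + 1:]:
        shared = keywords[i] & keywords[j]
        if shared:  # edge from the earlier sentence to the later one
            G.add_edge(i, j, r=", ".join(sorted(shared)))

print(sorted(G.edges(data=True)))
```

Here sentences 1 and 2 are linked by "election" and 2 and 3 by "Minister"; 1 and 3 share nothing, so no edge is added between them.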
I am providing a different example to make it clearer (though it would also be fine if you used the previous one). Apologies for the inconvenience, but since my first question was not that clear, I have to provide a better, and probably easier, example.
The NLP for separating verbs and nouns is not implemented here; I have just added a list of good words. They can be extracted and normalized relatively easily with spacy. Note that walk occurs in sentences 1, 2 and 5 and forms a triangle.
import re
import networkx as nx
import matplotlib.pyplot as plt
plt.style.use("ggplot")
sentences = [
"I went out for a walk or walking.",
"When I was walking, I saw a cat. ",
"The cat was injured. ",
"My mum's name is Marylin.",
"While I was walking, I met John. ",
"Nothing has happened.",
]
G = nx.Graph()
# set of possible good words
good_words = {"went", "walk", "cat", "walking"}
# remove punctuation and keep only good words inside sentences
words = list(
map(
lambda x: set(re.sub(r"[^\w\s]", "", x).lower().split()).intersection(
good_words
),
sentences,
)
)
# convert sentences to dict for further labeling
sentences = {k: v for k, v in enumerate(sentences)}
# add nodes
for i, sentence in sentences.items():
G.add_node(i)
# add edges if two nodes have the same word inside
for i in range(len(words)):
for j in range(i + 1, len(words)):
for edge_label in words[i].intersection(words[j]):
G.add_edge(i, j, r=edge_label)
# compute layout coords
coord = nx.spring_layout(G)
plt.figure(figsize=(20, 14))
# set label coords a bit upper the nodes
node_label_coords = {}
for node, coords in coord.items():
node_label_coords[node] = (coords[0], coords[1] + 0.04)
# draw the network
nodes = nx.draw_networkx_nodes(G, pos=coord)
edges = nx.draw_networkx_edges(G, pos=coord)
edge_labels = nx.draw_networkx_edge_labels(G, pos=coord)
node_labels = nx.draw_networkx_labels(G, pos=node_label_coords, labels=sentences)
plt.title("Sentences network")
plt.axis("off")
Update
If you want to measure the similarity between different sentences, you may want to compute the distance between sentence embeddings.
This gives you the opportunity to find semantic similarity between sentences that use different words, e.g. "A soccer game with multiple males playing" and "Some men are playing a sport". An almost-SoTA approach using BERT can be found here, and a much simpler one here.
Since you then have a similarity measure, just replace the add_edge block so that a new edge is added only when the similarity measure is greater than some threshold. The resulting edge-adding code will look like this:
# add edges if the similarity between two sentences exceeds the threshold
threshold = 0.90
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        # suppose you have some similarity function using BERT or PCA
        similarity = check_similarity(sentences[i], sentences[j])
        if similarity > threshold:
            G.add_edge(i, j, r=similarity)
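check_similarity is left undefined above; as a placeholder until a BERT-based embedding is plugged in, a bag-of-words cosine similarity can stand in (a rough sketch: it measures word overlap only, not semantics):

```python
import math
import re
from collections import Counter

def check_similarity(s1, s2):
    """Cosine similarity between bag-of-words vectors of two sentences."""
    v1 = Counter(re.findall(r"\w+", s1.lower()))
    v2 = Counter(re.findall(r"\w+", s2.lower()))
    common = set(v1) & set(v2)
    dot = sum(v1[w] * v2[w] for w in common)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)

print(check_similarity("The cat was injured.", "the cat was injured"))  # 1.0
```

With this stand-in the 0.90 threshold would only connect near-duplicate sentences; an embedding-based function is what makes semantically similar but lexically different sentences score high.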