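I am working on a project that requires me to sort documents so that they match a set of topics.
For example, I have 4 topics: Lecture, Tutor, Lab and Exam. I have some sentences:
Now I want to sort these sentences into the topics above, and the result should be:
I did some research, and the advice I found most often was to use LDA topic modeling. But it does not seem to solve my problem, because as far as I know LDA discovers the topics in a set of documents, and I do not know how to pre-select the topics manually.
Can anyone help me? I am stuck on this.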
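This is an excellent example of using something smarter than string matching =)
Let's consider this:
Is there a way to convert each word into vector form (i.e. an array of floats)?
Is there a way to convert each sentence into the same vector form (i.e. an array of floats with the same dimensions as the word's vector form)?
First, let's get the vocabulary of all possible words in your list of sentences (let's call it the corpus):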
>>> from itertools import chain
>>> from nltk import word_tokenize
>>> s1 = "Lecture was engaging"
>>> s2 = "Tutor is very nice and active"
>>> s3 = "The content of lecture was too much for 2 hours."
>>> s4 = "Exam seem to be too difficult compare with weekly lab."
>>> list(map(word_tokenize, [s1, s2, s3, s4]))
[['Lecture', 'was', 'engaging'], ['Tutor', 'is', 'very', 'nice', 'and', 'active'], ['The', 'content', 'of', 'lecture', 'was', 'too', 'much', 'for', '2', 'hours', '.'], ['Exam', 'seem', 'to', 'be', 'too', 'difficult', 'compare', 'with', 'weekly', 'lab', '.']]
>>> vocab = sorted(set(token.lower() for token in chain(*list(map(word_tokenize, [s1, s2, s3, s4])))))
>>> vocab
['.', '2', 'active', 'and', 'be', 'compare', 'content', 'difficult', 'engaging', 'exam', 'for', 'hours', 'is', 'lab', 'lecture', 'much', 'nice', 'of', 'seem', 'the', 'to', 'too', 'tutor', 'very', 'was', 'weekly', 'with']
Now let's represent the 4 key words as vectors by using the index of each word in the vocabulary:
>>> lecture = [1 if token == 'lecture' else 0 for token in vocab]
>>> lab = [1 if token == 'lab' else 0 for token in vocab]
>>> tutor = [1 if token == 'tutor' else 0 for token in vocab]
>>> exam = [1 if token == 'exam' else 0 for token in vocab]
>>> lecture
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> lab
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> tutor
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
>>> exam
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
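The four list comprehensions above follow one pattern, so they can be folded into a small helper. This is only a minimal sketch: `one_hot` is a hypothetical name (not part of any library), and the vocabulary is copied from the output shown earlier.

```python
def one_hot(word, vocab):
    """Return a list with a 1 at the position of `word` in `vocab` and 0 elsewhere."""
    if word not in vocab:
        # An all-zero vector has norm 0 and would break cosine similarity later.
        raise ValueError(f"{word!r} is not in the vocabulary")
    return [1 if token == word else 0 for token in vocab]


# Vocabulary copied from the session above.
vocab = ['.', '2', 'active', 'and', 'be', 'compare', 'content', 'difficult',
         'engaging', 'exam', 'for', 'hours', 'is', 'lab', 'lecture', 'much',
         'nice', 'of', 'seem', 'the', 'to', 'too', 'tutor', 'very', 'was',
         'weekly', 'with']

# One vector per topic keyword, keyed by the keyword itself.
topics = {name: one_hot(name, vocab) for name in ('lecture', 'tutor', 'lab', 'exam')}
print(topics['lab'].index(1))  # position of 'lab' in the sorted vocabulary
```

Raising on an unknown word is a design choice made here for illustration; silently returning all zeros would later cause a division by zero in the cosine similarity.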
Similarly, we loop through each sentence and convert it into vector form:
>>> [token.lower() for token in word_tokenize(s1)]
['lecture', 'was', 'engaging']
>>> s1_tokens = [token.lower() for token in word_tokenize(s1)]
>>> s1_vec = [1 if token in s1_tokens else 0 for token in vocab]
>>> s1_vec
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
Repeat the same for all the sentences:
>>> s2_tokens = [token.lower() for token in word_tokenize(s2)]
>>> s3_tokens = [token.lower() for token in word_tokenize(s3)]
>>> s4_tokens = [token.lower() for token in word_tokenize(s4)]
>>> s2_vec = [1 if token in s2_tokens else 0 for token in vocab]
>>> s3_vec = [1 if token in s3_tokens else 0 for token in vocab]
>>> s4_vec = [1 if token in s4_tokens else 0 for token in vocab]
>>> s2_vec
[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
>>> s3_vec
[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
>>> s4_vec
[1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
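The per-sentence steps above can be wrapped into functions so that adding a fifth sentence does not mean more copy-pasting. This is a rough sketch, not the only way to do it: `tokenize`, `build_vocab` and `to_binary_vector` are hypothetical names, and the regex tokenizer is an assumed stand-in for `word_tokenize` that happens to split these four sentences the same way.

```python
import re


def tokenize(text):
    # Assumption: a simple regex stand-in for nltk's word_tokenize.
    # Lowercases, then splits into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(sentences):
    # Sorted list of every distinct token seen in the corpus.
    return sorted({token for sentence in sentences for token in tokenize(sentence)})


def to_binary_vector(sentence, vocab):
    # 1 where the vocabulary word occurs in the sentence, 0 otherwise.
    tokens = set(tokenize(sentence))
    return [1 if token in tokens else 0 for token in vocab]


sentences = [
    "Lecture was engaging",
    "Tutor is very nice and active",
    "The content of lecture was too much for 2 hours.",
    "Exam seem to be too difficult compare with weekly lab.",
]
vocab = build_vocab(sentences)
vectors = [to_binary_vector(sentence, vocab) for sentence in sentences]
print(vectors[0])
```

Note that these vectors only record whether a word is present; a word that occurs twice in a sentence still contributes a single 1.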
Now, given the vector forms of the sentences and the words, you can use a similarity function, e.g. cosine similarity:
>>> from numpy import dot
>>> from numpy.linalg import norm
>>>
>>> cos_sim = lambda x, y: dot(x,y)/(norm(x)*norm(y))
>>> cos_sim(s1_vec, lecture)
0.5773502691896258
>>> cos_sim(s1_vec, lab)
0.0
>>> cos_sim(s1_vec, exam)
0.0
>>> cos_sim(s1_vec, tutor)
0.0
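To see where 0.577 comes from, here is a short worked check that assumes only the binary vectors defined above. The topic vector contains a single 1, so the dot product is 1 whenever the sentence contains the keyword, and the norm of a binary sentence vector is the square root of its number of distinct tokens (3 for s1):

```latex
\cos(\mathbf{s}_1, \mathbf{lecture})
  = \frac{\mathbf{s}_1 \cdot \mathbf{lecture}}
         {\lVert \mathbf{s}_1 \rVert \, \lVert \mathbf{lecture} \rVert}
  = \frac{1}{\sqrt{3} \cdot 1}
  \approx 0.577
```

In other words, with one-hot topics a sentence with n distinct tokens scores 1/sqrt(n) for every keyword it contains and 0 for every keyword it lacks.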
Now, to do it more systematically:
>>> topics = {'lecture': lecture, 'lab': lab, 'exam': exam, 'tutor':tutor}
>>> topics
{'lecture': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'lab': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'exam': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'tutor': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]}
>>> sentences = {'s1':s1_vec, 's2':s2_vec, 's3':s3_vec, 's4':s4_vec}
>>> for s_num, s_vec in sentences.items():
...     print(s_num)
...     for name, topic_vec in topics.items():
...         print('\t', name, cos_sim(s_vec, topic_vec))
...
s1
     lecture 0.5773502691896258
     lab 0.0
     exam 0.0
     tutor 0.0
s2
     lecture 0.0
     lab 0.0
     exam 0.0
     tutor 0.4082482904638631
s3
     lecture 0.30151134457776363
     lab 0.0
     exam 0.0
     tutor 0.0
s4
     lecture 0.0
     lab 0.30151134457776363
     exam 0.30151134457776363
     tutor 0.0
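To turn the score table into an actual assignment, one option is to keep the topic (or topics, on a tie) with the highest non-zero score. The sketch below is self-contained and makes some assumptions: `assign_topics` is a hypothetical helper, the regex tokenizer stands in for `word_tokenize`, and cosine similarity is written in plain Python rather than numpy so the block runs without extra packages.

```python
import math
import re


def tokenize(text):
    # Assumption: regex stand-in for nltk's word_tokenize (lowercased words and punctuation).
    return re.findall(r"\w+|[^\w\s]", text.lower())


def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0  # guard against all-zero vectors


def assign_topics(sentences, topic_words):
    """Map each sentence to the topic word(s) with the highest non-zero cosine score."""
    # Include the topic words so a topic that appears in no sentence still gets a dimension.
    vocab = sorted({t for s in sentences for t in tokenize(s)} | set(topic_words))
    topic_vecs = {w: [1 if t == w else 0 for t in vocab] for w in topic_words}
    result = {}
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        s_vec = [1 if t in tokens else 0 for t in vocab]
        scores = {w: cos_sim(s_vec, vec) for w, vec in topic_vecs.items()}
        best = max(scores.values())
        # Keep every topic that ties for the best score; none if all scores are 0.
        result[sentence] = [w for w, score in scores.items()
                            if best > 0 and math.isclose(score, best)]
    return result


sentences = [
    "Lecture was engaging",
    "Tutor is very nice and active",
    "The content of lecture was too much for 2 hours.",
    "Exam seem to be too difficult compare with weekly lab.",
]
assignments = assign_topics(sentences, ['lecture', 'tutor', 'lab', 'exam'])
for sentence, topics in assignments.items():
    print(topics, '<-', sentence)
```

Returning a list rather than a single label keeps ties visible instead of hiding them behind whichever topic happens to come first.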
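I think you get the idea. But we see that the scores are still tied for s4-lab vs s4-exam. So the question becomes, "Is there a way to make them diverge?", and you will jump down the rabbit hole of:
How best to represent a sentence/word as a fixed-size vector?
What similarity function to use to compare a "topic"/word with a sentence?
What is a "topic"? What does the vector actually represent?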
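As one illustration of the second question, here is a different similarity function, Jaccard similarity over token sets, which needs no vectors at all. This is only a sketch of an alternative, not a recommendation; the token lists are copied from the tokenized sentences shown earlier.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


s1_tokens = ['lecture', 'was', 'engaging']
s4_tokens = ['exam', 'seem', 'to', 'be', 'too', 'difficult',
             'compare', 'with', 'weekly', 'lab', '.']

print(jaccard(s1_tokens, ['lecture']))
print(jaccard(s4_tokens, ['lab']), jaccard(s4_tokens, ['exam']))
```

Swapping the similarity function changes the scale of the scores, but it does not by itself separate s4-lab from s4-exam: with a single keyword per topic, both measures only see that each keyword occurs once in the sentence.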
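The answer above uses what is usually called a one-hot vector to represent a word (for a sentence, the same idea gives a binary bag-of-words vector). There is a lot more complexity to "identifying sentences related to a topic" (a.k.a. document clustering/classification) than simply comparing strings. For example, can a document/sentence have more than one topic?
Do look up these keywords to understand the problem further: "natural language processing", "document classification", "machine learning". Meanwhile, if you don't mind, I guess this question is close to being "too broad".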