Jer*_*rry 5 python parsing nlp sentiment-analysis
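I am doing sentiment analysis on a given set of documents. My goal is to find the closest or surrounding adjectives with respect to a target phrase in each sentence. I know how to extract the words surrounding a target phrase, but how do I find the adjective, NNP/VBN, or other POS-tagged word that is relatively close or closest to the target phrase?
Here is a sketch of how I would get the words surrounding my target phrase.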
# Lists rather than sets, so each sentence stays paired with its target phrase
sentence_List = ["Obviously one of the most important features of any computer is the human interface.", "Good for everyday computing and web browsing.",
                 "My problem was with DELL Customer Service", "I play a lot of casual games online[comma] and the touchpad is very responsive"]
target_phraseList = ["human interface", "everyday computing", "DELL Customer Service", "touchpad"]
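Note that my original dataset is given as a dataframe that contains the list of sentences and their respective target phrases. Here I just simulate the data as follows: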
import pandas as pd
# Sentences are the values; the target phrases form the index
df = pd.Series(list(sentence_List), index=list(target_phraseList))
df = pd.DataFrame(df)
Here I tokenize the sentences as follows:
from nltk.tokenize import word_tokenize
# One list of tokens per sentence
tokenized_sents = [word_tokenize(sentence) for sentence in sentence_List]
tokenized = list(tokenized_sents)
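Then I tried to find the words surrounding my target phrases by using the loop linked here. However, I want to find the relatively closer or closest adjectives, verbs, or VBN tokens with respect to my target phrase. How can I do this? Any ideas? Thanks.
Would something like the following work for you? I recognize that some tweaks are needed to make it fully useful (checking for upper/lower case; if there is a tie, it also returns the word ahead of the target in the sentence rather than the one behind it), but hopefully it is useful enough to get you started: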
import nltk
from nltk.tokenize import MWETokenizer

def smart_tokenizer(sentence, target_phrase):
    """
    Tokenize a sentence using a full target phrase.
    """
    tokenizer = MWETokenizer()
    target_tuple = tuple(target_phrase.split())
    tokenizer.add_mwe(target_tuple)
    # nltk.pos_tag requires NLTK's POS tagger model to be downloaded
    token_sentence = nltk.pos_tag(tokenizer.tokenize(sentence.split()))
    # MWETokenizer joins the words of a phrase with its separator ('_' by default),
    # so work out what the phrase has been converted to
    temp_phrase = target_phrase.replace(' ', '_')
    target_index = [i for i, y in enumerate(token_sentence) if y[0] == temp_phrase]
    if len(target_index) == 0:
        return None, None
    else:
        return token_sentence, target_index[0]

def search(text_tag, tokenized_sentence, target_index):
    """
    Search for a part of speech (POS) nearest a target phrase of interest.
    """
    # Start at distance 1 so the target phrase itself is never returned
    for offset in range(1, len(tokenized_sentence)):
        # Each entry is a (word, POS) tuple; on a tie the word ahead wins
        ahead = target_index + offset
        behind = target_index - offset
        if ahead < len(tokenized_sentence) and tokenized_sentence[ahead][1] == text_tag:
            return tokenized_sentence[ahead][0]
        # Guard against negative indices, which would wrap around the list
        if behind >= 0 and tokenized_sentence[behind][1] == text_tag:
            return tokenized_sentence[behind][0]
    return None

x, i = smart_tokenizer(sentence='My problem was with DELL Customer Service',
                       target_phrase='DELL Customer Service')
print(search('NN', x, i))
y, j = smart_tokenizer(sentence="Good for everyday computing and web browsing.",
                       target_phrase="everyday computing")
print(search('NN', y, j))
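Since the question asks about adjectives, verbs, and VBN rather than a single tag, here is a variation on the search function that accepts several tag prefixes and also reports the distance. This is only a sketch: nearest_with_tags is a hypothetical helper name, and the tagged sentence is written by hand for illustration, so the tags a real tagger assigns may differ.

```python
from typing import Iterable, List, Optional, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)


def nearest_with_tags(
    tagged: List[TaggedToken],
    target_index: int,
    tag_prefixes: Iterable[str],
) -> Optional[Tuple[str, str, int]]:
    """Return (word, tag, distance) for the closest token whose tag starts
    with one of tag_prefixes, or None if there is no such token."""
    prefixes = tuple(tag_prefixes)
    for distance in range(1, len(tagged)):
        # Look ahead first, then behind, so ties go to the later word.
        for index in (target_index + distance, target_index - distance):
            if 0 <= index < len(tagged) and tagged[index][1].startswith(prefixes):
                word, tag = tagged[index]
                return word, tag, distance
    return None


# Hand-written tags for illustration only (assumption, not tagger output).
tagged_sentence = [
    ("I", "PRP"), ("play", "VBP"), ("casual", "JJ"), ("games", "NNS"),
    ("and", "CC"), ("the", "DT"), ("touchpad", "NN"), ("is", "VBZ"),
    ("very", "RB"), ("responsive", "JJ"),
]
target = 6  # index of "touchpad"

print(nearest_with_tags(tagged_sentence, target, ["JJ"]))  # ('responsive', 'JJ', 3)
print(nearest_with_tags(tagged_sentence, target, ["VB"]))  # ('is', 'VBZ', 1)
```

Matching on a prefix means "JJ" also covers JJR and JJS, and "VB" covers VBN, VBZ, and the other verb tags; pass the full tag (for example "VBN") to narrow the match.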
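Edit: I made some changes to address the issue of using a target phrase of arbitrary length, as you can see in the smart_tokenizer function. The key is the nltk.tokenize.MWETokenizer class (for more information, see: Python: Tokenizing with phrases). Hope this helps. As an aside, I would challenge the idea that spaCy is necessarily more elegant: at some point, someone has to write the code to get the work done. That is either the spaCy developers or you, when you roll your own solution. Their API is rather complicated, so I'll leave that exercise to you.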
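To make the phrase-merging step concrete, and to cover the upper/lower case caveat mentioned earlier, here is a dependency-free sketch of collapsing a multi-word target phrase into a single token. It is not NLTK's implementation; merge_phrase and its separator parameter are hypothetical names chosen for illustration, and it only handles the first occurrence of the phrase.

```python
from typing import List, Optional


def merge_phrase(tokens: List[str], phrase: str, separator: str = "_") -> Optional[List[str]]:
    """Collapse the first case-insensitive occurrence of phrase into one token.

    Returns None when the phrase does not occur in tokens.
    """
    words = phrase.lower().split()
    lowered = [token.lower() for token in tokens]
    n = len(words)
    if n == 0:
        return None
    for start in range(len(tokens) - n + 1):
        if lowered[start:start + n] == words:
            # Keep the original casing of the sentence in the merged token.
            merged = separator.join(tokens[start:start + n])
            return tokens[:start] + [merged] + tokens[start + n:]
    return None


tokens = "My problem was with Dell customer service".split()
print(merge_phrase(tokens, "DELL Customer Service"))
# ['My', 'problem', 'was', 'with', 'Dell_customer_service']
```

The index of the merged token can then be used as the target index when searching a tagged version of the sentence.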