I wrote the following regular expressions to tag certain phrase patterns:
pattern = """
P2: {<JJ>+ <RB>? <JJ>* <NN>+ <VB>* <JJ>*}
P1: {<JJ>? <NN>+ <CC>? <NN>* <VB>? <RB>* <JJ>+}
P3: {<NP1><IN><NP2>}
P4: {<NP2><IN><NP1>}
"""
This pattern correctly tags phrases such as:
a = 'The pizza was good but pasta was bad'
and gives the desired output of two phrases.
However, if my sentence is something like this:
a = 'The pizza was awesome and brilliant'
only this phrase is matched:
'pizza was awesome'
instead of the desired:
'pizza was awesome and brilliant'
How can I extend the regex pattern so that it also covers my second example?
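One possible direction, sketched under assumptions rather than taken from the question: let the rule end with an optional, repeatable "conjunction plus adjectives" tail, written `(<CC> <JJ>+)*`. The POS tags below are assigned by hand so that no tagger model is needed, and `<VB.*>` is used instead of `<VB>` on the assumption that "was" is tagged `VBD`. Only a variant of rule P1 is shown.

```python
import nltk

# Hand-assigned POS tags (an assumption, so no tagger download is required).
tagged = [
    ("The", "DT"), ("pizza", "NN"), ("was", "VBD"),
    ("awesome", "JJ"), ("and", "CC"), ("brilliant", "JJ"),
]

# Variant of P1: <VB.*> matches any verb tag (assumption), and the added
# "(<CC> <JJ>+)*" tail allows coordinated adjectives such as "and brilliant".
grammar = r"""
P1: {<JJ>? <NN>+ <CC>? <NN>* <VB.*>? <RB>* <JJ>+ (<CC> <JJ>+)*}
"""

parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every P1 chunk as one string per chunk.
phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "P1")
]
print(phrases)
```

Because the tail is optional, a sentence with a single adjective ("pizza was awesome") should still be chunked by the same rule.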
Given an input sentence that has BIO chunk tags:
[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
I need to extract the relevant phrases. For example, to extract 'NP', I need the spans made up of tuples tagged B-NP and I-NP.
[OUT]:
[('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]
(Note: the numbers in the extracted tuples are the token indices.)
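To make the expected grouping concrete, here is a minimal sketch of the BIO convention, assuming that a `B-` tag always opens a new chunk, an `I-` tag continues the open chunk, and anything else closes it. The function name `extract_bio_chunks` is hypothetical and is not part of NLTK.

```python
def extract_bio_chunks(tagged_sent, chunk_type):
    """Return (phrase, 'i-j-k') pairs for every chunk of the given type."""
    chunks = []
    words, positions = [], []

    def flush():
        # Emit the chunk collected so far, if any, and reset the buffers.
        if words:
            chunks.append((" ".join(words), "-".join(map(str, positions))))
            words.clear()
            positions.clear()

    for idx, (word, tag) in enumerate(tagged_sent):
        if tag == "B-" + chunk_type:
            flush()  # a B- tag starts a new chunk, even right after another
            words.append(word)
            positions.append(idx)
        elif tag == "I-" + chunk_type and words:
            words.append(word)
            positions.append(idx)
        else:
            flush()
    flush()  # do not lose a chunk that ends the sentence
    return chunks


sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'),
        ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'),
        ('?', 'O')]
print(extract_bio_chunks(sent, 'NP'))
# [('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]
```

Matching the `B-` and `I-` prefixes separately keeps two adjacent chunks of the same type apart, and the final `flush()` covers a chunk that runs to the last token.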
I tried to extract them with the following code:
def extract_chunks(tagged_sent, chunk_type):
    current_chunk = []
    current_chunk_position = []
    for idx, word_pos in enumerate(tagged_sent):
        word, pos = word_pos
        if '-'+chunk_type in pos: # Append the word to the current_chunk.
            current_chunk.append(word)
            current_chunk_position.append(idx)
        else:
            if current_chunk: # Flush the full chunk when out of an NP.
                _chunk_str = ' '.join(current_chunk)
                _chunk_pos_str = '-'.join(map(str, current_chunk_position))
                yield _chunk_str, _chunk_pos_str
…

I have a machine learning task that involves a large amount of text data. I want to identify and extract noun phrases in the training text so that I can use them for feature construction later in the pipeline. I have extracted the type of noun phrases I want from the text, but I am fairly new to NLTK, so I approached the problem in a way that lets me break down each step of the list comprehensions, as shown below.
But my real question is: am I reinventing the wheel here? Is there a faster way that I am not seeing?
import nltk
import pandas as pd
# Raw string so that "\U" and "\t" in the path are not read as escape sequences
myData = pd.read_excel(r"\User\train_.xlsx")
texts = myData['message']
# Defining a grammar & Parser (raw string keeps "\w" intact)
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)
tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
# In NLTK 3, label is a method, so it has to be called: t.label()
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.label() == 'NP')] for tree in phrases]
Flatten the list of lists of lists of tuples that we end up with into just a list of lists of tuples:
leaves = [tupls for sublists in leaves for tupls in sublists]
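As an illustration of what this step does to the data shape, here is a small sketch on made-up data (the sample words and tags are assumptions, not taken from the question). It also shows that `itertools.chain.from_iterable` gives the same result as the nested comprehension.

```python
from itertools import chain

# Toy stand-in for `leaves`: one entry per text, each holding zero or more
# noun-phrase chunks, each chunk being a list of (word, tag) tuples.
leaves = [
    [[("pizza", "NN"), ("oven", "NN")]],
    [],  # a text in which no noun phrase was found
    [[("noun", "NN"), ("phrase", "NN")], [("text", "NN"), ("data", "NNS")]],
]

# Remove one level of nesting: the per-text grouping disappears and
# every chunk ends up in a single list.
flat = list(chain.from_iterable(leaves))

# Same result as the nested list comprehension.
assert flat == [tupls for sublists in leaves for tupls in sublists]
print(len(flat))  # 3 chunks in total
```

Note that after flattening it is no longer possible to tell which text a chunk came from, which matters if the phrases are later used as per-document features.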
Join the extracted terms into a bigram:
nounphrases = …