Dan*_*rke 2 python regex split
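I'm trying to split a sample text into a list of sentences, with the delimiters removed and no whitespace at the end of each sentence.
Sample text:
The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?
Into this (desired output):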
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
My code currently is:
import re

def sent_tokenize(text):
    # Split on sentence-ending punctuation, then trim surrounding spaces
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    return sentences
But this outputs (current output):
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing', '']
Note the extra '' at the end.
Any ideas on how to remove the extra '' at the end of my current output?
cs9*_*s95 10
nltk's sent_tokenize
If you're in the business of doing NLP, I'd strongly recommend sent_tokenize from the nltk package.
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[
'The first time you see The Second Renaissance it may look boring.',
'Look at it at least twice and definitely watch part 2.',
'It will change your view of the matrix.',
'Are the human people the ones who started the war?',
'Is AI a bad thing?'
]
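It's more robust than a regex and provides a lot of options to get the job done. More information can be found in the official documentation.
If you are picky about the trailing delimiters, you can use nltk.tokenize.RegexpTokenizer with a slightly different pattern: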
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'[^.?!]+')
>>> list(map(str.strip, tokenizer.tokenize(text)))
[
'The first time you see The Second Renaissance it may look boring',
'Look at it at least twice and definitely watch part 2',
'It will change your view of the matrix',
'Are the human people the ones who started the war',
'Is AI a bad thing'
]
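For readers who don't want the NLTK dependency, the same "match everything that is not a delimiter" idea can be sketched with the standard library alone. This is only a rough equivalent under the assumption that the delimiters are exactly ., ! and ?; the variable names text and sentences are just for illustration.

```python
import re

text = (
    "The first time you see The Second Renaissance it may look boring. "
    "Look at it at least twice and definitely watch part 2. "
    "It will change your view of the matrix. "
    "Are the human people the ones who started the war? "
    "Is AI a bad thing?"
)

# Collect runs of characters that are not sentence delimiters,
# trim whitespace, and skip any run that is only whitespace.
sentences = [s.strip() for s in re.findall(r"[^.!?]+", text) if s.strip()]
print(sentences)
```

Because this matches the sentences instead of splitting on the delimiters, no empty string is produced after the final "?".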
re.split
If you must use a regex, then you'll need to modify your pattern by adding a negative lookahead:
>>> import re
>>> list(map(str.strip, re.split(r"[.!?](?!$)", text)))
[
'The first time you see The Second Renaissance it may look boring',
'Look at it at least twice and definitely watch part 2',
'It will change your view of the matrix',
'Are the human people the ones who started the war',
'Is AI a bad thing?'
]
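Adding (?!$) specifies that you split only when the delimiter is not at the very end of the string. Unfortunately, I'm not sure the trailing delimiter on the last sentence can be reasonably removed without doing something like result[-1] = result[-1][:-1].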
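One possible way around that last-sentence problem, offered as a sketch and not as part of the approach above, is to keep the question's original pattern and simply drop the empty pieces after splitting. The function name split_sentences is hypothetical, chosen to avoid clashing with nltk's sent_tokenize.

```python
import re

def split_sentences(text):
    """Split text on ., ! or ? and return the non-empty, trimmed pieces."""
    pieces = re.split(r"[.!?]", text)
    # A delimiter at the very end leaves an empty string behind;
    # filtering on the stripped piece removes it.
    return [piece.strip() for piece in pieces if piece.strip()]

text = (
    "The first time you see The Second Renaissance it may look boring. "
    "Look at it at least twice and definitely watch part 2. "
    "It will change your view of the matrix. "
    "Are the human people the ones who started the war? "
    "Is AI a bad thing?"
)
result = split_sentences(text)
print(result)
```

The filter also copes with repeated delimiters such as "?!" or "...", which would otherwise leave several empty strings in the list.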