Saq*_*lam 1 python nlp tokenize nltk python-3.x
When I call word_tokenize, I get the following error:
File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1322, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
I have a large text file (1500.txt) from which I want to remove stop words. My code is as follows:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as File_1500:
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(File_1500)
    filtered_sentence = [w for w in words if not w in stop_words]
    print(filtered_sentence)
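word_tokenize expects a single string (a sentence or a longer piece of text), e.g. 'this is sentence 1.'. To process a whole document, pass it one string at a time from a list of sentences such as ['this is sentence 1.', "that's sentence 2!"].
Here File_1500 is a file object, not a string, which is why the call fails with a TypeError.
To get a list of sentence strings, first read the file into a string with fin.read(), then split it into sentences with sent_tokenize (I assume your input file is plain raw text that has not already been sentence-tokenized).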
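The cause of the TypeError can be reproduced without NLTK. The following is a minimal sketch using only the standard library: the pattern is a made-up stand-in for the regular expression that the Punkt tokenizer applies to its input, and io.StringIO stands in for the open file handle. Neither is NLTK's actual code.

```python
import io
import re

# Hypothetical stand-in for the regex the Punkt tokenizer runs over its input.
pattern = re.compile(r"\S+")

# io.StringIO plays the role of the open file handle (File_1500).
fake_file = io.StringIO("This is sentence 1. That's sentence 2!")

try:
    # Passing the file object itself, as in the question.
    list(pattern.finditer(fake_file))
    error_message = None
except TypeError as exc:
    # The exact wording of the message differs between Python versions.
    error_message = str(exc)

# Reading the handle first yields a str, which the regex accepts.
tokens = [m.group() for m in pattern.finditer(fake_file.read())]
```

The regex methods only accept strings (or bytes-like objects), so the handle has to be turned into a string with read() before it reaches the tokenizer.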
Also, it is better and more idiomatic to tokenize a file with NLTK like this:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words("english"))
with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    for sent in sent_tokenize(fin.read()):
        words = word_tokenize(sent)
        filtered_sentence = [w for w in words if w not in stop_words]
        print(filtered_sentence)
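One thing to watch for: the entries in NLTK's English stop word list are lowercase, so a capitalised token such as "The" passes through the filter above. A possible refinement is to compare the lowercased token instead. This is only a sketch: remove_stop_words is a hypothetical helper, and the tiny stop word set is made up so the example runs without the NLTK corpus.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Hypothetical miniature stop word set for illustration only; in practice
# this would be set(stopwords.words("english")) from nltk.corpus.
demo_stop_words = {"the", "is", "a", "on"}

demo_tokens = ["The", "cat", "is", "on", "a", "mat", "."]
filtered = remove_stop_words(demo_tokens, demo_stop_words)
```

Punctuation tokens such as "." are kept here; whether to drop them as well depends on what the filtered text is used for.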