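I am trying to feed a whole paragraph into my text processor, splitting it into sentences first and then into words.
I tried the following code, but it does not work: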
#text is the paragraph input
sent_text = sent_tokenize(text)
tokenized_text = word_tokenize(sent_text.split)
tagged = nltk.pos_tag(tokenized_text)
print(tagged)
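But this does not work and gives me an error. So how do I tokenize a paragraph into sentences and then into words?
A sample paragraph:
This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child.
**Warning:** This is just random text from the internet; I do not own the above content.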
sli*_*der 35
You probably intended to loop over sent_text:
import nltk
sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences
# now loop over each sentence and tokenize it separately
for sentence in sent_text:
    tokenized_text = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)
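For context on why the attempt in the question fails: sent_tokenize returns a plain Python list of sentence strings, and a list has no split attribute, so the expression sent_text.split raises an error before word_tokenize is even called. The sketch below reproduces this with the standard library only. The hard-coded sent_text list is a made-up stand-in for real sent_tokenize output, and str.split is only a crude substitute for word_tokenize.

```python
# Stand-in for the list that sent_tokenize(text) would return (assumed data).
sent_text = [
    "He sank down in despair at the child's feet.",
    "He offered a small prayer to the child.",
]

# Accessing .split on the list itself fails: lists have no such attribute.
try:
    sent_text.split
except AttributeError as exc:
    error_message = str(exc)
    print("AttributeError:", error_message)

# Each element is a string, so the work has to happen per sentence.
# str.split is whitespace-only and is NOT equivalent to word_tokenize.
words_per_sentence = [sentence.split() for sentence in sent_text]
print(words_per_sentence)
```

The same per-sentence shape is what the loop above does with nltk.word_tokenize in place of str.split.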
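Here is a shorter version. It gives you a data structure containing each individual sentence, and each token within that sentence. I prefer TweetTokenizer for messy, real-world language. The sentence tokenizer is considered decent, but be careful not to lowercase your words until after this step, because doing so may reduce the accuracy of boundary detection in messy text.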
from nltk.tokenize import TweetTokenizer, sent_tokenize
tokenizer_words = TweetTokenizer()
# split into sentences first, then tokenize each sentence into words
tokens_sentences = [tokenizer_words.tokenize(t)
                    for t in sent_tokenize(input_text)]
print(tokens_sentences)
This is what the output looks like; I cleaned it up so the structure stands out:
[
['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the', 'little', 'dark-brown', 'dog', ',', 'and', 'wounded', 'him', 'to', 'the', 'heart', '.'],
['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'],
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'],
['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
]
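On the casing caveat: one way to follow it is to keep the original casing while splitting and tokenizing, and only normalize the nested result afterwards. A minimal sketch, assuming tokens_sentences has the shape shown above (the two sentences are copied from that output; lowered and flat_tokens are names introduced here for illustration):

```python
# Nested structure as produced above: one list of tokens per sentence.
tokens_sentences = [
    ['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'],
    ['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes',
     'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.'],
]

# Lowercase only now, after sentence boundaries have already been found.
lowered = [[token.lower() for token in sentence] for sentence in tokens_sentences]

# Flatten to a single token list if the sentence grouping is no longer needed.
flat_tokens = [token for sentence in lowered for token in sentence]

print(lowered[0][:3])
print(len(flat_tokens))
```

Keeping the nested list around is cheap, and it lets later steps (tagging, per-sentence statistics) still work sentence by sentence.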
小智 5
import nltk
textsample ="This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child."
sentences = nltk.sent_tokenize(textsample)
words = nltk.word_tokenize(textsample)
print(sentences)
print([w for w in words if w.isalpha()])
The last line above ensures that only words, not special characters, appear in the output. The sentence output is as follows:
['This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart.',
"He sank down in despair at the child's feet.",
'When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner.',
'At the same time with his ears and his eyes he offered a small prayer to the child.']
After removing the special characters, the word output is as follows:
['This',
'thing',
'seemed',
'to',
'overpower',
'and',
'astonish',
'the',
'little',
'dog',
'and',
'wounded',
'him',
'to',
'the',
'heart',
'He',
'sank',
'down',
'in',
'despair',
'at',
'the',
'child',
'feet',
'When',
'the',
'blow',
'was',
'repeated',
'together',
'with',
'an',
'admonition',
'in',
'childish',
'sentences',
'he',
'turned',
'over',
'upon',
'his',
'back',
'and',
'held',
'his',
'paws',
'in',
'a',
'peculiar',
'manner',
'At',
'the',
'same',
'time',
'with',
'his',
'ears',
'and',
'his',
'eyes',
'he',
'offered',
'a',
'small',
'prayer',
'to',
'the',
'child']
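One thing worth noticing in the word list above: 'dark-brown' is missing, because str.isalpha returns False for any token containing a hyphen. If hyphenated words should survive the filter, a regular expression is one possible alternative. The sketch below is an assumption on my part, not part of the answer: WORD_RE is a hypothetical pattern, and the token list is hand-written to resemble word_tokenize output rather than produced by NLTK.

```python
import re

# Hypothetical pattern: letters, optionally joined by hyphens or apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

# Hand-written tokens resembling word_tokenize output (assumed, not generated).
tokens = ['the', 'little', 'dark-brown', 'dog', ',', 'child', "'s", 'feet', '.']

# The filter from the answer: drops punctuation, but also 'dark-brown'.
alpha_only = [t for t in tokens if t.isalpha()]

# Regex filter: drops punctuation and the bare "'s", keeps 'dark-brown'.
wordlike = [t for t in tokens if WORD_RE.fullmatch(t)]

print(alpha_only)
print(wordlike)
```

Which filter is appropriate depends on whether hyphenated compounds count as words for the task at hand.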