How do I count the number of sentences, words, and characters in a file?

aks*_*aks 7 python nltk

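I wrote the following code to tokenize a paragraph of input from the file samp.txt. Can someone help me find and print the number of sentences, words, and characters in the file? I am using NLTK in Python.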

>>> import nltk
>>> with open('samp.txt') as f:
...     raw = f.read()
...
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     print(each_sentence)   # prints each tokenized sentence from samp.txt
...
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print(each_word)       # prints each tokenized word from samp.txt
...

ins*_*get 8

Try it this way (the program assumes you are working with a single text file in the directory given by dirpath):

import nltk

# dirpath must be set to the directory that contains your text file
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

print("The number of sentences =", len(corpusReader.sents()))
print("The number of paragraphs =", len(corpusReader.paras()))
# Flatten the sentences into one list of word tokens
print("The number of words =", len([word for sentence in corpusReader.sents() for word in sentence]))
# Counts the characters inside tokens only, so whitespace is not included
print("The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word]))

Hope this helps.
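For a single file such as samp.txt, the counts can also be taken directly from the tokenizer output without building a corpus reader. The following is only a minimal sketch: `count_text` and `count_file` are hypothetical helper names, not part of NLTK, and the character count here is `len(raw)`, which includes spaces and newlines, unlike the token-based count above.

```python
def count_text(raw, sent_tokenize, word_tokenize):
    """Return (sentences, words, characters) for the string raw.

    sent_tokenize and word_tokenize are callables that map a string to a
    list of strings, for example nltk.sent_tokenize and nltk.word_tokenize.
    """
    sentences = sent_tokenize(raw)
    # Tokenize sentence by sentence, as the loop in the question does.
    words = [w for s in sentences for w in word_tokenize(s)]
    # len(raw) counts every character, including spaces and newlines.
    return len(sentences), len(words), len(raw)


def count_file(path):
    """Hypothetical helper: count a file with NLTK's default tokenizers."""
    import nltk  # assumes NLTK and its Punkt tokenizer data are installed

    with open(path) as f:
        raw = f.read()
    return count_text(raw, nltk.sent_tokenize, nltk.word_tokenize)
```

With that in place, `n_sents, n_words, n_chars = count_file('samp.txt')` gives the three numbers. Note that `nltk.word_tokenize` returns punctuation marks as separate tokens, so the word count includes them unless you filter them out.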


Max*_* E. -4

There is already a program that counts words and characters: wc.
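For comparison with the Python approaches in this thread, here is a rough sketch of what wc reports for lines (`-l`), words (`-w`), and characters (`-m`). The function name `wc_counts` is made up for illustration, and this is an approximation rather than a reimplementation of wc. Note that wc has no notion of sentences, so it does not cover that part of the question.

```python
def wc_counts(text):
    """Return (lines, words, characters), roughly as wc would report them."""
    lines = text.count("\n")   # wc -l counts newline characters
    words = len(text.split())  # wc -w counts whitespace-separated runs
    chars = len(text)          # wc -m counts characters
    return lines, words, chars
```

For example, `wc_counts(open('samp.txt').read())` returns the three counts for the file from the question.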