How do I use the book functions (e.g. concordance) in NLTK?

Question

I downloaded the collection named book:

>>> import nltk
>>> nltk.download()

and imported the texts:

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811

I can then run commands on these texts:

>>> text1.concordance("monstrous")

How can I run these nltk commands on my own dataset? Are these collections the same as the book object in Python?

Answer 1

alv*_*vas 4

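You're right, it's quite hard to find documentation for the book.py module, so we have to get our hands dirty and look at its source code. Judging from book.py, this is how to get concordance and all the other fancy stuff from the book module on your own data: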

First, you have to put your raw text into one of nltk's corpus classes; see "Creating a new corpus with NLTK" for more details.

Second, you read the corpus words into NLTK's Text class. Then you can use the functions you see in http://nltk.org/book/ch01.html

import os

from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

# For example, I create two example texts
text1 = '''
This is a story about a foo bar. Foo likes to go to the bar and his last name is also bar. At home, he kept a lot of gold chocolate bars.
'''
text2 = '''
One day, foo went to the bar in his neighborhood and was shot down by a sheep, a blah blah black sheep.
'''
# Creating the corpus: write each text to its own file in one directory
corpusdir = './mycorpus/'
os.makedirs(corpusdir, exist_ok=True)  # the directory must exist before writing
with open(corpusdir + 'text1.txt', 'w') as fout:
    fout.write(text1)
with open(corpusdir + 'text2.txt', 'w') as fout:
    fout.write(text2)

# Read the example corpus into NLTK's corpus class.
# '.*' matches every file in corpusdir.
mycorpus = PlaintextCorpusReader(corpusdir, '.*')

# Read NLTK's corpus into NLTK's Text class,
# where your book-like concordance search is available
mytext = Text(mycorpus.words())

mytext.concordance('foo')
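
On the second part of the question: as far as I can tell from book.py, the book texts are ordinary Text objects (text1, for instance, is built from gutenberg.words('melville-moby_dick.txt')). So for a quick experiment you can probably skip the corpus directory and wrap any list of tokens directly. Here is a minimal sketch; the sample string and variable names are made up for illustration, and it uses wordpunct_tokenize because that tokenizer is regex-based and needs no extra data download:

```python
from nltk.text import Text
from nltk.tokenize import wordpunct_tokenize

# Made-up sample text, only for illustration.
raw = (
    "This is a story about a foo bar. Foo likes to go to the bar "
    "and his last name is also bar."
)

# Split the raw string into word and punctuation tokens.
tokens = wordpunct_tokenize(raw)

# Text only needs a sequence of tokens, not a corpus reader.
mytext = Text(tokens)

# Prints the matches with their surrounding context.
mytext.concordance("bar")

# count() is a plain, case-sensitive count over the token list.
bar_count = mytext.count("bar")

# concordance_list() returns the matches instead of printing them.
hits = mytext.concordance_list("bar")
for hit in hits:
    print(hit.line)
```

The corpus reader route above is still the better fit once you have many files, since it reads them lazily from disk.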

Note: you can use other NLTK CorpusReaders, and even specify custom paragraph/sentence/word tokenizers and an encoding, but for now we'll stick with the defaults.
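
If you do want to move away from the defaults, something like the following should work. It is only a sketch: the temporary directory, the file name cafe.txt and the sample sentence are hypothetical, and I picked RegexpTokenizer and Latin-1 just to show the word_tokenizer and encoding parameters of PlaintextCorpusReader:

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text
from nltk.tokenize import RegexpTokenizer

# Hypothetical throwaway corpus directory holding one Latin-1 encoded file.
corpusdir = tempfile.mkdtemp()
with open(os.path.join(corpusdir, "cafe.txt"), "w", encoding="latin-1") as fout:
    fout.write("Foo went to the café. The café was closed.")

# Custom word tokenizer: keep runs of word characters, drop punctuation.
word_tokenizer = RegexpTokenizer(r"\w+")

mycorpus = PlaintextCorpusReader(
    corpusdir,
    r".*\.txt",                     # only pick up .txt files
    word_tokenizer=word_tokenizer,  # replaces the default word tokenizer
    encoding="latin-1",             # must match how the files were written
)

words = list(mycorpus.words())
mytext = Text(words)

# Count occurrences of a token that only decodes correctly
# when the encoding above matches the file.
cafe_count = mytext.count("café")
print(cafe_count)
```

If the encoding passed to the reader does not match the files, non-ASCII words will typically come out garbled or raise a decoding error, so it is worth checking a few tokens by eye.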
