How do I read an nltk.text.Text object from nltk.book in Python?

Question


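I'm learning a lot about natural language processing with nltk and can already do many things, but I can't find a way to read the texts that ship with the package. I tried things like this: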

from nltk.book import *
text6 #Brings the title of the text
open(text6).read()
#or
nltk.book.text6.read()


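But it doesn't seem to work, because there is no fileid. Nobody seems to have asked this before, so I assume the answer should be simple. Do you know a way to read these texts, or how to convert them to strings? Thanks in advance.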

Answer 1

alv*_*vas 7

Let's dig into the code =)

First, the nltk.book code lives at https://github.com/nltk/nltk/blob/develop/nltk/book.py

If we look closely, the texts are loaded as nltk.Text objects; for example, text6 comes from https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36:

text6 = Text(webtext.words('grail.txt'), name="Monty Python and the Holy Grail")


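The Text object comes from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286, and you can read more about how to use it at http://www.nltk.org/book/ch02.html

webtext is a corpus from nltk.corpus, so to get the raw text behind nltk.book.text6 you can load webtext directly, for example: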

>>> from nltk.corpus import webtext
>>> webtext.raw('grail.txt')
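
If you already have a Text object and just want a string, one option is to join its tokens. This is a minimal sketch, not part of the original answer: text_to_string is a hypothetical helper name, and it assumes the object exposes its word list as a .tokens attribute, as nltk.text.Text does. Joining with spaces does not reproduce the original spacing or line breaks, so webtext.raw() remains the better choice when you need the text exactly as stored.

```python
def text_to_string(text):
    """Join the tokens of a Text-like object into one space-separated string.

    Assumption: `text` is either an nltk.text.Text (which keeps its words in
    `.tokens`) or any plain iterable of string tokens.
    """
    tokens = getattr(text, "tokens", text)
    return " ".join(tokens)


# Stand-in token list so the sketch runs without downloading NLTK data.
sample_tokens = ["SCENE", "1", ":", "[", "wind", "]"]
print(text_to_string(sample_tokens))

# With NLTK and its book data installed, the same call would look like:
#   from nltk.book import text6
#   text_to_string(text6)
```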


The fileids only come into play when you load a PlaintextCorpusReader object, not a Text object (which is an already-processed object):

>>> type(webtext)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> for filename in webtext.fileids():
...     print(filename)
... 
firefox.txt
grail.txt
overheard.txt
pirates.txt
singles.txt
wine.txt
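
Building on the listing above, the raw text of every file can be collected in one pass. This is a hedged sketch rather than part of the original answer: raw_by_fileid is a hypothetical helper name, and it only assumes the reader offers the fileids() and raw(fileid) methods used earlier in this answer.

```python
def raw_by_fileid(reader):
    """Map each fileid of a corpus reader to its raw text.

    Assumption: `reader` provides fileids() and raw(fileid), as the
    PlaintextCorpusReader behind `webtext` does.
    """
    return {fileid: reader.raw(fileid) for fileid in reader.fileids()}


# With NLTK and the webtext corpus installed, usage would look like:
#   from nltk.corpus import webtext
#   texts = raw_by_fileid(webtext)
#   grail = texts["grail.txt"]   # same string as webtext.raw('grail.txt')
```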

