Find the index of a specific concordance using nltk

Dav*_*les 5 python nltk

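I use the code below to get a concordance from nltk and then show the index of each concordance. I get the results shown below. So far so good.

How do I look up the index of just one specific concordance? In this small example it is easy enough to match the concordances to the indices, but if I had 300 concordances, I would want to find the index for one of them.

.index does not take multiple items in a list as an argument.

Can someone point me to the command/structure I should be using to get the indices to display with the concordances? I have attached an example below of a more useful result, which goes outside nltk to get a separate list of indices. I would like to combine these into one result, but how do I get there?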

import nltk 
nltk.download('punkt') 
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.text import Text

# Read the whole novel; the with-block closes the file afterwards
with open('mobydick.txt', 'r') as moby:
    moby_read = moby.read()

moby_text = nltk.Text(nltk.word_tokenize(moby_read))

moby_text.concordance("monstrous")

# Token positions of every exact (case-sensitive) match
moby_indices = [index for (index, item) in enumerate(moby_text) if item == "monstrous"]

print(moby_indices)
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u

[858, 1124, 9359, 9417, 32173, 94151, 122253, 122269, 162203, 205095]

Ideally, I would like to have something like this.

Displaying 11 of 11 matches:
[858] ong the former , one was of a most monstrous size . ... This came towards us , 
[1124] N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
[9359] ll over with a heathenish array of monstrous clubs and spears . Some were thick
[9417] d as you gazed , and wondered what monstrous cannibal and savage could ever hav
[32173] that has survived the flood ; most monstrous and most mountainous ! That Himmal
[94151] they might scout at Moby Dick as a monstrous fable , or still worse and more de
[122253] of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
[122269] ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
[162203] ere to enter upon those still more monstrous stories of them which are to be fo
[162203] ght have been rummaged out of this monstrous cabinet there is no telling . But 
[205095] e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u

Ori*_*PhD 2

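We can use the concordance_list function (https://www.nltk.org/api/nltk.text.html), which lets us specify the width and the number of lines. We then iterate over the lines, take each 'offset' (i.e. the line number; strictly, the position of the matched token in the text), wrap it in brackets '[' ']', and put the roi (i.e. 'monstrous') between the left and right words of each line: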

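import nltk

roi = 'monstrous'  # the word whose occurrences we want to locate

with open('/content/drive/My Drive/Colab Notebooks/DATA_FOLDERS/TEXT/mobydick.txt', 'r') as some_text:
    moby_read = some_text.read()

moby_text = nltk.Text(nltk.word_tokenize(moby_read))
# concordance_list returns the matches as objects instead of printing them
moby_text = moby_text.concordance_list(roi, width=22, lines=1000)
for line in moby_text:
    # line.left and line.right are lists of context tokens around the match
    print('[' + str(line.offset) + '] ' + ' '.join(line.left) + ' ' + roi + ' ' + ' '.join(line.right))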

Or, if you find this more readable (import numpy as np):

for line in moby_text:
    print('[' + str(line.offset) + '] ', np.append(' '.join(np.append(np.array(line.left), roi)), np.array(' '.join(line.right))))

Output (my line numbers don't match yours because I used this source: https://gist.github.com/StevenClontz/4445774, which just has different spacing/line numbers):

[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
[1652] the Psalms. ' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears .
[9933] gazed , and wondered what monstrous cannibal and savage could
[32736] survived the Flood ; most monstrous and most mountainous !
[95115] scout at Moby-Dick as a monstrous fable , or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field , Desmarest , monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales ,
[123541] enter upon those still more monstrous stories of them which
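
As a small self-contained check of what offset means, the sketch below runs concordance_list on a hand-made token list (illustrative data, not taken from the novel) and compares the offsets with the enumerate approach from the question. It assumes an NLTK version whose concordance lines expose a query field, and the lowercase comparison reflects the assumption that concordance lookups ignore case.

```python
from nltk.text import Text

# A tiny hand-made token list stands in for the tokenized novel (illustrative only).
tokens = "It was a monstrous size and a Monstrous fable".split()
text = Text(tokens)

lines = text.concordance_list("monstrous", width=40, lines=10)
for line in lines:
    # line.offset is the position of the match in the token list;
    # line.query is the matched token as it appears in the text.
    print(f"[{line.offset}] {' '.join(line.left)} {line.query} {' '.join(line.right)}")

# Offsets reported by NLTK.
offsets = [line.offset for line in lines]

# The enumerate approach from the question, made case-insensitive for comparison.
manual = [i for i, token in enumerate(tokens) if token.lower() == "monstrous"]
```

If the two lists agree, the offset can be used directly as the index asked for in the question, and printing line.query instead of roi keeps the original capitalisation of each match.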

If we want to take punctuation etc. into account, we can do something like this:

punctuation = {'.', ',', ';', '!'}  # tokens that attach to the preceding word

for line in moby_text:
    return_text = '[' + str(line.offset) + '] '
    # Left context: punctuation is appended directly, words are separated by a space
    for word in line.left:
        if word in punctuation:
            return_text += word
        else:
            return_text += ' ' + word if return_text[-1] != ' ' else word
    # Separate the search term from the left context before appending it
    return_text += (roi if return_text[-1] == ' ' else ' ' + roi) + ' '
    # Right context: same rule as for the left context
    for word in line.right:
        if word in punctuation:
            return_text += word
        else:
            return_text += ' ' + word if return_text[-1] != ' ' else word
    print(return_text)

Output:

[494] 306 LV. OF THE monstrous PICTURES OF WHALES.
[1385] one was of a most monstrous size. * *
[1652] the Psalms.' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears.
[9933] gazed, and wondered what monstrous cannibal and savage could
[32736] survived the Flood; most monstrous and most mountainous!
[95115] scout at Moby-Dick as a monstrous fable, or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field, Desmarest, monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales,
[123541] enter upon those still more monstrous stories of them which

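But you may need to tweak it, as I didn't put much thought into the different contexts that could come up (e.g. '*', numbers, chapter titles in ALL CAPS, Roman numerals, etc.), and it depends more on what you want the output text to look like. I am just providing an example.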
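
One way to make such tweaks easier is to move the joining rule into a small helper, so the set of attached punctuation lives in one place. This is only a sketch: join_tokens, format_line and ATTACH_TO_PREVIOUS are hypothetical names, and the sample tokens are copied from the second output line above.

```python
# Tokens that should be glued to the preceding token (extend as needed).
ATTACH_TO_PREVIOUS = {".", ",", ";", "!"}


def join_tokens(tokens):
    """Join tokens with spaces, attaching listed punctuation to the previous token."""
    text = ""
    for token in tokens:
        if token in ATTACH_TO_PREVIOUS or not text:
            text += token
        else:
            text += " " + token
    return text


def format_line(offset, left, query, right):
    """Build one '[offset] left query right' line from lists of tokens."""
    return f"[{offset}] {join_tokens(list(left) + [query] + list(right))}"


formatted = format_line(
    1385,
    ["one", "was", "of", "a", "most"],
    "monstrous",
    ["size", ".", "*", "*"],
)
print(formatted)  # [1385] one was of a most monstrous size. * *
```

With concordance_list results, the call would be format_line(line.offset, line.left, roi, line.right) for each line.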

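Note: width in the concordance_list function limits how much context is kept to the left (and right) of the match. In current NLTK releases the number of context tokens appears to be about width // 4 rather than a character count, so if we set it to 4, only the next left word is kept and the first line would print:

[494] THE monstrous

Setting it to 3 cuts off 'THE', the next left word of 'monstrous':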

[494] monstrous

lines in the concordance_list function, meanwhile, refers to the maximum number of 'monstrous' lines, so if we only want to include the first two lines (i.e. moby_text.concordance_list(..., lines=2)):

[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
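
For comparison, similar output can be produced without concordance_list by extending the enumerate approach from the question. This is a minimal sketch, not how NLTK does it: indexed_concordance, context and max_lines are hypothetical names, matching is exact and case-sensitive, and the sample sentence is made up for illustration.

```python
def indexed_concordance(tokens, word, context=5, max_lines=25):
    """Return '[index] left word right' strings for exact matches of `word`.

    `tokens` is a list of strings, `context` is the number of tokens kept on
    each side, and `max_lines` caps how many matches are returned.
    """
    results = []
    for index, token in enumerate(tokens):
        if token != word:
            continue
        left = tokens[max(0, index - context):index]
        right = tokens[index + 1:index + 1 + context]
        results.append(f"[{index}] " + " ".join(left + [token] + right))
        if len(results) == max_lines:
            break
    return results


sample = "one was of a most monstrous size and a monstrous fable".split()
rows = indexed_concordance(sample, "monstrous", context=2)
for row in rows:
    print(row)
```

Here context and max_lines play roughly the roles that width and lines play in concordance_list, except that context counts tokens directly.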