Extract a word plus 20 more words from a passage (Python)

N00*_*mer 4 python extraction nltk gensim

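Yep, still playing with Python.

I decided to try out Gensim, a tool that finds the topics for a chosen word and its context.

So I would like to know how to find a word in a piece of text and extract 20 words around it (that is, the 10 words before the word and the 10 words after it), then save that together with other such extractions so Gensim can be run on them.

What seems hard to me is finding a way to extract the 10 words before and after the chosen word once it is found. I have played with nltk before, and just tokenizing the text into words or sentences made it easy to get hold of the sentences. Getting the words or sentences before and after a specific sentence still seems hard to figure out.

For those who are confused (it is 1 am here, so I may be the confused one), I'll show it with an example: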

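As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"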

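If we say the word is Snow-White, then I would want to extract this part:

her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will

That is, the 10 words before and the 10 words after Snow-White.

If it can be done in nltk and is easier, getting the sentence before and the sentence after the one in which Snow-White appears would be cool enough too.

I mean whatever works best; I would be happy with either of the two solutions if someone could help me.

If this can also be done with Gensim, and that is easier, then I would be happy with that as well. So any of the three ways is fine. I just want to see how this can be done, because my mind is blank.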

Ray*_*ger 7

The process is called Keyword in Context (KWIC).

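The first step is to split your input into words. There are many ways to do that with the regular expressions module; see re.split or re.findall, for example.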
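As a rough illustration of that first step, here is a minimal sketch comparing the two calls. The `sample` string is a shortened excerpt of the passage in the question, and the word pattern is only one reasonable choice (it keeps hyphens and apostrophes so that Snow-White stays a single token).

```python
import re

# Shortened excerpt of the passage from the question (illustration only)
sample = 'She was so angry to hear that Snow-White was yet living. "But now," thought she.'

# re.findall keeps only the runs that look like words
# (letters, apostrophes, hyphens), so punctuation is dropped.
words = re.findall(r"[A-Za-z'-]+", sample)

# re.split cuts on whitespace instead, so punctuation stays attached.
chunks = re.split(r"\s+", sample.strip())

print(words[:8])
print(chunks[8:12])
```

With re.findall the target can be compared directly against each token; with re.split you would have to strip punctuation from each chunk before comparing.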

Having located a particular word, you can use slicing to get the ten words before it and the ten words after it.
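A minimal sketch of the slicing step, assuming the words have already been pulled out with a regular expression. The function name `context_window` and the `sample` string are made up for this illustration, and it only handles the first occurrence of the target.

```python
import re

def context_window(text, target, width=10):
    """Return the first occurrence of target with up to `width` words on each side."""
    words = re.findall(r"[A-Za-z'-]+", text)
    if target not in words:
        return []
    pos = words.index(target)
    # Clamp the left edge: a negative start would wrap around to the end of the list
    start = max(pos - width, 0)
    # Slices that run past the end are simply cut short, so no clamp is needed there
    return words[start:pos + width + 1]

# Shortened excerpt of the passage from the question (illustration only)
sample = 'She was so angry to hear that Snow-White was yet living. "But now," thought she.'
print(context_window(sample, "Snow-White", width=3))
```

The clamp on the left edge matters: without it, a target found among the first few words would produce a negative start index and an empty or wrong slice.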

To build an index for all words, a deque with a maxlen is a convenient way to implement a sliding window.
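One way the deque idea might look, as a sketch rather than a definitive implementation. The name `build_kwic_index`, the `None` padding, and the `sample` string are assumptions made for this example.

```python
import re
from collections import defaultdict, deque

def build_kwic_index(text, width=10):
    """Map every word to the context windows in which it is the center word."""
    words = re.findall(r"[A-Za-z'-]+", text)
    # Pad with None so that words near either end still reach the center slot
    pad = [None] * width
    # A deque with maxlen drops its oldest item automatically on each append
    window = deque(maxlen=2 * width + 1)
    index = defaultdict(list)
    for word in pad + words + pad:
        window.append(word)
        if len(window) == window.maxlen:
            center = window[width]
            # Store the window without the padding markers
            index[center].append([w for w in window if w is not None])
    return index

# Shortened excerpt of the passage from the question (illustration only)
sample = 'She was so angry to hear that Snow-White was yet living. "But now," thought she.'
index = build_kwic_index(sample, width=2)
print(index["Snow-White"])
print(index["was"])
```

Because the window slides one word at a time, the whole index is built in a single pass, and a word that occurs more than once gets one context per occurrence.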

Here is one way to do it efficiently using itertools:

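from re import finditer
from itertools import tee, islice, chain, repeat

def kwic(text, tgtword, width=10):
    'Find all occurrences of tgtword and show the surrounding context'
    # Character spans (start, stop) of every word-like run in the text
    matches = (mo.span() for mo in finditer(r"[A-Za-z'\-]+", text))
    # Pad both ends so words near the edges still get a (shorter) context
    end = len(text)
    padded = chain(repeat((0, 0), width), matches, repeat((end, end), width))
    # Three staggered views of the same stream: left edge, target, right edge
    t1, t2, t3 = tee(padded, 3)
    t2 = islice(t2, width, None)
    t3 = islice(t3, 2*width, None)
    # Python 3: itertools.izip is gone, the built-in zip is already lazy
    for (start, _), (i, j), (_, stop) in zip(t1, t2, t3):
        if text[i:j] == tgtword:
            yield text[start:stop]

# Short excerpt of the passage from the question, so the call below runs as-is
text = 'for she was so angry to hear that Snow-White was yet living. "But now," thought she'
print(list(kwic(text, 'Snow-White')))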


Ash*_*ary 6

text = """
As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"
"""
# Split on whitespace; punctuation stays attached to the words
spl = text.split()

def ans(word, width=10):
    for ind, x in enumerate(spl):
        # Strip surrounding punctuation before comparing
        if x.strip(",'\".!") == word:
            # Clamp the left edge so a negative index cannot wrap around
            return " ".join(spl[max(ind - width, 0):ind + width + 1])
    return ""  # word not found


>>> print(ans('Snow-White'))
her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will