Repeated phrases in text (Python)

Question

Bob*_*Bob 2 python text repeat

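I have a problem and I don't know how to solve it. Please give me some advice.

I have a text. A big, big text. The task is to find all repeated phrases of length 3 (that is, consisting of three words) in the text.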

Answer 1

Rob*_*ney 7

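It seems to me that you have two problems.

The first is coming up with an efficient way to normalize your input. You say you want to find all three-word phrases in the input, but what constitutes a phrase? For example, are "the black dog" and "The black, dog?" the same phrase?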

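One way to do this, as another answer suggests, is to use something like re.findall. But that is quite inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that wastes both time and space.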
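To make the difference concrete, here is a minimal sketch contrasting re.findall with re.finditer. The sample string and the `\S+` pattern are assumptions chosen for illustration, not part of the question.

```python
import re

# Hypothetical sample input, for illustration only.
sample = "the black dog and The black, dog?"

# re.findall builds the complete list of matches before returning.
all_words = re.findall(r"\S+", sample)

# re.finditer returns a lazy iterator of match objects, so each word
# is produced only when the consumer asks for it.
lazy_words = (m.group() for m in re.finditer(r"\S+", sample))
first = next(lazy_words)
```

With a short string the two are equivalent in practice; the list only becomes a cost when the input is large.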

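A better approach is to treat the input as a stream and build a generator that pulls out one word at a time. Here is an example that uses whitespace as the delimiter between words, then strips non-alphabetic characters from each word and converts it to lowercase: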

>>> import re
>>> def words(text):
...     pattern = re.compile(r"[^\s]+")
...     non_alpha = re.compile(r"[^a-z]", re.IGNORECASE)
...     for match in pattern.finditer(text):
...         nxt = non_alpha.sub("", match.group()).lower()
...         if nxt:  # skip blank, non-alpha words
...             yield nxt
...
>>> text = "O'er the bright blue sea, for Sir Joseph Porter K.C.B."
>>> list(words(text))
['oer', 'the', 'bright', 'blue', 'sea', 'for', 'sir', 'joseph', 'porter', 'kcb']

The second problem is grouping the normalized words into three-word phrases. Again, this is a place where a generator works efficiently:

>>> def phrases(words):
...     phrase = []
...     for word in words:
...         phrase.append(word)
...         if len(phrase) > 3:
...             phrase.pop(0)  # drop the oldest word to keep a 3-word window
...         if len(phrase) == 3:
...             yield tuple(phrase)
...
>>> list(phrases(words(text)))
[('oer', 'the', 'bright'), ('the', 'bright', 'blue'), ('bright', 'blue', 'sea'), ('blue', 'sea', 'for'), ('sea', 'for', 'sir'), ('for', 'sir', 'joseph'), ('sir', 'joseph', 'porter'), ('joseph', 'porter', 'kcb')]

There is almost certainly a simpler version of that function possible, but this one is efficient and not hard to understand.
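One possible simplification, offered as a sketch rather than as the answer's own method, is to let collections.deque manage the window. The name sliding_phrases and the size parameter are hypothetical and introduced here only for illustration.

```python
from collections import deque


def sliding_phrases(words, size=3):
    """Yield each run of `size` consecutive words as a tuple.

    `sliding_phrases` and `size` are illustrative names, not from
    the original answer.
    """
    # A deque with maxlen discards its oldest item automatically
    # when a new one is appended to a full window.
    window = deque(maxlen=size)
    for word in words:
        window.append(word)
        if len(window) == size:
            yield tuple(window)


example = list(sliding_phrases(iter(["a", "b", "c", "d"])))
```

Because it accepts any iterable, it can be chained after a word generator in the same way as the list-based version.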

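Significantly, chaining the generators together traverses the input only once, and it builds no large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase: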

>>> import collections
>>> counts = collections.defaultdict(int)
>>> for phrase in phrases(words(text)):
...     counts[phrase] += 1
...

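This makes a single pass over text to count the phrases. When it is done, find every entry in the dictionary whose value is greater than 1.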
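That final lookup might look like the sketch below. The sample_phrases list is a hypothetical stand-in for the stream of phrase tuples, so that the snippet runs on its own.

```python
import collections

# Hypothetical phrase tuples standing in for the generator output.
sample_phrases = [
    ("the", "black", "dog"),
    ("black", "dog", "ran"),
    ("the", "black", "dog"),
]

# Count each phrase in a single pass.
counts = collections.defaultdict(int)
for phrase in sample_phrases:
    counts[phrase] += 1

# Keep only the phrases that occurred more than once.
repeated = {phrase: n for phrase, n in counts.items() if n > 1}
```

Note that the dictionary still holds one entry per distinct phrase, so its size grows with the variety of the text rather than with its length.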

Archived:	15 years, 2 months ago
Viewed:	3636 times
Last active:	10 years, 2 months ago