How can I get all bigrams within a given window size?

Question

Suppose I have this string:

my_string = "This is an example string"

I would like to know whether there is a quick way to compute all bigrams within a given "window".

For example, if the window is two words, all possible bigrams are:

["This is","is This","is an","an is","an example","example an","example string","string example"]

But if the window is three words, the first three-word window gives these bigrams:

["This is","is an","This an","an this",...]

Getting ordinary bigrams with sklearn is easy. For example, one can do:

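from sklearn.feature_extraction.text import CountVectorizer

bigrams = CountVectorizer(analyzer="word",
                          strip_accents="ascii",
                          lowercase=True,
                          ngram_range=(2, 2))

# fit_transform expects an iterable of documents, so wrap the string in a list
bigrams_counts = bigrams.fit_transform([my_string])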

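This gives you all the bigrams (and even their counts), but it only includes bigrams that actually occur in the string, not the other combinations (i.e. "This an" and "an this" would be missing).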

So, do you know of a way to get all bigrams within a given window?

Answer 1

Ilj*_*ilä 5

From the example:

["This is","is an","This an","an this",...]

These do not look like bigrams, but rather like permutations of the words in the window. For a window of 3 words, that would be:

from itertools import permutations, chain
from functools import partial

my_string = "This is an example string".split()
set(chain.from_iterable(map(partial(permutations,
                                    r=2),
                            zip(my_string,
                                my_string[1:],
                                my_string[2:]))))
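
The three-way `zip` above hard-codes a window of three words. As a minimal sketch of how the same idea might be generalized to an arbitrary window size, the helper below slides a window over the token list and collects the permutations of each slice. The names `window_pairs`, `tokens`, and `window` are illustrative and do not come from the answer.

```python
from itertools import permutations


def window_pairs(tokens, window):
    """Return the set of ordered word pairs drawn from every sliding window.

    `window_pairs` is a hypothetical helper name used for illustration only.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    pairs = set()
    # Slide a window of `window` tokens over the sequence, one token at a time.
    # Like the zip-based version, this yields nothing if the input is shorter
    # than the window.
    for start in range(len(tokens) - window + 1):
        pairs.update(permutations(tokens[start:start + window], 2))
    return pairs


tokens = "This is an example string".split()
two_word = window_pairs(tokens, 2)
three_word = window_pairs(tokens, 3)
```

With a window of two, this should reproduce the eight ordered pairs listed in the question; with a window of three, pairs such as `("This", "an")` and `("an", "This")` appear as well.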

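If you need counts, use a Counter, but be aware that overlapping windows will double, triple, etc. the count of a given word pair (depending on the amount of overlap, i.e. the window size).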

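from collections import Counter
from itertools import combinations

# combinations (unlike permutations) keeps only the in-order pair from each window
Counter(chain.from_iterable(map(partial(combinations, r=2),
                                zip(my_string,
                                    my_string[1:],
                                    my_string[2:]))))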

Result:

Counter({('is', 'an'): 2, ('an', 'example'): 2, ('This', 'is'): 1, ('This', 'an'): 1, ('example', 'string'): 1, ('an', 'string'): 1, ('is', 'example'): 1})
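
If the inflated counts from overlapping windows are unwanted, one possible alternative (an assumption about what is needed, not part of the original answer) is to count each pair of positions exactly once: pair every token with the `window - 1` tokens that follow it. The helper name `count_pairs_once` is hypothetical.

```python
from collections import Counter


def count_pairs_once(tokens, window):
    """Count in-order word pairs, once per pair of positions.

    Two positions are paired when they are fewer than `window` tokens apart,
    so no pair is counted again just because several windows contain it.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair the token at position i with the next `window - 1` tokens only.
        for right in tokens[i + 1:i + window]:
            counts[(left, right)] += 1
    return counts


tokens = "This is an example string".split()
pair_counts = count_pairs_once(tokens, 3)
```

For this input every pair is counted once, so `("is", "an")` and `("an", "example")` no longer show a count of 2. For inputs at least as long as the window, this covers the same pairs as the window-based version; for shorter inputs it still pairs the tokens that exist, whereas the `zip` version yields nothing.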

Finally, if you need each window as a separate result, skip the chaining:

# Materialize each window; in Python 3, map() alone would leave lazy permutations objects
[list(permutations(window, r=2))
 for window in zip(my_string, my_string[1:], my_string[2:])]
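
The question writes its bigrams as space-joined strings such as "This is" rather than tuples. A small sketch of producing that shape per window, for any window size, might look like the following; `bigram_strings_per_window` is an illustrative name, not something from the answer.

```python
from itertools import permutations


def bigram_strings_per_window(tokens, window):
    """Yield, for each sliding window, its ordered pairs joined into strings."""
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]
        yield [" ".join(pair) for pair in permutations(chunk, 2)]


tokens = "This is an example string".split()
per_window = list(bigram_strings_per_window(tokens, 3))
```

Here `per_window` holds one list per window, so the first entry contains the six ordered pairs built from "This", "is", and "an".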
