I have a sentence like this:
s = " foo hello hello hello I am a big mushroom a big mushroom hello hello bye bye bye bye foo"
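I want to find every sequence of words that repeats consecutively, along with the number of times each sequence repeats. For the example above: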
[('hello', 3), ('a big mushroom', 2), ('hello', 2), ('bye', 4)]
I have a regex-based solution that almost works when every word is a single character, but I can't extend it to real words:
import re

def count_repetitions(sentence):
    # Join the single-character words into one string, then look for a run followed by copies of itself.
    return [(list(t[0]), ''.join(t).count(t[0])) for t in re.findall(r'(\w+)(\1+)', ''.join(sentence))]

l = ['x', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'g', 'h', 'i', 'i', 'i', 'i', 'a', 'b', 'c', 'd']
count_repetitions(l)
>>> [(['a', 'b', 'c'], 3), (['g', 'h'], 2), (['i', 'i'], 2)]
Note that I would want (['i'], 4) as the last element.
Each word is separated by a space character.
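This can be done with a regular expression that uses capturing groups.
In general, you can capture a repeated pattern with a regex of the form (pattern)\1+. The backreference \1 matches the exact text captured by group 1, so the regex matches pattern followed by one or more further copies of itself.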
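As a minimal illustration of that general form, here is a sketch on a made-up string (the pattern `(ab)\1+` and the sample text are assumptions chosen only for the example):

```python
import re

# Generic shape: capture something, then require one or more copies of it.
pattern = re.compile(r'(ab)\1+')

m = pattern.search('xxabababyy')
print(m.group(0))  # the whole repeated run: 'ababab'
print(m.group(1))  # the repeated unit: 'ab'
```

The whole run is available as group 0 and the repeated unit as group 1, which is what makes it possible to count repetitions afterwards.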
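To adapt this to your problem, we only need to take into account that you want the words to be separated by space characters. Here is our new regex: ((\b.+?\b)(?:\s\2)+).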
(               # open a group to capture the whole expression, GROUP 1
    (           # open a group to capture the repeated token, GROUP 2
        \b      # boundary metacharacters ensure the token is a whole word
        .+?     # matches anything non-greedily
        \b
    )
    (?:         # open a non-capturing group for the repeated terms
        \s      # match a space
        \2      # match anything matched by GROUP 2
    )+          # match one time or more
)
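If it helps, the annotated breakdown above can be compiled almost as written by using Python's `re.VERBOSE` flag, which ignores whitespace and `#` comments inside the pattern. The name `REPEATED` and the sample sentence are assumptions for illustration:

```python
import re

# Same pattern as the breakdown, kept readable with re.VERBOSE.
REPEATED = re.compile(r"""
    (                   # GROUP 1: the whole repeated run
        (               # GROUP 2: the repeated token
            \b .+? \b   # shortest text that starts and ends on a word boundary
        )
        (?: \s \2 )+    # a whitespace character plus another copy, one or more times
    )
""", re.VERBOSE)

found = REPEATED.search("say bye bye bye now")
print(found.groups())  # ('bye bye bye', 'bye')
```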
Then, using re.findall, we can find all of these patterns and work out how many times each one repeats.
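import re

def find_repeated_sequences(s):
    # Each match is a tuple: m[0] is the whole repeated run, m[1] is the repeated token.
    match = re.findall(r'((\b.+?\b)(?:\s\2)+)', s)
    # Derive the repetition count from the two lengths (assumes single-character separators).
    return [(m[1], int((len(m[0]) + 1) / (len(m[1]) + 1))) for m in match]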
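The indexing into each match relies on how `re.findall` behaves when a pattern has more than one capturing group: it returns a list of tuples, one entry per group. A small sketch on a made-up input shows the raw result before any counting is applied:

```python
import re

# With two capturing groups, findall yields (group 1, group 2) tuples.
raw = re.findall(r'((\b.+?\b)(?:\s\2)+)', "go go go stop")
print(raw)  # [('go go go', 'go')]
```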
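Note: the formula (len(m[0]) + 1) / (len(m[1]) + 1) assumes that the words are separated by exactly one whitespace character. It comes from solving the following equation for count, where total_length is len(m[0]) and token_length is len(m[1]):
total_length = count × (token_length + 1) - 1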
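To make the arithmetic concrete, the sketch below checks the formula on one run from the example sentence. The variable names `run` and `token` are assumptions for illustration; the second computation, which counts words instead of characters, is an alternative that does not depend on the separator being a single character:

```python
run = "a big mushroom a big mushroom"   # what m[0] would hold
token = "a big mushroom"                # what m[1] would hold

# total_length = count * (token_length + 1) - 1  =>  count = (total_length + 1) / (token_length + 1)
count_by_length = (len(run) + 1) // (len(token) + 1)   # (29 + 1) // (14 + 1)

# Alternative: compare word counts rather than character counts.
count_by_words = len(run.split()) // len(token.split())  # 6 // 3

print(count_by_length, count_by_words)  # 2 2
```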
s = " foo hello hello hello I am a big mushroom a big mushroom hello hello bye bye bye bye"
print(find_repeated_sequences(s))
[('hello', 3), ('a big mushroom', 2), ('hello', 2), ('bye', 4)]
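One caveat worth checking on your own data: because nothing follows the backreference \2, the pattern can treat the start of a longer word as a repeat (for instance, "big bigger" would be reported as "big" twice). A possible variant, sketched here under the hypothetical name `find_repeated_sequences_strict`, adds a word boundary after \2 and counts words rather than characters. It is an assumption-laden sketch, not part of the solution above:

```python
import re

def find_repeated_sequences_strict(s):
    # Hypothetical variant: the extra \b after \2 rejects repeats that end mid-word.
    matches = re.findall(r'((\b.+?\b)(?:\s\2\b)+)', s)
    # Count words in the run versus words in the token.
    return [(token, len(run.split()) // len(token.split())) for run, token in matches]

s = " foo hello hello hello I am a big mushroom a big mushroom hello hello bye bye bye bye"
result = find_repeated_sequences_strict(s)
print(result)  # [('hello', 3), ('a big mushroom', 2), ('hello', 2), ('bye', 4)]
print(find_repeated_sequences_strict("big bigger"))  # []
```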