用'X'替换后重新计算子串计数更新

alv*_*vas 4 python string counter replace substring

给定一个字符串:

s = 'cdababef'
Run Code Online (Sandbox Code Playgroud)

我们计算前面的字符和后面的字符:

def per_window(sequence, n=1):
    """
    From http://stackoverflow.com/q/42220614/610569
        >>> list(per_window([1,2,3,4], n=2))
        [(1, 2), (2, 3), (3, 4)]
        >>> list(per_window([1,2,3,4], n=3))
        [(1, 2, 3), (2, 3, 4)]
    """
    start, stop = 0, n
    seq = list(sequence)
    while stop <= len(seq):
        yield tuple(seq[start:stop])
        start += 1
        stop += 1

char_before= defaultdict(Counter)
char_after = defaultdict(Counter) 
for window in per_window(s, 3):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1
Run Code Online (Sandbox Code Playgroud)

[OUT]:

>>> char_after
defaultdict(collections.Counter,
            {('a', 'b'): Counter({'a': 1, 'e': 1}),
             ('b', 'a'): Counter({'b': 1}),
             ('b', 'e'): Counter({'f': 1}),
             ('c', 'd'): Counter({'a': 1}),
             ('d', 'a'): Counter({'b': 1})})

>>> char_before
defaultdict(collections.Counter,
            {('a', 'b'): Counter({'b': 1, 'd': 1}),
             ('b', 'a'): Counter({'a': 1}),
             ('b', 'e'): Counter({'a': 1}),
             ('d', 'a'): Counter({'c': 1}),
             ('e', 'f'): Counter({'b': 1})})
Run Code Online (Sandbox Code Playgroud)

比方说,如果我取代的所有实例abx,我需要更新的char_afterchar_before计数,目标是实现无所有的子串的重统计s = 'cdxxef',如:

s = 'cdxxef'
char_before2 = defaultdict(Counter)
char_after2 = defaultdict(Counter) 
for window in per_window(s, 3):
    char_after2[window[:2]][window[2]] += 1
    char_before2[window[1:]][window[0]] += 1
Run Code Online (Sandbox Code Playgroud)

[期望的输出]:

>>> char_before2
defaultdict(collections.Counter,
            {('d', 'x'): Counter({'c': 1}),
             ('e', 'f'): Counter({'x': 1}),
             ('x', 'e'): Counter({'x': 1}),
             ('x', 'x'): Counter({'d': 1})})

>>> char_after2
defaultdict(collections.Counter,
            {('c', 'd'): Counter({'x': 1}),
             ('d', 'x'): Counter({'x': 1}),
             ('x', 'e'): Counter({'f': 1}),
             ('x', 'x'): Counter({'e': 1})})
Run Code Online (Sandbox Code Playgroud)

如何在不重新计算所有子字符串的情况下完成子字符串的更新,而只重新计算受替换影响的子字符串?


我试过了:

s = 'cdababef'

char_before= defaultdict(Counter)
char_after = defaultdict(Counter) 
for window in per_window(s, 3):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1

source, target = ('a', 'b'), 'x'
for ch in char_before[source]:
    count_before = char_before[source][ch]
    char_before[target][ch] += count_before
    char_before[source][ch] = 0

    count_after = char_after[source][ch]
    char_after[target][ch] += count_after
    char_before[source][ch] = 0
Run Code Online (Sandbox Code Playgroud)

但输出并不是所希望的,因为char_before2char_after2:

>>> char_before
defaultdict(collections.Counter,
            {'x': Counter({'b': 1, 'd': 1}),
             ('b', 'a'): Counter({'a': 1}),
             ('d', 'a'): Counter({'c': 1}),
             ('b', 'e'): Counter({'a': 1}),
             ('a', 'b'): Counter({'b': 0, 'd': 0}),
             ('e', 'f'): Counter({'b': 1})})

>>> char_after
defaultdict(collections.Counter,
            {'x': Counter({'b': 0, 'd': 0}),
             ('b', 'a'): Counter({'b': 1}),
             ('d', 'a'): Counter({'b': 1}),
             ('b', 'e'): Counter({'f': 1}),
             ('a', 'b'): Counter({'a': 1, 'e': 1}),
             ('c', 'd'): Counter({'a': 1})})
Run Code Online (Sandbox Code Playgroud)

bun*_*nji 5

这是一种通过三个步骤解决此问题的方法:

  1. 识别受替换影响的子串
  2. 从这些子字符串中的char_beforechar_after字典中删除计数
  3. 进行替换并将新计数添加到受影响的子字符串的字典char_beforechar_after字典中

首先让我们从定义一些变量并运行初始代码开始.

source, target = ('a', 'b'), 'x'
n = 3

char_before= defaultdict(Counter)
char_after = defaultdict(Counter) 
for window in per_window(s, n):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1
Run Code Online (Sandbox Code Playgroud)

现在我们找到将被替换的子串的跨度(开始和结束索引)(注意我们实际上还没有进行任何替换)

import re 

spans = [m.span() for m in re.finditer(''.join(source), s)]
Run Code Online (Sandbox Code Playgroud)

但是我们知道落入其中一个跨度的窗口的前后计数并不是唯一会受到替换影响的窗口.直接位于其中一个跨度之前或之后的任何窗口也将受到影响.例如,s = 'cdababef'如果我们'ab''x'初始替换'cd'将需要char_after更新计数,即使'cd'它自身的任何部分都没有被替换.

为了处理这个问题,我们定义了一个函数merge_spans,它不仅可以合并相邻的跨度((2,4)并且(4,6)变为(2,6)),还可以合并彼此extra间隔的跨距(其中extra是由关键字参数定义的整数).这背后的直觉是,这将返回一个跨度列表,其中跨度对应于计数在被替换影响之前/之后的所有子串.

def merge_spans(spans, extra = 0):
    extra = max(0,extra)
    merged = spans[:]
    if len(merged) == 1:
        return [(max(merged[0][0]-extra, 0), merged[0][-1]+extra)]
    for i in range(1, len(merged)):
        span = merged[i]
        prev = merged[i-1]
        if prev[-1]+extra >= span[0]-extra:
            merged[i] = (max(0,prev[0]-extra), span[-1]+extra)
            merged[i-1] = ()
        elif i == len(merged)-1:
            merged[i] = (max(0,span[0]-extra), span[-1]+extra)
            merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
        else:
            merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
    return list(filter(None, merged))       
Run Code Online (Sandbox Code Playgroud)

所以让我们创建这个跨度列表.我们设定extra,n-1因为n-1替换品任何一方的信件都会受到影响.

merged = merge_spans(spans, n-1)   
Run Code Online (Sandbox Code Playgroud)

现在我们可以迭代这些跨度并删除受替换影响的窗口的计数.然后我们可以在该范围内进行替换并更新计数.

for span in merged:
    sub_s = s[span[0]:span[-1]]
    for window in per_window(sub_s, n):
        char_after[window[:2]][window[2]] -= 1
        char_before[window[1:]][window[0]] -= 1
    new_s = sub_s.replace(''.join(source), target)
    for window in per_window(new_s, n):
        char_after[window[:2]][window[2]] += 1
        char_before[window[1:]][window[0]] += 1 
Run Code Online (Sandbox Code Playgroud)

请注意,上述内容会影响原始字典char_beforechar_after字典,但如果由于某种原因需要保留原始计数,则可以先复制它们.

最后,我们从计数器中删除任何计数0或负数,并完全删除任何不包含正计数的窗口.请注意,添加Counter()到计数器会删除任何值为正值的元素.

char_before2 = {k:v+Counter() for k,v in char_before.items() if any(v.values())}
char_after2 = {k:v+Counter() for k,v in char_after.items() if any(v.values())}
Run Code Online (Sandbox Code Playgroud)

结果如下:

>>> char_before2
{('d', 'x'): Counter({'c': 1}),
 ('e', 'f'): Counter({'x': 1}),
 ('x', 'e'): Counter({'x': 1}),
 ('x', 'x'): Counter({'d': 1})}

>>> char_after2 
{('c', 'd'): Counter({'x': 1}),
 ('d', 'x'): Counter({'x': 1}),
 ('x', 'e'): Counter({'f': 1}),
 ('x', 'x'): Counter({'e': 1})}
Run Code Online (Sandbox Code Playgroud)