Kar*_*ara 2 python string algorithm optimization python-3.x
我有长度的字符串n的字母组成A,G,C和T。如果字符串包含相等数量的、和(每次)A,则该字符串是稳定的。我需要找到替换后使其稳定的子串的最小长度。这是问题完整描述的链接。GCTn/4
假设s1=AAGAAGAA。
现在因为n=8理想情况下它应该有 2A秒、2T秒、2G秒和 2C秒。它有 4 个多余的As。因此我们需要一个至少包含 4 个As的子串。
我首先从左边取一个 4 个字符的子字符串,如果没有找到,那么我增加一个变量mnum(即寻找 5 个变量子字符串等等)。
我们得到AAGAA了答案。但它太慢了。
from collections import Counter
import sys
n=int(input()) #length of string
s1=input()
s=Counter(s1)
le=int(n/4) #ideal length of each element
comp={'A':le,'G':le,'C':le,'T':le} #dictionary containing equal number of all elements
s.subtract(comp) #Finding by how much each element ('A','G'...) is in excess or loss
a=[]
b=[]
for x in s.values(): #storing frequency(s.values--[4,2]) of elements which are in excess
if(x>0):
a.append(x)
for x in s.keys(): #storing corresponding elements(s.keys--['A','G'])
if(s[x]>0):
b.append(x)
mnum=sum(a) #minimum substring length to start with
if(mnum==0):
print(0)
sys.exit
flag=0
while(mnum<=n): #(when length 4 substring with all the A's and G's is not found increasing to 5 and so on)
for i in range(n-mnum+1): #Finding substrings with length mnum in s1
for j in range(len(a)): #Checking if all of excess elements are present
if(s1[i:i+mnum].count(b[j])==a[j]):
flag=1
else:
flag=0
if(flag==1):
print(mnum)
sys.exit()
mnum+=1
Run Code Online (Sandbox Code Playgroud)
可以在O(N)时间和O(N)空间中找到最小子串。
首先fr[i]从 length 的输入中计算每个字符的频率n。现在,要认识到的最重要的事情是子串被认为是最小的充分必要条件,它必须包含每个频率至少为 的过多字符fr[i] - n/4。否则,将无法替换丢失的字符。因此,我们的任务是遍历每个这样的子串并选择长度最小的子串。
但是如何有效地找到所有这些呢?
开始时,minLength是n。我们引入2指针索引 -left和right(最初0)在原始字符串中定义子字符串 from leftto 。然后,我们递增,直到每个多余字符的频率至少为。但这还不是全部,因为左侧可能包含不必要的字符(例如,它们并不过分,因此可以删除)。所以,只要仍然包含足够多的元素,我们就会递增。当我们完成后,我们更新它是否大于。我们重复这个过程直到。rightstrrightstr[left:right]fr[i] - n/4str[left : right]leftstr[left : right]minLengthright - leftright >= n
让我们考虑一个例子。让GAAAAAAA成为输入字符串。那么,算法步骤如下:
1.统计每个字符出现的频率:
['G'] = 1, ['A'] = 6, ['T'] = 0, ['C'] = 0 ('A' is excessive here)
Run Code Online (Sandbox Code Playgroud)
2.现在遍历原始字符串:
Step#1: |G|AAAAAAA
substr = 'G' - no excessive chars (left = 0, right = 0)
Step#2: |GA|AAAAAA
substr = 'GA' - 1 excessive char, we need 5 (left = 0, right = 1)
Step#3: |GAA|AAAAA
substr = 'GAA' - 2 excessive chars, we need 5 (left = 0, right = 2)
Step#4: |GAAA|AAAA
substr = 'GAAA' - 3 excessive chars, we need 5 (left = 0, right = 3)
Step#5: |GAAAA|AAA
substr = 'GAAAA' - 4 excessive chars, we need 5 (left = 0, right = 4)
Step#6: |GAAAAA|AA
substr = 'GAAAAA' - 5 excessive chars, nice but can we remove something from left? 'G' is not excessive anyways. (left = 0, right = 5)
Step#7: G|AAAAA|AA
substr = 'AAAAA' - 5 excessive chars, wow, it's smaller now. minLength = 5 (left = 1, right = 5)
Step#8: G|AAAAAA|A
substr = 'AAAAAA' - 6 excessive chars, nice, but can we reduce the substr? There's a redundant 'A'(left = 1, right = 6)
Step#9: GA|AAAAA|A
substr = 'AAAAA' - 5 excessive chars, nice, minLen = 5 (left = 2, right = 6)
Step#10: GA|AAAAAA|
substr = 'AAAAAA' - 6 excessive chars, nice, but can we reduce the substr? There's a redundant 'A'(left = 2, right = 7)
Step#11: GAA|AAAAA|
substr = 'AAAAA' - 5 excessive chars, nice, minLen = 5 (left = 3, right = 7)
Step#12: That's it as right >= 8
Run Code Online (Sandbox Code Playgroud)
或者下面的完整代码:
from collections import Counter
n = int(input())
gene = raw_input()
char_counts = Counter()
for i in range(n):
char_counts[gene[i]] += 1
n_by_4 = n / 4
min_length = n
left = 0
right = 0
substring_counts = Counter()
while right < n:
substring_counts[gene[right]] += 1
right += 1
has_enough_excessive_chars = True
for ch in "ACTG":
diff = char_counts[ch] - n_by_4
# the char cannot be used to replace other items
if (diff > 0) and (substring_counts[ch] < diff):
has_enough_excessive_chars = False
break
if has_enough_excessive_chars:
while left < right and substring_counts[gene[left]] > (char_counts[gene[left]] - n_by_4):
substring_counts[gene[left]] -= 1
left += 1
min_length = min(min_length, right - left)
print (min_length)
Run Code Online (Sandbox Code Playgroud)