从其排列构造字符串，最少跳回

Question

从其排列构造字符串，最少跳回

Nic*_*sHi 16 python string algorithm optimization matching

找到多项式时间算法或证明以下问题的 np 难度：

给定两个字符串s1=a1, a2,...,ak，s2=b1,...,bk,其中s2是的随机排列s1。

我们现在想要s1构建s2. 建设工程如下：

从中选择一个字母s2，等于a1并将其删除。

继续，用字母a2并将其删除等等，直到s2为空。

请注意，同一个字母可以在中出现多次s2。让C = c1, c2,...,ck为构造的序列，因此C = s1。我们定义为需要跳回以选择下一个字母l的次数。s2

例如，如果我们选择c3和，但原始中的c4索引小于原始中的索引，我们就会加一。c4s2c3s2l

我们的任务是找到最佳序列，使之l最小。例如，如果给定s1=abac，s2=acab输出必须为 1，因为我们可以从中选取第二个“a” s2，然后选择“b”，然后跳回并选取第一个“a”，然后添加“c”。

我不知道如何在多项式时间内解决这个问题。我想也许有某种方法可以计算完美匹配并读取最佳序列，但我不确定。到目前为止，我只有指数算法。

指数算法如下（不确定是否正确，不知道如何测试）：

def solvenaive(s1, s2, curr_ind):
    if len(s1) == 0:
        return 0
    first_s1 = s1[0]
    
    vorkommen = findOccurrences(s2, first_s1)
    results = []

    for i in vorkommen:
        new_s1 = s1[1:]
        new_s2 = stringPop(s2, i)
        res = solvenaive(new_s1, new_s2, i)
        
        if curr_ind > i:
            results.append(res+1)
        else:
            results.append(res)
    
    return min(results)

Run Code Online (Sandbox Code Playgroud)

Answer 1

chr*_*slg 6

只是为了澄清我对 Branch&bound 的评论，这是我的镜头。

\n

它显然比你的更快（同时保持“天真的”纯Python实现。例如，当我们明确想要“改变”它们时使用不可变字符串：D）。然而，它显然仍然不是多项式。只是一个更快的指数。

\n

分支与绑定的大多数情况就是这种情况。

\n

import itertools\nimport random\nimport numpy as np\nimport timeit\nimport matplotlib.pyplot as plt\n\nW=10 # Size of a word\n\n# Generate a random word\ndef randomWord():\n    return \'\'.join(np.random.choice(list("abcdef"),W))\n\n# And a random shuffling of it\ndef shuffle(s):\n    return \'\'.join(random.sample(s,len(s)))\n\n# Quick\'n\'dirty implementation of the function you are using\n# It would have been better, btw, if you had provided them, as\n# needed to create a minimal reproducible example\ndef findOccurrences(s, c):\n    return [pos for pos, char in enumerate(s) if char == c]\n\ndef stringPop(s, i):\n    return s[:i]+s[i+1:]\n\n# Your function\ndef solvenaive(s1, s2, curr_ind):\n    if len(s1) == 0:\n        return 0\n    first_s1 = s1[0]\n    \n    vorkommen = findOccurrences(s2, first_s1)\n    results = []\n\n    for i in vorkommen:\n        new_s1 = s1[1:]\n        new_s2 = stringPop(s2, i)\n        res = solvenaive(new_s1, new_s2, i)\n        \n        if curr_ind > i:\n            results.append(res+1)\n        else:\n            results.append(res)\n    \n    return min(results)\n\n# Mine. Almost the same.\n# But I add 2 parameters\n# bestScore which is what we have seen best\n# and sofar, a partial score \ndef solvecut(s1, s2, curr_ind, bestScore, sofar=0):\n    if len(s1)==0: # If we reach a "leaf" of recursion, then\n                   # return the score. (sofar is a real score in that case, not a partial)\n        return sofar\n    # Otherwise, we will recurse... unless it is a lost cause\n    if sofar>=bestScore: # No need to try better, we are already over our best\n        return 1e9\n\n    first_s1 = s1[0]\n    \n    vorkommen = findOccurrences(s2, first_s1)\n\n\n    for i in vorkommen:\n        new_s1 = s1[1:] # I let that one here in the loop because I don\'t want to optimize anything other that my "cuts"\n                        # so that improvement are clearly imputable to those cuts\n                        # but, obviously that could have been done outside the loop\n        new_s2 = stringPop(s2, i)\n        if curr_ind>i:\n            res=solvecut(new_s1, new_s2, i, bestScore, sofar+1)\n        else:\n            res=solvecut(new_s1, new_s2, i, bestScore, sofar)\n\n        if res<bestScore:\n            bestScore=res\n\n    return bestScore # Sometimes we\'ll return that just because we did nothing best, but it doesn\'t matter\n\n\n# Test if result are the same on some examples\n\nfor i in range(20):\n    w=randomWord()\n    s=shuffle(w)\n    sc1=solvenaive(w,s,0)\n    sc2=solvecut(w, s, 0, 1e9)\n    print(w, s, sc1, sc2)\n\n# Timeit\n# Note that timing depends a lot of the word (letter repetition, etc.). So it is important to do them with the same words\n# Especially since this is not very fast, so we need to timit with number=1\nW=17\nw1=randomWord()\ns1=shuffle(w1)\nprint(timeit.timeit(lambda: solvenaive(w1, s1, 0), number=1))\nprint(timeit.timeit(lambda: solvecut(w1, s1, 0, 1e9), number=1))\n

Run Code Online (Sandbox Code Playgroud)\n

结果

\n

cbaaecffae aaebcffcae 2 2\needeafdffb ffdbeefaed 3 3\naedefdceba adeefcabde 4 4\naaafccaafb aafacaafcb 2 2\neaffdfafef efdffefaaf 2 2\ndedbfffbce bcdffbedfe 3 3\nfedfcbaeed bedcfaefde 4 4\nffeebfcfab feaebcffbf 3 3\nfcdaddbfbf ffddacbfbd 3 3\ndeffcdeaea eeafdadfce 4 4\ncafeeebcdb beaedfbecc 4 4\nbeefdaeabd edaefbbdea 3 3\nddccbdbdae cdbdbcdead 3 3\nbcbababdef decafbbbba 4 4\ndfaeceefea edfcefaaee 2 2\nabcdcbbdbf fbbadccbdb 4 4\nacbedbaefc cbeacaebdf 3 3\naaffacfbde cfaaeafbfd 3 3\nedceddebdc dedcecbded 2 2\necafabdfea fafaeaedbc 3 3\n4.060046362923458\n0.05187216284684837\n

Run Code Online (Sandbox Code Playgroud)\n

因此我们可以合理地相信我的代码返回的结果始终与您的相同。

\n

但时间却快了近 100 倍。

\n

然而（我已经尝试过许多 W 的值，不幸的是，尽管存在随机性，但它是显而易见的），它仍然是指数的：

\n

1 cut=6.040092557668686e-06 \n2 cut=8.61193984746933e-06 \n3 cut=1.5990110114216805e-05 \n4 cut=1.5911879017949104e-05 \n5 cut=2.897903323173523e-05 \n6 cut=3.5561854019761086e-05 \n7 cut=5.8827921748161316e-05 \n8 cut=0.00018196692690253258 \n9 cut=0.0001191610936075449 \n10 cut=0.0002572301309555769 \n11 cut=0.0008078841492533684 \n12 cut=0.0009584939107298851 \n13 cut=0.008062224835157394 \n14 cut=0.004587493138387799 \n15 cut=0.08773919194936752 \n16 cut=0.02209200686775148 \n17 cut=0.045466484036296606 \n18 cut=0.11209587100893259 \n19 cut=0.40787983988411725 \n20 cut=2.2789966259151697 \n21 cut=1.8927371250465512 \n22 cut=1.097903796005994 \n

Run Code Online (Sandbox Code Playgroud)\n

不平滑......但我们可以清楚地看到，“线性且有大量噪声”的是时间的对数，而不是值。所以，它仍然是指数级的。

\n

还要注意，通过一些思考，我们应该能够估计出比迄今为止更好的可达到分数的最小界限（使用到目前为止意味着我们仍然总是希望有一个解决方案，而不会返回。例如，如果curr_ind>0。所以我们至少可以通过说如果curr_ind>0，则sofar+1必须小于，bestScore否则尝试是没有用的来改进）

\n

编辑：我无法抗拒并尝试一下。\n所以这里有一些轻微的优化

\n

1 cut=6.040092557668686e-06 \n2 cut=8.61193984746933e-06 \n3 cut=1.5990110114216805e-05 \n4 cut=1.5911879017949104e-05 \n5 cut=2.897903323173523e-05 \n6 cut=3.5561854019761086e-05 \n7 cut=5.8827921748161316e-05 \n8 cut=0.00018196692690253258 \n9 cut=0.0001191610936075449 \n10 cut=0.0002572301309555769 \n11 cut=0.0008078841492533684 \n12 cut=0.0009584939107298851 \n13 cut=0.008062224835157394 \n14 cut=0.004587493138387799 \n15 cut=0.08773919194936752 \n16 cut=0.02209200686775148 \n17 cut=0.045466484036296606 \n18 cut=0.11209587100893259 \n19 cut=0.40787983988411725 \n20 cut=2.2789966259151697 \n21 cut=1.8927371250465512 \n22 cut=1.097903796005994 \n

Run Code Online (Sandbox Code Playgroud)\n

这里没有太多优化

\n

我把new_s1计算放在循环之外
我将 s2 位置的迭代与对这些位置的搜索结合起来
正如我所说，如果curr_ind>0我们知道我们无法达到小于sofar+1. 这可能是最重要的优化，即使只是这个+1，因为它切断了即将等于最佳分数的整个递归分支。

\n

我这次得到的时间

\n

def lessNaive(s1, s2, curr_ind, bestScore, sofar):\n    if len(s1)==0:\n        return sofar\n    if sofar>=bestScore or (curr_ind>0 and sofar+1>=bestScore): # No need to try better, we are already over our best\n        return 1e9\n\n    first_s1 = s1[0]\n    new_s1 = s1[1:]\n\n    for i in range(len(s2)):\n        if s2[i]!=first_s1: continue\n        new_s2 = stringPop(s2, i)\n        if curr_ind>i:\n            res=lessNaive(new_s1, new_s2, i, bestScore, sofar+1)\n        else:\n            res=lessNaive(new_s1, new_s2, i, bestScore, sofar)\n\n        if res<bestScore:\n            bestScore=res\n\n    return bestScore # Sometimes we\'ll return that just because we did nothing best, but it doesn\'t matter\n

Run Code Online (Sandbox Code Playgroud)\n

[天真的时代，削减，现在不那么天真了]

\n

这显然是一个更难的情况（因为它与之前的 4.06/0.051 的 W 相同）。但重点是，即使是像这样的小优化+1时间也会改变一个数量级！\n（然而，它仍然是指数。一个更小的指数，但是指数。指数的事情是，两个指数之间的比率也是指数！所以即使从一个指数到另一个指数也确实值得。但是，最小的指数当然仍然不是多项式）

\n

定时更新

\n

与凯利讨论后的后续行动：我进行了一些其他测试。有 6 个版本，即 3 个版本已经测试过（naive、cut、lessnaive），有或没有lru_cache讨论。

\n

操作方式是，我随机生成一个 W=22 的单词并对其进行随机播放。然后用相同的单词运行所有 6 个代码（这是我之前已经做过的。因为时序根据输入的随机性而变化，所以必须使用相同的输入进行时序比较）。\n与之前时序的差异（在lru_cache版本之外）是我保留了最小/最大/平均计时比率。

\n

这是 25 个单词后的计时（不多，但 W=22，而且我的电脑相当慢，已经花了 2 天）

\n

11.236965486081317\n0.4795242650434375\n0.08659004187211394\n

Run Code Online (Sandbox Code Playgroud)\n

C后缀表示带修饰的函数lru_cache。

\n

所以，我必须说，我没有看到与 Kelly 相同的结果。\nt是每次迭代的时间（这里仅是第 25 次迭代）。\xce\xbc 是平均时间（所有 25 个示例的总时间除以 25）。然后我计算一个示例的计时与“naive”版本的同一示例的计时之间的比率（因此，对于“naive”行，这些比率始终为 100%）。\xcf\x81 是 25 个示例的比率的平均值。括号之间是这 25 个示例的该比率的最小值和最大值。

\n

结论是

\n

lru_cache平均而言，对“天真的”版本略有帮助。但只是非常轻微。而不适用于其他版本。
时序比平均为 x30，而不是 x100。x100 很幸运。但请注意，这也是最坏的情况。原因是有些示例需要花费很长时间才能运行，而这些示例的比率最不利。然而，它仍然是一个 x30 的因素。请注意，最好的情况是如此之高，以至于我可以测量它（因为我选择只打印 2 位小数）。但这意味着最好的情况是时间lessNaive比“naive”版本少 0.005% 的情况。这意味着该比率最高可达 x20000，有利于lessNaive. 并且，再次强调，永远不会低于 x30 有利于lessNaive.

\n

因此，时间安排确实有很大差异。但永远不会达到更慢的程度lessNaive。此外，这似乎几乎不可能，因为它是完全相同的代码，只是停止了一些递归。如果它根本无法减少任何递归，它可能会慢一点：我们将支付测试成本，但没有任何回报。但该测试的成本比其余代码（pop 等）要低得多。即使在所有 25 个示例中，我们从未遇到过速度lessNaive不是至少 30 倍的情况，理论上也可以构建一个速度稍慢的示例（没有递归被行中止的示例if sofar>=bestScore or......）但我认为慢 50% 是不可能的。

\n

归档时间：	2 年，10 月前
查看次数：	772 次
最近记录：	2 年，10 月前