Algorithm: which set of N tiles can be used to produce the most Scrabble-valid words?

Par*_*gue 8 python algorithm scrabble

I'm trying to create a function best_tiles which takes the number of tiles in your hand and returns the set of tiles that allows you to produce the most unique English-valid words, assuming that you can only use each tile once.

For example, with the set of tiles in your hand (A, B, C) you can produce the words CAB, BAC, AB, and BA (all of these are English words), so you can spell 4 unique words with that set. With (B, A, A), you can spell 5 words: ABA, BAA, AA, AB, and BA. The goal is to find the set of letters which allows you to spell the most English-valid words (without replacement).

So if 5 is the maximum number of words that can be spelled with any combination of letters for N = 3, running best_tiles(n=3) would return B, A, A.

I'm wondering how to implement this efficiently. My current approach doesn't scale well at all with the number of letters.

I read in a word list. In this case, I'm using enable.txt from here: https://www.wordgamedictionary.com/enable/

import os
path = "enable.txt"
words = []
with open(path , encoding='utf8') as f: 
    for values in f:
        words.append(list(values.strip().upper()))

I created a function word_in_tiles (h/t smack89) which returns whether it is possible to construct a word given a set of tiles:

import collections

def word_in_tiles(word, tiles):
  tiles_counter = collections.Counter(tiles)
  return all(tiles_counter.get(ch, 0) >= cnt
             for ch, cnt in collections.Counter(word).items())
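As a quick sanity check, here is a self-contained version of that predicate (note the `>=` comparison, so leftover tiles are allowed) run against the (B, A, A) example from above:

```python
import collections

def word_in_tiles(word, tiles):
    # A word is spellable if, for every letter it needs, the hand
    # contains at least that many copies of the letter.
    tiles_counter = collections.Counter(tiles)
    return all(tiles_counter.get(ch, 0) >= cnt
               for ch, cnt in collections.Counter(word).items())

tiles = ("B", "A", "A")
candidates = ["ABA", "BAA", "AA", "AB", "BA", "CAB"]
spellable = [w for w in candidates if word_in_tiles(w, tiles)]
# CAB is rejected because the hand has no C; the other five pass,
# matching the count of 5 claimed in the question.
```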

I then created a function get_all_words which produces a list of all the possible words one can spell from a word list and a set of tiles.

def get_all_words(tile_set, words):
    # words is a word list
    return [i for i in words if word_in_tiles(i, tile_set)]

The extremely naive way of determining which tileset is "best" for three letters is the following:

I first create a list of all possible combinations of a given length. So for length 3, I'd do:

import string
import itertools 

letters = string.ascii_lowercase
all_three_letter_combinations = list(itertools.combinations_with_replacement(letters, 3))

# Create a list of only words that are three letters or fewer
three_letter = [i for i in words if len(i) <= 3]

sequence_dict = dict()
for i in all_three_letter_combinations:
    string_test = "".join(i).upper()
    sequence_dict[i] = get_all_words(string_test, three_letter)

Then I remove the entries with no spellable words and take the key with the longest result:

res = {k: v for k, v in sequence_dict.items() if len(v) >= 1}

def GetMaxLength(res):
    return max((len(v), v, k) for k, v in res.items())[1:]

GetMaxLength(res) tells you that, for three letters, the tileset which produces the most English-valid words is (T, A, E), which can produce the words ['AE', 'AT', 'ATE', 'EAT', 'ET', 'ETA', 'TA', 'TAE', 'TEA'].
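The whole naive pipeline can be sketched end to end. This is a minimal, self-contained version (using a tiny hardcoded word list instead of enable.txt, and a hypothetical helper name `best_tiles_naive`), just to make the brute-force structure explicit:

```python
import collections
import itertools
import string

def word_in_tiles(word, tiles):
    counts = collections.Counter(tiles)
    return all(counts.get(ch, 0) >= n
               for ch, n in collections.Counter(word).items())

def best_tiles_naive(n, words):
    # Only words of length <= n can possibly be spelled.
    short = [w for w in words if len(w) <= n]
    # Try every multiset of n letters and keep the one that
    # spells the most words. This is C(26+n-1, n) candidates,
    # which is why the approach does not scale to n = 15.
    best = max(
        itertools.combinations_with_replacement(string.ascii_uppercase, n),
        key=lambda tiles: sum(word_in_tiles(w, tiles) for w in short),
    )
    return best, [w for w in short if word_in_tiles(w, best)]

words = ["AA", "AB", "BA", "ABA", "BAA", "CAB", "BAC", "AT", "TA"]
tiles, spellable = best_tiles_naive(3, words)
# With this toy list, (A, A, B) wins with 5 spellable words.
```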

I want to be able to extend this to N = 15. What is the best procedure for doing this?

bti*_*lly 4

I think this is good enough!

Here is a log of my code running under PyPy:

0:00:00.000232
E
0:00:00.001251
ER
0:00:00.048733
EAT
0:00:00.208744
ESAT
0:00:00.087425
ESATL
0:00:00.132049
ESARTP
0:00:00.380296
ESARTOP
0:00:01.409129
ESIARTLP
0:00:03.433526
ESIARNTLP
0:00:10.391252
ESIARNTOLP
0:00:25.651012
ESIARNTOLDP
0:00:56.642405
ESIARNTOLCDP
0:01:57.257293
ESIARNTOLCDUP
0:03:55.933906
ESIARNTOLCDUPM
0:07:17.146036
ESIARNTOLCDUPMG
0:10:14.844347
ESIARNTOLCDUPMGH
0:13:34.722600
ESIARNTOLCDEUPMGH
0:18:14.215019
ESIARNTOLCDEUPMGSH
0:22:47.129284
ESIARNTOLCDEUPMGSHB
0:27:56.859511
ESIARNTOLCDEUPMGSHBYK
0:46:20.448502
ESIARNTOLCDEUPMGSHBYAK
0:57:15.213635
ESIARNTOLCDEUPMGSHIBYAT
1:09:55.530180
ESIARNTOLCDEUPMGSHIBYATF
1:18:35.209599
ESIARNTOLCDEUPMGSHIBYATRF
1:21:54.095119
ESIARNTOLCDEUPMGSHIBYATRFV
1:20:16.978411
ESIARNTOLCDEUPMGSHIBYAOTRFV
1:14:24.253660
ESIARNTOLCDEUPMGSHIBYAONTRFV
1:00:37.405571

The key improvements are these.

  1. I distinguish not just letters, but also how many times a given letter has been seen. So each (letter, count-seen-so-far) pair is a key that I can either take or skip past. This was an idea I got while commenting on David Eisenstat's solution.
  2. From him I also got the idea that pruning subtrees which cannot beat the current best controls the growth of the problem surprisingly well.
  3. The very first solution I look at is simply the most frequent keys. This starts off as a very good solution, so despite the search being depth-first, we prune aggressively.
  4. I was careful to consolidate exhausted tries into a single record. That reduces how much data we have to carry around.
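Point 1 can be illustrated with a small sketch (`word_to_keys` is a hypothetical helper written for this illustration, not part of the code below): each letter of a word is turned into a (letter, occurrence-index) pair, so the second A in a word is a different key from the first A.

```python
def word_to_keys(word):
    # Map each letter to (letter, how many times that letter
    # has already appeared earlier in this word).
    seen = {}
    keys = []
    for letter in word:
        keys.append((letter, seen.get(letter, 0)))
        seen[letter] = seen.get(letter, 0) + 1
    return keys

keys = word_to_keys("BANANA")
# The three A's become three distinct keys: (A,0), (A,1), (A,2).
```

Because a second or third occurrence of a letter is rarer across the dictionary, sorting these keys by frequency lets the search deal with repeats cheaply.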

Here is the code.

import os
import datetime
path = "enable.txt"
words = []
with open(path) as f:
    for values in f:
        words.append(values.strip().upper())

key_count = {}
for word in words:
    seen = {}
    for letter in word:
        if letter not in seen:
            seen[letter] = 0
        key = (letter, seen[letter])
        if key not in key_count:
            key_count[key] = 1
        else:
            key_count[key] += 1
        seen[letter] += 1


KEYS = sorted(key_count.keys(), key=lambda key: -key_count[key])
#print(KEYS)
#print(len(KEYS))
KEY_POS = {}
for i in range(len(KEYS)):
    KEY_POS[KEYS[i]] = i

# Now we will build a trie.  Every node has a list of words, and a dictionary
# from the next key farther in the trie.
# BUT TRICK: we will map each word to a sequence of numbers, and those numbers
# will be indexes into KEYS.  This allows us to use the fact that a second 'e' is
# unlikely, so we can deal with that efficiently.
class Trie:
    def __init__(self, path):
        self.words = []
        self.dict = {}
        self.min_pos = -1
        self.max_pos = -1
        self.count_words = 0
        self.path = path

    def add_word (self, word):
        trie = self

        poses = []
        seen = {}
        for letter in word:
            if letter not in seen:
                seen[letter] = 0
            key = (letter, seen[letter])
            poses.append(KEY_POS[(key)])
            seen[letter] += 1
        sorted_poses = sorted(poses)
        for i in range(len(sorted_poses)):
            trie.count_words += 1
            pos = sorted_poses[i]
            if pos not in trie.dict:
                trie.dict[pos] = Trie(trie.path + KEYS[pos][0])
                if trie.max_pos < pos:
                    trie.max_pos = pos
            trie = trie.dict[pos]
        trie.count_words += 1
        trie.words.append(word)


base_trie = Trie('')
for word in words:
    base_trie.add_word(word)

def best_solution (size):
    def solve (subset, pos, best, partial):
        found = sum(x[0] for x in partial)
        upper_bound = sum(x[1] for x in partial)
        if size <= len(subset) or upper_bound < best or len(KEYS) <= pos:
            return (found, subset)
        if best < found:
            best = found
        # Figure out our next calculations.
        partial_include = []
        partial_exclude = []
        finalized_found = 0
        for this_found, this_bound, this_trie in partial:
            if this_trie is None:
                # This is a generic record of already emptied tries
                finalized_found += this_found
            elif pos in this_trie.dict:
                include_trie = this_trie.dict[pos]
                partial_include.append((
                    this_found + len(include_trie.words),
                    include_trie.count_words + this_found,
                    include_trie
                ))
                # We included the tally of found words in the previous partial.
                # So do not double-count by including it again
                partial_include.append((
                    0,
                    this_bound - include_trie.count_words - this_found,
                    this_trie
                ))
                partial_exclude.append((
                    this_found,
                    this_bound - include_trie.count_words,
                    this_trie
                ))
            elif this_found == this_bound:
                finalized_found += this_found
            else:
                partial_include.append((
                    this_found,
                    this_bound,
                    this_trie
                ))

                partial_exclude.append((
                    this_found,
                    this_bound,
                    this_trie
                ))
        if 0 < finalized_found:
            partial_include.append(
                (finalized_found, finalized_found, None)
            )
            partial_exclude.append(
                (finalized_found, finalized_found, None)
            )

        found_include, subset_include = solve(subset + [pos], pos+1, best, partial_include)
        if best < found_include:
            best = found_include
        found_exclude, subset_exclude = solve(subset, pos+1, best, partial_exclude)
        if found_include < found_exclude:
            return (found_exclude, subset_exclude)
        else:
            return (found_include, subset_include)


    count, subset = solve([], 0, 0, [(len(base_trie.words), base_trie.count_words, base_trie)])
    return ''.join([KEYS[x][0] for x in subset])

for i in range(20):
    start = datetime.datetime.now()
    print(best_solution(i))
    print(datetime.datetime.now() - start)