寻找茎和结尾的组合

Len*_*bro 10 python

我有"词干"和"结尾"(可能不是正确的单词)的映射,如下所示:

all_endings = {
 'birth': set(['place', 'day', 'mark']), 
 'snow': set(['plow', 'storm', 'flake', 'man']),
 'shoe': set(['lace', 'string', 'maker']),
 'lock': set(['down', 'up', 'smith']),
 'crack': set(['down', 'up',]),
 'arm': set(['chair']),
 'high': set(['chair']),
 'over': set(['charge']),
 'under': set(['charge']),
}
Run Code Online (Sandbox Code Playgroud)

但当然要长得多.我也用相反的方式制作了相应的字典:

all_stems = {
 'chair': set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'up': set(['lock', 'crack', 'vote']),
 'down': set(['lock', 'crack', 'fall']),
 'smith': set(['lock']),
 'place': set(['birth']),
 'day': set(['birth']),
 'mark': set(['birth']),
 'plow': set(['snow']),
 'storm': set(['snow']),
 'flake': set(['snow']),
 'man': set(['snow']),
 'lace': set(['shoe']),
 'string': set(['shoe']),
 'maker': set(['shoe']),
}
Run Code Online (Sandbox Code Playgroud)

我现在试图想出一个算法来找到两个或多个匹配两个或多个"结尾"的"茎"的匹配.例如,在上面,它将与锁定和裂缝向下和向上匹配,从而产生

lockdown
lockup
crackdown
crackup
Run Code Online (Sandbox Code Playgroud)

但不包括'upvote', 'downfall' or 'locksmith'(这是导致我最大问题的原因).我得到的误报如下:

pancake
cupcake
cupboard
Run Code Online (Sandbox Code Playgroud)

但我只是在"循环"中走了一圈.(双关语意图),我似乎无处可去.我很欣赏任何正确的方向.

到目前为止,混淆和无用的代码,您可能应该忽略它们:

findings = defaultdict(set)
for stem, endings in all_endings.items():
    # What stems have matching endings:
    for ending in endings:
        otherstems = all_stems[ending]
        if not otherstems:
            continue
        for otherstem in otherstems:
            # Find endings that also exist for other stems
            otherendings = all_endings[otherstem].intersection(endings)
            if otherendings:
                # Some kind of match
                findings[stem].add(otherstem)

# Go through this in order of what is the most stems that match:

MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
    for this_stem in match:
        other_stems = set() # Stems that have endings in common with this_stem
        other_endings = set() # Endings this stem have in common with other stems
        this_endings = all_endings[this_stem]
        for this_ending in this_endings:
            for other_stem in all_stems[this_ending] - set([this_stem]):
                matching_endings = this_endings.intersection(all_endings[other_stem])
                if matching_endings:
                    other_endings.add(this_ending)
                    other_stems.add(other_stem)

        stem_matches = all_stems[other_endings.pop()]
        for other in other_endings:
            stem_matches = stem_matches.intersection(all_stems[other])

        if len(stem_matches) >= MINMATCH:
            for m in stem_matches:
                for e in all_endings[m]:
                    print(m+e)
Run Code Online (Sandbox Code Playgroud)

ire*_*ses 4

它不是特别漂亮,但是如果您将字典分解为两个列表并使用显式索引,这将非常简单:

all_stems = {
 'chair' : set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'fall'  : set(['down', 'water', 'night']),
 'up'    : set(['lock', 'crack', 'vote']),
 'down'  : set(['lock', 'crack', 'fall']),
}

endings     = all_stems.keys()
stem_sets   = all_stems.values()

i = 0
for target_stem_set in stem_sets:
    i += 1
    j  = 0

    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        union = target_stem_set & remaining_stem_set
        if len(union) > 1:
            print "%d matches found" % len(union)
            for stem in union:
                print "%s%s" % (stem, endings[i-1])
                print "%s%s" % (stem, endings[j+i-1])
Run Code Online (Sandbox Code Playgroud)

输出:

$ python stems_and_endings.py 
2 matches found
lockdown
lockup
crackdown
crackup
Run Code Online (Sandbox Code Playgroud)

基本上,我们所做的就是依次迭代每个集合,并将其与每个剩余的集合进行比较,看看是否有两个以上的匹配项。我们永远不必尝试早于当前集合的集合,因为它们已经在先前的迭代中进行了比较。其余的(索引等)只是簿记。