我有"词干"和"结尾"(可能不是正确的单词)的映射,如下所示:
all_endings = {
'birth': set(['place', 'day', 'mark']),
'snow': set(['plow', 'storm', 'flake', 'man']),
'shoe': set(['lace', 'string', 'maker']),
'lock': set(['down', 'up', 'smith']),
'crack': set(['down', 'up',]),
'arm': set(['chair']),
'high': set(['chair']),
'over': set(['charge']),
'under': set(['charge']),
}
Run Code Online (Sandbox Code Playgroud)
但当然要长得多.我也用相反的方式制作了相应的字典:
all_stems = {
'chair': set(['high', 'arm']),
'charge': set(['over', 'under']),
'up': set(['lock', 'crack', 'vote']),
'down': set(['lock', 'crack', 'fall']),
'smith': set(['lock']),
'place': set(['birth']),
'day': set(['birth']),
'mark': set(['birth']),
'plow': set(['snow']),
'storm': set(['snow']),
'flake': set(['snow']),
'man': set(['snow']),
'lace': set(['shoe']),
'string': set(['shoe']),
'maker': set(['shoe']),
}
Run Code Online (Sandbox Code Playgroud)
我现在试图想出一个算法来找到两个或多个匹配两个或多个"结尾"的"茎"的匹配.例如,在上面,它将与锁定和裂缝向下和向上匹配,从而产生
lockdown
lockup
crackdown
crackup
Run Code Online (Sandbox Code Playgroud)
但不包括'upvote', 'downfall' or 'locksmith'(这是导致我最大问题的原因).我得到的误报如下:
pancake
cupcake
cupboard
Run Code Online (Sandbox Code Playgroud)
但我只是在"循环"中走了一圈.(双关语意图),我似乎无处可去.我很欣赏任何正确的方向.
到目前为止,混淆和无用的代码,您可能应该忽略它们:
findings = defaultdict(set)
for stem, endings in all_endings.items():
# What stems have matching endings:
for ending in endings:
otherstems = all_stems[ending]
if not otherstems:
continue
for otherstem in otherstems:
# Find endings that also exist for other stems
otherendings = all_endings[otherstem].intersection(endings)
if otherendings:
# Some kind of match
findings[stem].add(otherstem)
# Go through this in order of what is the most stems that match:
MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
for this_stem in match:
other_stems = set() # Stems that have endings in common with this_stem
other_endings = set() # Endings this stem have in common with other stems
this_endings = all_endings[this_stem]
for this_ending in this_endings:
for other_stem in all_stems[this_ending] - set([this_stem]):
matching_endings = this_endings.intersection(all_endings[other_stem])
if matching_endings:
other_endings.add(this_ending)
other_stems.add(other_stem)
stem_matches = all_stems[other_endings.pop()]
for other in other_endings:
stem_matches = stem_matches.intersection(all_stems[other])
if len(stem_matches) >= MINMATCH:
for m in stem_matches:
for e in all_endings[m]:
print(m+e)
Run Code Online (Sandbox Code Playgroud)
它不是特别漂亮,但是如果您将字典分解为两个列表并使用显式索引,这将非常简单:
all_stems = {
'chair' : set(['high', 'arm']),
'charge': set(['over', 'under']),
'fall' : set(['down', 'water', 'night']),
'up' : set(['lock', 'crack', 'vote']),
'down' : set(['lock', 'crack', 'fall']),
}
endings = all_stems.keys()
stem_sets = all_stems.values()
i = 0
for target_stem_set in stem_sets:
i += 1
j = 0
remaining_stems = stem_sets[i:]
for remaining_stem_set in remaining_stems:
j += 1
union = target_stem_set & remaining_stem_set
if len(union) > 1:
print "%d matches found" % len(union)
for stem in union:
print "%s%s" % (stem, endings[i-1])
print "%s%s" % (stem, endings[j+i-1])
Run Code Online (Sandbox Code Playgroud)
输出:
$ python stems_and_endings.py
2 matches found
lockdown
lockup
crackdown
crackup
Run Code Online (Sandbox Code Playgroud)
基本上,我们所做的就是依次迭代每个集合,并将其与每个剩余的集合进行比较,看看是否有两个以上的匹配项。我们永远不必尝试早于当前集合的集合,因为它们已经在先前的迭代中进行了比较。其余的(索引等)只是簿记。
| 归档时间: |
|
| 查看次数: |
339 次 |
| 最近记录: |