I have some ugly strings like these:
string1 = 'Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)'
string2 = 'Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)'
I'd like a library or algorithm that will give me a percentage of how many words they have in common, ignoring special characters such as ',' and ':' and ''' and '{' etc.
I know of the Levenshtein algorithm. However, that compares the number of similar CHARACTERS, whereas I want to compare how many WORDS the strings have in common.
A regular expression can easily give you all the words:
import re
s1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
s2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"
s1w = re.findall(r'\w+', s1.lower())  # raw string avoids escape warnings
s2w = re.findall(r'\w+', s2.lower())
collections.Counter (Python 2.7+) can quickly count how often each word occurs:
from collections import Counter
s1cnt = Counter(s1w)
s2cnt = Counter(s2w)
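For example, once you have the two Counters you can turn them into a common-word percentage directly: `Counter` supports multiset intersection with `&`, which keeps duplicate words in the count. This is a minimal sketch (the helper name `common_word_ratio` is my own, not from any library):

```python
from collections import Counter

def common_word_ratio(words_a, words_b):
    """Fraction of shared words (with multiplicity) relative to
    the total number of words in both lists."""
    ca, cb = Counter(words_a), Counter(words_b)
    shared = sum((ca & cb).values())  # multiset intersection
    total = len(words_a) + len(words_b)
    return 2.0 * shared / total if total else 0.0

print('%.1f%%' % (100 * common_word_ratio(['the', 'cat', 'sat'],
                                          ['the', 'dog', 'sat'])))
```

The `2 * shared / total` form mirrors how `difflib.SequenceMatcher.ratio()` is defined, so the two numbers are comparable.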
A very rough comparison can be done with set.intersection or difflib.SequenceMatcher, but it sounds like you want to implement a Levenshtein algorithm that works on words, in which case you can use these two lists:
common = set(s1w).intersection(s2w)
# returns set(['c'])
import difflib
common_ratio = difflib.SequenceMatcher(None, s1w, s2w).ratio()
print('%.1f%% of words common.' % (100*common_ratio))
Prints: 3.4% of words common.
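If you do want a word-level Levenshtein distance, the character algorithm carries over unchanged: run the standard dynamic program over the word lists instead of over strings. A sketch (function name `word_levenshtein` is my own):

```python
def word_levenshtein(a, b):
    """Levenshtein distance over tokens (words) instead of characters.
    Standard DP with a rolling row, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(word_levenshtein(['the', 'cat', 'sat'], ['the', 'dog', 'sat']))
```

Feed it the `s1w` and `s2w` lists from above; dividing the distance by `max(len(s1w), len(s2w))` gives a normalized dissimilarity you can turn into a percentage.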