pbh*_*bhj 9 database comparison list recordset fuzzy-comparison
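I have two lists of song titles, each in a plain text file; they are the filenames of licensed lyric files. I want to check whether the titles in the shorter list (the needle) appear in the longer list (the haystack). The script/app should return the list of needle titles that are not in the haystack.
I'd prefer to use Python or a shell script (BASH), or just a visual diff program that can handle the required fuzziness.
The main problem is that the titles need to be matched fuzzily to account for data entry errors and possibly also word order.
Haystack sample (note some duplicate and near-duplicate lines; matches are highlighted):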
Yearn
Yesterday, Today And Forever
Yesterday, Today, Forever
You
You Alone
You Are Here (The Same Power)
You Are Holy
You Are Holy (Prince Of Peace)
You Are Mighty
You Are Mine
You Are My All In All
You Are My Hiding Place
You Are My King (Amazing Love)
You Are Righteous (Hope)
You Are So Faithful
You Are So Good to Me
You Are Worthy Of My Praise
You Have Been Good
You Led Me To The Cross
You Reign
You Rescued Me
You Said
You Sent Your Own
You Set Me Apart (Dwell In Your House)
You alone are worthy (Glory in the highest)
You are God in heaven (let my words be few)
You are always fighting for us (Hallelujah you have overcome)
You are beautiful (I stand in awe)
You are beautiful beyond description
You are mighty
You are my all in all
You are my hiding place
You are my passion
You are still Holy
You are the Holy One (We exalt Your name)
You are the mighty King
You are the mighty warrior
You are the vine
**You chose the cross (Lost in wonder)**
You have shown me favour unending
You hold the broken hearted
You laid aside Your majesty
You said
You're Worthy Of My Praise
You're calling me (Unashamed love)
You're the God of this city
You're the Lion of Judah
You're the word of God the Father (Across the lands)
You've put a new song in my heart
Your Beloved
Your Grace is Enough
Your Great Name We Praise
Your Great Name We Praise-2
Your Light (You Have Turned)
Your Light Is Over Me (His Love)
**Your Love**
**Your Love Is Amazing**
Your Love Is Deep
Your Love Is Deeper - Jesus, Lord of Heaven (Phil Wickham)
Your Love Oh Lord
Your Love Oh Lord (Psalm 36)
Your Love is Extravagant
Your Power (Send Me)
Your blood speaks a better word
Your everlasting love
**Your grace is enough**
**Your grace is enough (Great is Your faithfulness)**
Your mercy is falling
Your mercy taught us how to dance (Dancing generation)
Your voice stills the oceans (nothing is impossible)
Yours Is The Kingdom
Needle sample:
You Are Good (I Want To Scream It Out)
You Are My Strength (In The Fullness)
You Are My Vision O King Of My Heart
You Are The King Of Glory (Hosanna To The Son)
**You Chose The Cross (Lost In Wonder)**
**Your Grace Is Enough (This Is Our God)**
**Your Love Is Amazing Steady And Unchanging**
**Your Love Shining Like The Sun**
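Note that the title "Your Love Shining Like The Sun" is only a possible match for "Your Love". It is better to miss a match than to match wrongly, so any title with an uncertain match should appear in the output.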
comm -1 -3 <(sort haystack.txt) <(sort needle.txt)
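That found no matches. diff and grep seem to have the same problem: they are not fuzzy enough. Kdiff3 and diffnow.com are only about as fast as comparing by hand, because I still have to scan nearly all the matches; they can only cope with whitespace and letter-case differences.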
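One reason an exact comparison such as comm reports every needle as missing is that the two lists differ in letter case and punctuation. A minimal baseline sketch, assuming the titles are already loaded into Python lists (the helper names `normalize` and `unmatched_exact` are hypothetical, not from any library), normalizes both sides before taking the difference:

```python
import re

def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.casefold())
    return " ".join(cleaned.split())

def unmatched_exact(needles, haystack):
    """Return needles whose normalized form is absent from the haystack."""
    seen = {normalize(h) for h in haystack}
    return [n for n in needles if normalize(n) not in seen]

# A few titles taken from the samples above
haystack = [
    "You chose the cross (Lost in wonder)",
    "Your Love",
    "Your grace is enough",
]
needles = [
    "You Chose The Cross (Lost In Wonder)",
    "Your Love Shining Like The Sun",
]
print(unmatched_exact(needles, haystack))
# ['Your Love Shining Like The Sun']
```

This only removes case and punctuation noise; typos and partial titles still need a fuzzy comparison.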
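ExamDiffPro from prestosoft.com looks like a possibility, but it is MS Windows only, and I'd prefer a native Linux solution before I start messing with WINE or VirtualBox.
The needle is actually a CSV, so I have considered using LibreOffice and treating it as a database with SQL queries, or using a spreadsheet with hlookup or something similar... Another question led me to OpenRefine (formerly google-refine).
This seems to be a common class of problem (it is basically "record linkage", which often uses [Levenshtein] edit distance calculations). How should I approach it? Any suggestions?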
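For reference, the Levenshtein edit distance mentioned above can be sketched in pure Python. This is an illustrative dynamic-programming version, not what any particular tool uses internally; the function names `levenshtein` and `similarity` are hypothetical, and scaling by the longer string's length is just one possible normalization.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to 0.0-1.0, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))  # 3
```

A threshold on `similarity` then decides which pairs count as the same title; where to set it depends on how noisy the data is.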
You might want to take a look at fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy).
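from fuzzywuzzy import process

# Read both lists, skipping blank lines
with open('needle') as f:
    needles = [line.strip() for line in f if line.strip()]
with open('haystack') as f:
    haystack = [line.strip() for line in f if line.strip()]

for a in needles:
    # List every haystack title that scores at least 90 against this needle
    print(a, '->', process.extractBests(a, haystack, score_cutoff=90))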
Useful parameters for the extract functions are limit, scorer and processor.
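To illustrate the roles those three parameters play, here is a standard-library sketch built on difflib rather than fuzzywuzzy, so its scores will not be identical to the library's. All names (`processor`, `scorer`, `best_matches`, `unmatched`) are hypothetical helpers written for this example. The processor cleans a title and sorts its words so that word order is ignored, the scorer rates two cleaned strings from 0 to 100, and the limit caps how many candidates come back.

```python
import re
from difflib import SequenceMatcher

def processor(title):
    """Lowercase, drop punctuation, and sort words so word order is ignored."""
    words = re.sub(r"[^\w\s]", " ", title.casefold()).split()
    return " ".join(sorted(words))

def scorer(a, b):
    """Similarity from 0 to 100 based on difflib's matching-blocks ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_matches(query, choices, score_cutoff=90, limit=5):
    """Return up to `limit` (choice, score) pairs scoring at least `score_cutoff`."""
    q = processor(query)
    scored = [(c, scorer(q, processor(c))) for c in choices]
    kept = [pair for pair in scored if pair[1] >= score_cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:limit]

def unmatched(needles, haystack, score_cutoff=90):
    """Needles with no haystack title at or above the cutoff."""
    return [n for n in needles
            if not best_matches(n, haystack, score_cutoff, limit=1)]

# A few titles taken from the samples in the question
haystack = [
    "You chose the cross (Lost in wonder)",
    "Your Love",
    "Your Love Is Amazing",
    "Your grace is enough",
]
needles = [
    "You Chose The Cross (Lost In Wonder)",
    "Your Love Shining Like The Sun",
    "Your Grace Is Enough (This Is Our God)",
]
for title in unmatched(needles, haystack):
    print(title)
```

With a cutoff of 90, only the first needle is treated as found; the other two are printed for manual review, which fits the requirement that uncertain matches appear in the output. Lowering the cutoff trades fewer manual checks for a higher risk of false matches.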