Comparing similarity between names

Lui*_*uez 3 python nlp machine-learning

I have to cross-validate some data based on names.

The problem I'm facing is that the names vary slightly depending on the source, for example:

L & L AIR CONDITIONING   vs L & L AIR CONDITIONING Service

BEST ROOFING vs ROOFING INC

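I have thousands of records, so doing this manually would be very time-consuming, and I want to automate the process as much as possible.

Because of the additional words, lowercasing the names is not enough.

What are some good algorithms for handling this?

Perhaps compute a correlation that gives lower weight to words like "INC" or "Service"?

Edit:

I tried the difflib library: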

difflib.SequenceMatcher(None,name_1.lower(),name_2.lower()).ratio()

and got decent results.
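For reference, a self-contained version of that call might look like the sketch below. The helper name `name_ratio` is hypothetical, and the pairs are the two examples from the question. `SequenceMatcher.ratio()` returns a value between 0 and 1, where 1 means the two strings are identical.

```python
import difflib

def name_ratio(name_1, name_2):
    """Character-level similarity in [0, 1] via difflib.SequenceMatcher."""
    a = name_1.strip().lower()
    b = name_2.strip().lower()
    # First argument is the "junk" predicate; None means nothing is ignored.
    return difflib.SequenceMatcher(None, a, b).ratio()

pairs = [
    ("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"),
    ("BEST ROOFING", "ROOFING INC"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {name_ratio(left, right):.3f}")
```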

Sre*_*ary 6

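I would use cosine similarity for this. It gives you a score that reflects how close the two strings are.

Here is some code that can help (I remember taking it from Stack Overflow itself a few months ago, but I can't find the link now):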

import re, math
from collections import Counter

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # Dot product over the words the two vectors share.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    # Product of the two vector norms.
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero norm; treat that as no similarity.
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    return Counter(WORD.findall(text))

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())

    return get_cosine(a, b)

get_similarity('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service') # returns 0.9258200997725514
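The question also suggests giving words like "INC" or "Service" a lower weight. One way that could fit into the cosine approach is to scale the count of such words before computing the score. The following is only a sketch under stated assumptions: the word list `LOW_WEIGHT_WORDS`, the 0.1 weights, and the function names are all hypothetical and would need tuning against real data.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

# Hypothetical weights for generic business words; anything not listed gets 1.0.
LOW_WEIGHT_WORDS = {"inc": 0.1, "service": 0.1, "services": 0.1, "llc": 0.1, "co": 0.1}

def weighted_vector(text):
    """Word counts, with generic words scaled down by their weight."""
    counts = Counter(WORD.findall(text.strip().lower()))
    return {w: c * LOW_WEIGHT_WORDS.get(w, 1.0) for w, c in counts.items()}

def weighted_cosine(a, b):
    """Cosine similarity between the weighted word vectors of a and b."""
    vec1, vec2 = weighted_vector(a), weighted_vector(b)
    numerator = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0

print(weighted_cosine("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"))
print(weighted_cosine("BEST ROOFING", "ROOFING INC"))
```

With the weights above, an extra "Service" or "INC" lowers the score less than it does in the unweighted version, so pairs that differ only by such words score closer to 1.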

Another version I found useful is NLP-based, and I wrote it myself.

import re, math
from collections import Counter
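# Requires the NLTK data packages: nltk.download('stopwords') and nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn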

stop = stopwords.words('english')

WORD = re.compile(r'\w+')
stemmer = PorterStemmer()

def get_cosine(vec1, vec2):
    # Dot product over the words the two vectors share.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    # Product of the two vector norms.
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero norm; treat that as no similarity.
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

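def text_to_vector(text):
    words = WORD.findall(text)
    a = []
    # Expand each word with the lemma names of all its WordNet synsets.
    for i in words:
        for ss in wn.synsets(i):
            a.extend(ss.lemma_names())
    # Keep the original words that WordNet did not return.
    for i in words:
        if i not in a:
            a.append(i)
    a = set(a)
    # Drop stop words, then stem what is left.
    w = [stemmer.stem(i) for i in a if i not in stop]
    return Counter(w)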

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())

    return get_cosine(a, b)

def get_char_wise_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    s = []

    # Score every stem in a against every stem in b, then average.
    for i in a:
        for j in b:
            s.append(get_similarity(str(i), str(j)))
    if not s:  # one of the inputs produced no tokens
        return 0
    return sum(s) / float(len(s))

get_similarity('I am a good boy', 'I am a very disciplined guy')
# Returns 0.5491201525567068

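You can call both get_similarity and get_char_wise_similarity to see which works better for your use case. I used both: the normal similarity to weed out the very close matches, and then the character-wise similarity to weed out the ones that are close enough. The rest has to be handled manually.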
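That two-pass workflow could be sketched roughly as follows. To keep the example free of NLTK, it swaps in a plain word-count cosine for the first pass and `difflib.SequenceMatcher` for the second, so it is not the exact pair of functions above. The thresholds `strict` and `loose` and the function names are hypothetical and should be tuned on a labeled sample.

```python
import difflib
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def word_cosine(a, b):
    """Cosine similarity over lowercase word counts."""
    v1 = Counter(WORD.findall(a.lower()))
    v2 = Counter(WORD.findall(b.lower()))
    num = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    den = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return num / den if den else 0.0

def char_ratio(a, b):
    """Character-level similarity, used here as the looser second pass."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(a, b, strict=0.9, loose=0.6):
    """Sort a pair of names into match / probable match / manual review."""
    if word_cosine(a, b) >= strict:
        return "match"
    if char_ratio(a, b) >= loose:
        return "probable match"
    return "manual review"

for left, right in [
    ("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"),
    ("BEST ROOFING", "BEST ROOFNG"),
    ("BEST ROOFING", "XYZ"),
]:
    print(left, "|", right, "->", triage(left, right))
```

Only the pairs that fall through both passes need to be checked by hand, which is where most of the time saving comes from.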