Text normalization: text similarity in Python. How to normalize text spelling mismatches?

Question

I have a dataframe with a column A, as shown below:

Column A
Carrefour supermarket
Carrefour hypermarket
Carrefour
carrefour
Carrfour downtown
Carrfor market
Lulu
Lulu Hyper
Lulu dxb
lulu airport
k.m trading
KM Trading
KM trade
K.M.  Trading
KM.Trading

From "Column A" I want to derive the following:

Column A
Carrefour
Carrefour
Carrefour
Carrefour
Carrefour
Carrefour
Lulu
Lulu
Lulu
Lulu
KM Trading
KM Trading
KM Trading
KM Trading
KM Trading

To do this, I wrote the following code:

import re

MERCHANT_NAME_DICT = {"lulu": "Lulu", "carrefour": "Carrefour",  "km": "KM Trading"}

def replace_merchant_name(row):
    """Provided a long merchant name replace it with short name."""
    processed_row = re.sub(r'\s+|\.', '', row.lower()).strip()
    for key, value in MERCHANT_NAME_DICT.items():
        if key in processed_row:
            return value

    return row

frame['MERCHANT_NAME'] = frame['MERCHANT_NAME'].astype(str)
frame.MERCHANT_NAME = frame.MERCHANT_NAME.apply(replace_merchant_name)

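However, I would like to use NLP logic and turn this into a generic function (instead of mapping with hard-coded values). I want to be able to call the generic function on any similar data column and get the desired result. I am quite new to NLP concepts, so I would appreciate some help with this.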

Note: Basically, I am looking for a generic NLP way of coding this, so that all similar words in a given column (or list) are found.

Answer 1

Dav*_*ale 12

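If you do not have a golden set of "correct" merchant names, this sounds like a clustering problem. It can be solved with a clever distance function (such as Jaro-Winkler from Jindrich's answer) and a simple clustering algorithm (e.g. agglomerative clustering).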

After clustering the texts, you can find the most representative text in each cluster and replace the whole cluster with it.

import numpy as np
import re
import textdistance
# we will need scikit-learn>=0.21
from sklearn.cluster import AgglomerativeClustering  

texts = [
  'Carrefour supermarket', 'Carrefour hypermarket', 'Carrefour', 'carrefour', 'Carrfour downtown', 'Carrfor market', 
  'Lulu', 'Lulu Hyper', 'Lulu dxb', 'lulu airport', 
  'k.m trading', 'KM Trading', 'KM trade', 'K.M.  Trading', 'KM.Trading'
]

def normalize(text):
  """ Keep only lower-cased text and numbers"""
  return re.sub('[^a-z0-9]+', ' ', text.lower())

def group_texts(texts, threshold=0.4): 
  """ Replace each text with the representative of its cluster"""
  normalized_texts = np.array([normalize(text) for text in texts])
  distances = 1 - np.array([
      [textdistance.jaro_winkler(one, another) for one in normalized_texts] 
      for another in normalized_texts
  ])
  clustering = AgglomerativeClustering(
    distance_threshold=threshold, # this parameter needs to be tuned carefully
    # note: scikit-learn >= 1.2 renames `affinity` to `metric` (`affinity` was removed in 1.4)
    affinity="precomputed", linkage="complete", n_clusters=None
  ).fit(distances)
  centers = dict()
  for cluster_id in set(clustering.labels_):
    index = clustering.labels_ == cluster_id
    centrality = distances[:, index][index].sum(axis=1)
    centers[cluster_id] = normalized_texts[index][centrality.argmin()]
  return [centers[i] for i in clustering.labels_]

print(group_texts(texts))

The code above prints the following output:

['carrefour', 'carrefour', 'carrefour', 'carrefour', 'carrefour', 'carrefour', 
 'lulu', 'lulu', 'lulu', 'lulu', 
 'km trading', 'km trading', 'km trading', 'km trading', 'km trading']
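
The grouped names above are lower-cased, while the question asks for spellings such as "Carrefour" and "KM Trading". One possible way to get those back, sketched here under the assumption that at least one original spelling normalizes exactly to each cluster representative, is to pick the most frequent such spelling. The helper name `restore_display_names` is hypothetical and not part of the answer's code; `normalize` repeats the answer's normalization so the sketch is self-contained.

```python
import re
from collections import Counter, defaultdict

def normalize(text):
    """Same normalization as in the answer: lower-case, keep letters and digits."""
    return re.sub('[^a-z0-9]+', ' ', text.lower())

def restore_display_names(original_texts, grouped_texts):
    """For each cluster representative, pick an original spelling to display."""
    spellings = defaultdict(Counter)
    for original, center in zip(original_texts, grouped_texts):
        # Only spellings that normalize exactly to the representative qualify.
        if normalize(original) == center:
            spellings[center][original] += 1
    # most_common(1) keeps the first-seen spelling when counts are tied.
    display = {
        center: counts.most_common(1)[0][0]
        for center, counts in spellings.items()
    }
    # Fall back to the normalized representative if no original matched it.
    return [display.get(center, center) for center in grouped_texts]

originals = ['Carrefour supermarket', 'Carrefour', 'carrefour', 'KM.Trading', 'KM Trading']
grouped = ['carrefour', 'carrefour', 'carrefour', 'km trading', 'km trading']
print(restore_display_names(originals, grouped))
```

With ties broken by first occurrence, the result depends on the row order, so a real dataset may need a more deliberate rule (for example, preferring capitalized spellings).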

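As a baseline, the group_texts function above will do. You may want to improve it by modifying the distance function so that it reflects your domain more closely. For example: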

take synonyms into account: supermarket=hypermarket=market
lemmatization (so that trading=trade)
give smaller weights to unimportant words (IDF?)

Unfortunately, most such adjustments are domain-specific, so you will have to tune them to your own dataset.
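
As a rough illustration of the first two adjustments, the sketch below extends the normalization step with small lookup tables, so that synonyms and word forms collapse to a single token before any distance is computed. The tables `SYNONYMS` and `LEMMAS` and the function name `normalize_with_domain_rules` are hypothetical; a real project would more likely use a proper lemmatizer and a curated synonym list.

```python
import re

# Hypothetical, domain-specific mappings; extend them for your own data.
SYNONYMS = {"supermarket": "market", "hypermarket": "market"}
LEMMAS = {"trading": "trade"}  # crude stand-in for a real lemmatizer

def normalize_with_domain_rules(text):
    """Lower-case, strip punctuation, then unify word forms and synonyms."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    tokens = [LEMMAS.get(token, token) for token in tokens]
    tokens = [SYNONYMS.get(token, token) for token in tokens]
    return " ".join(tokens)

print(normalize_with_domain_rules("Carrefour hypermarket"))
print(normalize_with_domain_rules("KM Trading"))
```

Such a function could be passed to the clustering code in place of its `normalize` step; whether it helps depends on how well the tables cover the vocabulary of the column.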

Answer 2

Jin*_*ich 5

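You could do the following: for each word that is not in a spell-checking vocabulary (and is therefore likely to be misspelled), look through your list of merchant names and see whether there is a name within a small edit distance. You can also normalize the words in some way for the similarity search, i.e., lowercase everything and remove punctuation.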

You can use the textdistance package, which implements plenty of string distances. For this, I would probably use the Jaro-Winkler distance.

import string
import textdistance

MERCHANT_NAMES = [("lulu", "Lulu"), ("carrefour", "Carrefour"),  ("km", "KM")]
DISTANCE_THRESHOLD = 0.9

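def normalize(orig_name):
    """Return the canonical merchant name closest to orig_name, if any is close enough."""
    # Strip punctuation and lower-case before comparing.
    name_sig = orig_name.translate(str.maketrans('', '', string.punctuation)).lower()

    # textdistance.jaro_winkler returns a similarity in [0, 1]; higher means closer.
    best_score = DISTANCE_THRESHOLD
    replacement = orig_name  # fall back to the input when nothing scores above the threshold

    for sig, name in MERCHANT_NAMES:
        similarity = textdistance.jaro_winkler(name_sig, sig)
        if similarity > best_score:
            best_score = similarity
            replacement = name
    return replacement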
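
To get a feel for the scores that the threshold is compared against, here is a minimal pure-Python sketch of the textbook Jaro-Winkler similarity. It is written for illustration only: the function names and the `prefix_weight` and `max_prefix` parameters are my own choices, and the textdistance implementation may differ in details.

```python
def jaro_similarity(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler_similarity(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    similarity = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return similarity + prefix * prefix_weight * (1 - similarity)

print(round(jaro_winkler_similarity("carrefour", "carrfour"), 3))
print(round(jaro_winkler_similarity("carrefour", "lulu"), 3))
```

Because of the prefix boost, names that start the same way score noticeably higher than names that only share characters further in, which is one reason this measure is popular for short names.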

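You will probably need to tune the threshold for acceptable replacements and do something about multi-word expressions (e.g., throw away words that are similar to "supermarket", "hypermarket", etc.).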
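
One possible way to deal with multi-word names is to compare each word of the name separately against the known signatures and keep the best match, so that trailing words such as "downtown" do not drag the score down. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity, not Jaro-Winkler, so the threshold is not comparable with the one above. The table `KNOWN_MERCHANTS`, the function name `best_token_match`, and the 0.8 threshold are assumptions for illustration.

```python
import string
from difflib import SequenceMatcher

# Hypothetical lookup table of (signature, canonical name) pairs.
KNOWN_MERCHANTS = [("lulu", "Lulu"), ("carrefour", "Carrefour"), ("km", "KM Trading")]

def best_token_match(orig_name, threshold=0.8):
    """Return the canonical name whose signature best matches any single word."""
    cleaned = orig_name.translate(str.maketrans("", "", string.punctuation)).lower()
    best_score, replacement = threshold, orig_name
    for token in cleaned.split():
        for signature, name in KNOWN_MERCHANTS:
            # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
            score = SequenceMatcher(None, token, signature).ratio()
            if score > best_score:
                best_score, replacement = score, name
    return replacement

for raw in ["Carrfour downtown", "lulu airport", "k.m trading", "Unknown Shop"]:
    print(raw, "->", best_token_match(raw))
```

Very short signatures such as "km" are fragile with this approach: a name like "KM.Trading" collapses into a single word after punctuation is removed and no longer matches, so such cases may need a separate rule.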
