How to get all fuzzy matching substrings between two strings in Python?

San*_*ta7 5 python string fuzzy-search

Suppose I have three sample strings:

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

If I get all the matching substrings of text2 and text3 against text1, I get:

text1_text2_common = [
    '3 days ago.',
]

text2_text3_common = [
    'of',
]

text1_text3_common = [
    'was',
    'idx',
    'every',
    'hours',
]

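What I am looking for is fuzzy matching, using something like Levenshtein distance. So even if the substrings are not exactly equal, a substring should be selected when it is similar enough according to some criterion.

So ideally I am looking for something like this: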

text1_text3_common_fuzzy = [
    'prescription of idx, 20mg to be given every four hours'
]
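For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of that definition; the function names `levenshtein` and `similarity` are illustrative, and the normalization to a 0..1 score is one common convention, not the only one.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("20mg", "20 mg"))
print(similarity("four hours", "4 hours"))
```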

fer*_*rdy 7

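The code below computes similarity as the fuzzy ratio between substrings of string1 and the full string2. It can also handle substrings of string2 against the full string1, as well as substrings of string1 against substrings of string2.

It uses nltk to generate the ngrams.

Typical algorithm: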

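  1. Generate ngrams from the first given string.
    Example:
    text2 = "The time of discomfort was 3 days ago."
    total length = 8 (words)

    In the code, param takes the values 5, 6, 7, 8.
    param = 5
    ngrams = ['The time of discomfort was', 'time of discomfort was 3', 'of discomfort was 3 days', 'discomfort was 3 days ago.']

  2. Compare each ngram with the second string.
    Example:
    text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."

    @param=5

      • Compare 'The time of discomfort was' with text1 and get the fuzzy score.
      • Compare 'time of discomfort was 3' with text1 and get the fuzzy score.
      • And so on until all elements in ngrams_5 are done.
        Save the substring if its fuzzy score is greater than or equal to the given threshold.

    @param=6

      • Compare 'The time of discomfort was 3' with text1 and get the fuzzy score.
      • And so on.

    Continue until @param=8.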
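The two steps above can be sketched without any third-party packages. This is only an illustration of the loop structure: `word_ngrams` and `scored_substrings` are hypothetical helper names, and the scorer is `difflib.SequenceMatcher` from the standard library as a stand-in, so its scores will differ from the fuzzywuzzy scores reported in this answer.

```python
from difflib import SequenceMatcher


def word_ngrams(text: str, n: int) -> list:
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def scored_substrings(sub: str, full: str, n_start: int = 5, threshold: int = 50) -> list:
    """Score each n-gram of `sub` (n = n_start .. word count) against all of `full`."""
    results = []
    total = len(sub.split())
    for n in range(n_start, total + 1):
        for gram in word_ngrams(sub, n):
            # Stand-in scorer (assumption): difflib ratio scaled to 0..100.
            ratio = SequenceMatcher(None, gram.lower(), full.lower()).ratio()
            score = round(100 * ratio)
            if score >= threshold:
                results.append((gram, score))
    return results


text1 = ("Patient has checked in for abdominal pain which started 3 days ago. "
         "Patient was prescribed idx 20 mg every 4 hours.")
text2 = "The time of discomfort was 3 days ago."

print(word_ngrams(text2, 5))
for gram, score in scored_substrings(text2, text1, threshold=0):
    print(f"{score:3d}  {gram}")
```

With `threshold=0` every candidate is kept, which makes it easy to see how many substrings the 5 to 8 word windows produce before any filtering.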

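You can modify the code by changing n_start to something like 5, so that ngrams of string1 are compared with ngrams of string2. In that case it is a comparison of substrings of string1 against substrings of string2.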

# Generate ngrams for string2
n_start = 5  # st2_length
for n in range(n_start, st2_length + 1):
    ...

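Lowering n_start makes the number of comparisons grow quickly, because every ngram of one string is paired with every ngram of the other. The sketch below only counts the pairs for the sample strings; `ngram_count` is a hypothetical helper, not part of the code in this answer.

```python
def ngram_count(word_count: int, n_start: int) -> int:
    """Number of word n-grams with n_start <= n <= word_count."""
    return sum(word_count - n + 1 for n in range(n_start, word_count + 1))


text1 = ("Patient has checked in for abdominal pain which started 3 days ago. "
         "Patient was prescribed idx 20 mg every 4 hours.")
text2 = "The time of discomfort was 3 days ago."

len1 = len(text1.split())
len2 = len(text2.split())

# Substrings of text2 (5 words and up) against the full text1 only.
sub_vs_full = ngram_count(len2, 5) * ngram_count(len1, len1)

# Substrings of text2 against substrings of text1 (n_start = 5 on both sides).
sub_vs_sub = ngram_count(len2, 5) * ngram_count(len1, 5)

print(f"substring vs full string: {sub_vs_full} comparisons")
print(f"substring vs substring:   {sub_vs_sub} comparisons")
```

For longer documents it may be worth capping the maximum ngram length as well, since the pair count grows roughly with the product of the squared word counts.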

For the comparison, I use:

fratio = fuzz.token_set_ratio(fs1, fs2)

Have a look at this as well. You can also try the other ratio functions.
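To give a feel for why a token-set style ratio tolerates reordered and extra words, here is a simplified pure-Python approximation of the idea: compare the sorted shared tokens against each string's sorted shared-plus-remaining tokens and keep the best score. This is a sketch under assumptions, not fuzzywuzzy's implementation: `simple_token_set_ratio` is a hypothetical name, tokenization is ASCII-only, and `difflib` replaces the Levenshtein-based ratio, so the numbers will not match `fuzz.token_set_ratio` exactly.

```python
import re
from difflib import SequenceMatcher


def _tokens(text: str) -> set:
    """Lowercase the text and return its set of ASCII alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _ratio(a: str, b: str) -> int:
    """difflib similarity scaled to 0..100 (stand-in for a plain fuzzy ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_token_set_ratio(s1: str, s2: str) -> int:
    """Approximate a token-set ratio: word order and duplicates are ignored."""
    t1, t2 = _tokens(s1), _tokens(s2)
    if not t1 or not t2:
        return 0
    shared = " ".join(sorted(t1 & t2))
    combined1 = (shared + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (shared + " " + " ".join(sorted(t2 - t1))).strip()
    return max(
        _ratio(shared, combined1),
        _ratio(shared, combined2),
        _ratio(combined1, combined2),
    )


print(simple_token_set_ratio("every four hours", "hours four every"))
print(simple_token_set_ratio("idx, 20mg to be given every four hours",
                             "Patient was prescribed idx 20 mg every 4 hours."))
```

One consequence of this design is that a string whose tokens are a subset of the other string's tokens scores 100, which is worth keeping in mind when choosing a threshold.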

For your sample, 'prescription of idx, 20mg to be given every four hours' gets a fuzzy score of 52.

See the sample console output.

7                    prescription of idx, 20mg to be given every four hours           52

Code

"""
fuzzy_match.py

/sf/ask/5041200251/

Dependent modules:
    pip install pandas
    pip install nltk
    pip install fuzzywuzzy
    pip install python-Levenshtein

"""


from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz


# Sample strings.
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings from st1 and compare with st2.
    The sub-strings, full string and fuzzy ratio will be saved in csv file.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Generate ngrams for string1
    m_start = 5
    for m in range(m_start, st1_length + 1):  # st1_length >= m_start

        # If m=3, fs1 = 'Patient has checked', 'has checked in', 'checked in for' ...
        # If m=5, fs1 = 'Patient has checked in for', 'has checked in for abdominal', ...
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)
            
            # Generate ngrams for string2
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)  # there are other ratios

                    # Save sub string if ratio is within threshold.
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data


def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: is a string where we extract the sub-string.
    full: is a string as the base/reference.
    threshold: is the minimum fuzzy ratio where we will save the sub string. Max fuzz ratio is 100.
    """   
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print to console. Show only the sub-string and the fuzzy ratio. High ratio implies high similarity.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()


def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)  # output string2_string1.csv
    get_match(text3, text1, 'string3', 'string1', threshold=50)

    get_match(text2, text3, 'string2', 'string3', threshold=10)

    # Other param combo.


if __name__ == '__main__':
    main()
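If only the few best candidates are needed rather than the whole table, the rows returned by `myprocess` (lists of the form [substring, reference, ratio]) can be ranked directly without pandas. The helper below is a hypothetical sketch; the tie-break in favour of longer substrings is an assumption about what is usually wanted, and the sample rows are taken from the console output of this answer.

```python
def top_matches(rows, top_n=3):
    """Return the top_n rows by ratio, preferring longer substrings on ties."""
    ranked = sorted(rows, key=lambda row: (row[2], len(row[0])), reverse=True)
    return ranked[:top_n]


text1 = ("Patient has checked in for abdominal pain which started 3 days ago. "
         "Patient was prescribed idx 20 mg every 4 hours.")

# A few [substring, reference, ratio] rows copied from the sample output.
rows = [
    ["to be given every four hours", text1, 56],
    ["be given every four hours", text1, 61],
    ["John was given a prescription of idx, 20mg to be given every four hours", text1, 56],
    ["idx, 20mg to be given every four hours", text1, 58],
]

for substring, _, ratio in top_matches(rows):
    print(f"{ratio:3d}  {substring}")
```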

Console output

                                  string2  fuzzy_ratio
0              discomfort was 3 days ago.           72
1           of discomfort was 3 days ago.           67
2      time of discomfort was 3 days ago.           60
3                of discomfort was 3 days           59
4  The time of discomfort was 3 days ago.           55
5           time of discomfort was 3 days           51

sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                                                    string3  fuzzy_ratio
0                                                 be given every four hours           61
1                                    idx, 20mg to be given every four hours           58
2        was given a prescription of idx, 20mg to be given every four hours           56
3                                              to be given every four hours           56
4   John was given a prescription of idx, 20mg to be given every four hours           56
5                                 of idx, 20mg to be given every four hours           55
6              was given a prescription of idx, 20mg to be given every four           52
7                    prescription of idx, 20mg to be given every four hours           52
8            given a prescription of idx, 20mg to be given every four hours           52
9                  a prescription of idx, 20mg to be given every four hours           52
10        John was given a prescription of idx, 20mg to be given every four           52
11                                              idx, 20mg to be given every           51
12                                        20mg to be given every four hours           50

sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                  string2  fuzzy_ratio
0      time of discomfort was 3 days ago.           41
1           time of discomfort was 3 days           41
2                time of discomfort was 3           40
3                of discomfort was 3 days           40
4  The time of discomfort was 3 days ago.           40
5           of discomfort was 3 days ago.           39
6       The time of discomfort was 3 days           39
7              The time of discomfort was           38
8            The time of discomfort was 3           35
9              discomfort was 3 days ago.           34

sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours

Sample CSV output

string2_string1.csv


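Using spaCy similarity

Here is the result of using spaCy to compare substrings of text3 against the full text of text1.

The result below is meant to be compared with the second table above, to see which method gives a better similarity ranking.

I used the large model (en_core_web_lg) to get the result below.

Code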

import spacy
import pandas as pd


nlp = spacy.load("en_core_web_lg")

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

text3_sub = [
    'be given every four hours', 'idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four hours',
    'to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four hours',
    'of idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four',
    'prescription of idx, 20mg to be given every four hours',
    'given a prescription of idx, 20mg to be given every four hours',
    'a prescription of idx, 20mg to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four',
    'idx, 20mg to be given every',
    '20mg to be given every four hours'
]


data = []
doc2 = nlp(text1)  # parse the reference text once, outside the loop
for s in text3_sub:
    doc1 = nlp(s)
    sim = round(doc1.similarity(doc2), 3)
    data.append([s, text1, sim])

df = pd.DataFrame(data)
df.columns = ['from text3', 'text1', 'similarity']
df = df.sort_values(by=['similarity'], ascending=[False])
df = df.reset_index(drop=True)

df1 = df[['from text3', 'similarity']]
print(df1.to_string())

print()
print(f'text3: {text3}')
print(f'text1: {text1}')

Output

                                                                 from text3  similarity
0        was given a prescription of idx, 20mg to be given every four hours       0.904
1   John was given a prescription of idx, 20mg to be given every four hours       0.902
2                  a prescription of idx, 20mg to be given every four hours       0.895
3                    prescription of idx, 20mg to be given every four hours       0.893
4            given a prescription of idx, 20mg to be given every four hours       0.892
5                                 of idx, 20mg to be given every four hours       0.889
6                                    idx, 20mg to be given every four hours       0.883
7              was given a prescription of idx, 20mg to be given every four       0.879
8         John was given a prescription of idx, 20mg to be given every four       0.877
9                                         20mg to be given every four hours       0.877
10                                              idx, 20mg to be given every       0.835
11                                             to be given every four hours       0.834
12                                                be given every four hours       0.832

text3: John was given a prescription of idx, 20mg to be given every four hours
text1: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

It looks like the spaCy method produces a good similarity ranking.
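As background for why the spaCy scores behave differently from the fuzzy ratios: by default, spaCy's Doc.similarity is, as far as I know, a cosine similarity over averaged word vectors, so it reflects word meaning rather than character overlap. The sketch below illustrates that idea only. The 3-dimensional vectors in `toy_vectors` are made up for the example (real models use hundreds of dimensions), and `average`, `cosine`, and `doc_similarity` are hypothetical helper names.

```python
import math


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors (assumption): "four" and "4" are deliberately close.
toy_vectors = {
    "every": [0.9, 0.1, 0.0],
    "four": [0.2, 0.8, 0.1],
    "4": [0.25, 0.75, 0.1],
    "hours": [0.1, 0.2, 0.9],
}


def doc_similarity(words1, words2):
    """Cosine similarity between the averaged vectors of two word lists."""
    doc1 = average([toy_vectors[w] for w in words1])
    doc2 = average([toy_vectors[w] for w in words2])
    return cosine(doc1, doc2)


print(round(doc_similarity(["every", "four", "hours"], ["every", "4", "hours"]), 3))
```

Because averaging smooths out individual words, long substrings tend to score close to one another, which is consistent with the narrow range of similarities in the table above.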