How to normalize Persian text with Hazm

Question

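I have a folder that contains several subfolders, each holding many text files. I need to extract the 5 words before and after a specific word, and the following code works fine.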

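The problem is that, because I have not normalized the text, it returns only a few sentences although there are many more. For Persian there is a module called hazm for normalizing text. How can I use it in this code?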

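An example of normalization: "ك" should be changed to "ک", and "ي" should be changed to "ی", because the first character of each pair is actually an Arabic letter that gets used in Persian text. Without normalization, the code only returns words written in the second (Persian) form and does not recognize words written in the first (Arabic) form.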

import os
from hazm import Normalizer


def getRollingWindow(seq, w):
    win = [next(seq) for _ in range(11)]
    yield win
    for e in seq:
        win[:-1] = win[1:]
        win[-1] = e
        yield win


def extractSentences(rootDir, searchWord):
    with open("????", "w", encoding="utf-8") as outfile:
        for root, _dirs, fnames in os.walk(rootDir):
            for fname in fnames:
                print("Looking in", os.path.join(root, fname))
                with open(os.path.join(root, fname), encoding="utf-8") as infile:
                    #normalizer = Normalizer()
                    #fname = normalizer.normalize(fname)
                    for window in getRollingWindow((word for line in infile for word in line(normalizer.normalize(line)).split()), 11):
                        if window[5] != searchWord:
                            continue
                        outfile.write(' '.join(window) + "\n")

Answer 1

Ami*_*mir 6

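I have no experience with Hazm, but the text can easily be normalized with the following code. (Note that here we only replace Arabic characters with their Persian equivalents.)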

import re


def clean_sentence(sentence):
    sentence = arToPersianChar(sentence)
    sentence = arToPersianNumb(sentence)
    # more_normalization_function()
    return sentence


def arToPersianNumb(number):
    dic = {
        '١': '۱',
        '٢': '۲',
        '٣': '۳',
        '٤': '۴',
        '٥': '۵',
        '٦': '۶',
        '٧': '۷',
        '٨': '۸',
        '٩': '۹',
        '٠': '۰',
    }
    return multiple_replace(dic, number)


def arToPersianChar(userInput):
    dic = {
        'ك': 'ک',
        'دِ': 'د',
        'بِ': 'ب',
        'زِ': 'ز',
        'ذِ': 'ذ',
        'شِ': 'ش',
        'سِ': 'س',
        'ى': 'ی',
        'ي': 'ی'
    }
    return multiple_replace(dic, userInput)


def multiple_replace(dic, text):
    pattern = "|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], str(text))
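
Because the Arabic and Persian glyphs look almost identical on screen, it can help to spell the mapping out as Unicode code points. The following is only a sketch of the same idea for single characters, using `str.translate` instead of a regular expression; the names `AR_TO_FA` and `to_persian_chars` are made up for illustration, and the table is an illustrative subset, not a complete normalizer.

```python
# Arabic code points mapped to their Persian counterparts (illustrative subset).
AR_TO_FA = {
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06cc",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
}
# Arabic-Indic digits U+0660..U+0669 -> Extended Arabic-Indic digits U+06F0..U+06F9
AR_TO_FA.update({chr(0x0660 + i): chr(0x06F0 + i) for i in range(10)})

# str.maketrans accepts a dict whose keys are single characters.
TRANSLATION_TABLE = str.maketrans(AR_TO_FA)


def to_persian_chars(text: str) -> str:
    """Replace the mapped Arabic characters in one pass; leave everything else alone."""
    return text.translate(TRANSLATION_TABLE)
```

A translation table only handles one-character keys, so multi-character replacements (a letter followed by a diacritic, for instance) would still need something like `multiple_replace()`.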

Then just read each line of the document and pass it to clean_sentence():

def clean_all(document):
    clean = ''
    for sentence in document:
        sentence = clean_sentence(sentence)
        clean += ' \n' + sentence
    return clean
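
To connect this back to the question, here is a rough sketch of normalizing each line before the windowed search. The function name `context_windows` and its parameters are hypothetical. `normalize` is any callable from string to string, for example `clean_sentence` above or, as in the question's code, `Normalizer().normalize` from hazm. The search word is normalized as well, so both sides of the comparison use the same form.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List


def context_windows(
    lines: Iterable[str],
    search_word: str,
    normalize: Callable[[str], str],
    size: int = 5,
) -> Iterator[List[str]]:
    """Yield `size` words before and after each occurrence of `search_word`."""
    target = normalize(search_word)  # normalize the query the same way as the text
    width = 2 * size + 1
    window = deque(maxlen=width)  # oldest word drops out automatically
    for line in lines:
        for word in normalize(line).split():
            window.append(word)
            # Only full windows are checked, so matches closer than `size`
            # words to the start or end of the input are skipped.
            if len(window) == width and window[size] == target:
                yield list(window)
```

Inside the question's `extractSentences`, the open file object `infile` could be passed as `lines`, and each yielded window joined with spaces and written to the output file.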
