How to efficiently apply spelling correction to a large text corpus in Python

Question

Sam*_* S. -1 python text-processing spell-checking spelling

Consider the following code for spelling correction:

from autocorrect import spell
import re

WORD = re.compile(r'\w+')
def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]
def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))      
    return sptext    

print(spell_correct(text))

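Here is the output of the code above:

How can I stop the output from being displayed in a Jupyter notebook? With a large number of text documents, it produces a huge amount of output.

My second question: how can I improve the speed and accuracy of this code when applying it to big data (for example, check the word "veri" in the output)? Is there a better way to do this? I would appreciate any faster (alternative) solutions.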

Answer 1

MrN*_*y33 6

As @khelwood said in the comments, you should use autocorrect.Speller:

from autocorrect import Speller
import re


spell=Speller(lang="en")
WORD = re.compile(r'\w+')
def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]
def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))      
    return sptext    

print(spell_correct(text)) 

#Output
#['hi welcome to spelling', 'this is just an example but consider a veri big corpus']

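As an alternative, you can use a list comprehension to improve speed, and you can switch to the pyspellchecker library, which improves accuracy in this example for the word 'veri':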

from spellchecker import SpellChecker
import re

WORD = re.compile(r'\w+')
spell = SpellChecker()

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
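    # correction() may return None in newer pyspellchecker releases; fall back to the original token.
    sptext = [' '.join((spell.correction(w) or w).lower() for w in reTokenize(doc)) for doc in text]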
    return sptext    

print(spell_correct(text))

Output:

['hi welcome to spelling', 'this is just an example but consider a very big corpus']
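
On a large corpus most tokens repeat, so one further speed-up is to correct each distinct token only once and reuse the result. The following is a minimal sketch of that idea, not part of either library: `spell_correct_unique`, `toy_correct` and `TOY_FIXES` are hypothetical names, and the toy corrector only stands in for a real one so the example runs without third-party packages.

```python
import re

WORD = re.compile(r'\w+')

def spell_correct_unique(docs, correct):
    """Correct every distinct token once, then rebuild the documents from a lookup table."""
    tokenized = [WORD.findall(doc) for doc in docs]
    vocabulary = {token for tokens in tokenized for token in tokens}
    # `correct(token) or token` keeps the original token if the corrector returns None.
    corrections = {token: (correct(token) or token).lower() for token in vocabulary}
    return [' '.join(corrections[token] for token in tokens) for tokens in tokenized]


# Toy corrector (hypothetical) that records how often it is called.
TOY_FIXES = {'jsut': 'just', 'exapmle': 'example', 'coprus': 'corpus'}
calls = []

def toy_correct(word):
    calls.append(word)
    return TOY_FIXES.get(word, word)

docs = ["This is jsut an exapmle.", "jsut a coprus, jsut an exapmle."]
result = spell_correct_unique(docs, toy_correct)
print(result)
print(len(calls), "corrector calls for", sum(len(WORD.findall(d)) for d in docs), "tokens")
```

With a real library you would pass its corrector instead of the toy one, for example `spell.correction` from the pyspellchecker snippet above or the `Speller` instance from the autocorrect snippet. The saving grows with the amount of repetition in the corpus.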

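
As for the notebook output: Jupyter only displays what a cell prints or the value of its last expression, so assigning the result to a variable and printing a small sample keeps the cell quiet. This is a sketch under assumptions: `preview` is a hypothetical helper, and the `corrected` list below is a stand-in for the result of `spell_correct(text)` on a large corpus.

```python
def preview(docs, n=3, width=80):
    """Return at most n documents, each truncated to width characters."""
    return [doc if len(doc) <= width else doc[:width - 3] + '...' for doc in docs[:n]]

# Stand-in (hypothetical) for the result of spell_correct(text) on a big corpus.
corrected = ['this is just an example but consider a very big corpus'] * 10_000

# The assignment above displays nothing; only the short sample below is printed.
for line in preview(corrected, n=2, width=30):
    print(line)
```

In a notebook, the `%%capture` cell magic or a trailing semicolon on the last expression are other ways to suppress a cell's output.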

Archived:	5 years, 6 months ago
Views:	5349
Last activity:	5 years, 6 months ago