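I iterate over the document paragraph by paragraph, then split each paragraph's text into sentences on '. ' (a period followed by a space). I split the paragraph text into sentences because I expect that to make the text search more efficient than searching the whole paragraph text at once.
Then the code searches each word of a sentence for errors; the errors come from an error-correction database. A simplified version of the code is shown below: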
from docx.enum.text import WD_BREAK

# Assumes `document` is an already opened python-docx Document and that
# `error_dictionary` maps each error string to its correction.
for paragraph in document.paragraphs:
    sentences = paragraph.text.split('. ')
    for sentence in sentences:
        words = sentence.split(' ')
        for word in words:
            for error, correction in error_dictionary.items():
                if error in word:
                    # (A) make simple replacement
                    # (this only rebinds the local name; the document is unchanged)
                    word = word.replace(error, correction, 1)
                    # (B) alternative replacement based on runs
                    for run in paragraph.runs:
                        if error in run.text:
                            run.text = run.text.replace(error, correction, 1)
                            # here we may fetch page break attribute and knowing current number …
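To make the sentence and word splitting easier to try without a .docx file, here is a minimal sketch that applies the same logic to a plain string. The function name `correct_paragraph_text` and the sample corrections are hypothetical and are not part of the original code. The sketch assumes the corrections are available as a dict that maps each error to its replacement.

```python
def correct_paragraph_text(text, corrections):
    """Return (corrected_text, hits) for one paragraph's plain text.

    `corrections` is assumed to map an erroneous substring to its replacement.
    `hits` lists the (error, correction) pairs that were applied, in order.
    """
    hits = []
    corrected_sentences = []
    for sentence in text.split('. '):
        words = sentence.split(' ')
        for i, word in enumerate(words):
            for error, correction in corrections.items():
                if error in word:
                    # Write the result back into the list; rebinding `word`
                    # alone would leave the sentence unchanged.
                    word = word.replace(error, correction, 1)
                    words[i] = word
                    hits.append((error, correction))
        corrected_sentences.append(' '.join(words))
    # Rejoin with the same separator that was used for splitting.
    return '. '.join(corrected_sentences), hits


sample = 'I recieve teh report. It is fine.'
fixed, hits = correct_paragraph_text(sample, {'recieve': 'receive', 'teh': 'the'})
print(fixed)  # prints: I receive the report. It is fine.
```

Working on plain text like this loses the run boundaries, so in a real document the result would still have to be written back run by run, as in variant (B), to keep the formatting.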