Igo*_*kin 0 python search page-break python-docx
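I iterate over the document paragraph by paragraph, then split each paragraph's text into sentences on `. ` (a period followed by a space). I split the paragraph text into sentences because it makes the text search more efficient than searching the whole paragraph text.
The code then searches every word of each sentence for errors; the errors come from an error-correction database. A simplified version of the code is shown below: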
from docx.enum.text import WD_BREAK

# Assumed to exist already: document, write_report(), and
# error_dictionary (assumed here to map each error to its correction).
current_page_number = 1
replace_counter = 0

for paragraph in document.paragraphs:
    sentences = paragraph.text.split('. ')
    for sentence in sentences:
        words = sentence.split(' ')
        for word in words:
            for error, correction in error_dictionary.items():
                if error in word:
                    # (A) make a simple replacement
                    word = word.replace(error, correction, 1)
                    # (B) alternative replacement based on runs
                    for run in paragraph.runs:
                        if error in run.text:
                            run.text = run.text.replace(error, correction, 1)
                        # here we could read a page-break attribute and, knowing the
                        # current number, find out on which page the replacement took place
                        if run.page_break == WD_BREAK:  # hypothetical attribute: this is what the question asks about
                            current_page_number += 1
                    replace_counter += 1
                    # write to a report which paragraph and which page
                    write_report(error, correction, sentence, current_page_number)
                    # for that I need to know where the page breaks are
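The sentence-splitting and lookup steps above can be tried out without python-docx. The following is only a minimal sketch: the helper name `find_errors` is hypothetical, and it assumes `error_dictionary` is a plain dict mapping each error string to its correction, which the question does not actually state.

```python
def find_errors(paragraph_text, error_dictionary):
    """Yield (sentence, error, correction) for every error found inside a word.

    Assumption: error_dictionary maps each error string to its correction.
    """
    # Split the paragraph into sentences on a period followed by a space.
    for sentence in paragraph_text.split('. '):
        # Split each sentence into words on single spaces.
        for word in sentence.split(' '):
            for error, correction in error_dictionary.items():
                if error in word:
                    yield sentence, error, correction


if __name__ == "__main__":
    text = "I recieve mail. All good"
    for hit in find_errors(text, {"recieve": "receive"}):
        print(hit)
```

Keeping the lookup separate from the document traversal makes it possible to test the matching logic on plain strings before wiring it to runs and page tracking.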
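The question is: how can I tell whether a run (or some other paragraph element) contains a page break? Does `run.page_break == WD_BREAK` work? @scanny has already shown how to add a page break, but how do I detect one?
Ideally, it would also be possible to detect line breaks within a paragraph.
I can do this: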
br_counter = 0
for run in paragraph.runs:
    # br_lst holds the <w:br> child elements of the run
    for br in run._element.br_lst:
        br_counter += 1
        print(br.type)
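However, this code only reports hard breaks, i.e. breaks inserted with Ctrl+Enter. Soft page breaks are not detected. (A soft page break is the one that forms when the user keeps typing until the current page runs out and the text flows onto the next page.)
Any hints?
For both soft and hard page breaks, I now use the following: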
for run in paragraph.runs:
    # soft page break: recorded by Word as <w:lastRenderedPageBreak/>
    if 'lastRenderedPageBreak' in run._element.xml:
        print('soft page break found at run:', run.text[:20])
    # hard page break: <w:br w:type="page"/>
    if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
        print('hard page break found at run:', run.text[:20])
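The substring checks above can give false positives, for example when a run contains a plain line break and the text `type="page"` appears elsewhere in its XML. A slightly stricter variant is to parse the run's XML and inspect the elements themselves. The sketch below is an illustration, not part of python-docx: the helper name `count_breaks` is hypothetical, and it assumes `run._element.xml` yields the serialized `<w:r>` element as a string with its namespace declarations. It also separates out line breaks, which in WordprocessingML are `<w:br>` elements with no `w:type` or with `w:type="textWrapping"`.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_breaks(run_xml):
    """Return (soft_page, hard_page, line) break counts for one serialized run.

    Hypothetical helper; run_xml is assumed to be the XML string of a <w:r>
    element, such as the value of run._element.xml.
    """
    root = ET.fromstring(run_xml)
    # Soft page breaks are the markers Word stored when it last laid out the file.
    soft_page = len(list(root.iter(f"{{{W_NS}}}lastRenderedPageBreak")))
    hard_page = 0
    line = 0
    for br in root.iter(f"{{{W_NS}}}br"):
        br_type = br.get(f"{{{W_NS}}}type")
        if br_type == "page":
            hard_page += 1
        elif br_type in (None, "textWrapping"):
            line += 1
    return soft_page, hard_page, line


if __name__ == "__main__":
    sample = (
        f'<w:r xmlns:w="{W_NS}">'
        '<w:lastRenderedPageBreak/><w:t>text</w:t>'
        '<w:br w:type="page"/><w:br/>'
        '</w:r>'
    )
    print(count_breaks(sample))
```

With python-docx this would be called as `count_breaks(run._element.xml)` for each run. Note that `lastRenderedPageBreak` markers reflect the pagination at the time the file was last saved by an application that writes them, so the soft-break count may be stale or missing for documents produced by other tools.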