Getting the list numbers of list items in a docx file using Python-Docx

Fre*_*ick 9 python ms-word python-3.x python-docx

When I access the paragraph text, it does not include the numbering of the list.

Current code:

from docx import Document

document = Document("C:/Foo.docx")
for p in document.paragraphs:
    print(p.text)

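List in the docx file:

Numbered list

What I expect:
(1) The naturalization of both ...
(2) The naturalization of ...
(3) The naturalization of ...

What I get:
The naturalization of both ...
The naturalization of ...
The naturalization of ...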

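After inspecting the document's XML, I found that the list numbers are stored in w:abstractNum, but I don't know how to access them or how to connect them to the right list items. How can I access the number of each list item in python-docx so that I can include it in my output? Is there also a way to determine the correct nesting of these lists using python-docx?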

Cri*_*ati 7

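According to [ReadTheDocs.Python-DocX]: Style-related objects - _NumberingStyle objects, this functionality is not implemented yet.
The alternative (one of them, at least), [PyPI]: docx2python, handles these elements poorly (mainly because it returns everything converted to strings).

So, one solution is to parse the XML files manually. I worked out how to do that empirically, using this very example. A good place for documentation is Office Open XML (I don't know whether it is a standard followed by all the tools that handle .docx files, MS Word in particular):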

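  • Get each paragraph (w:p node) from word/document.xml
    • Check whether it is a numbered item (it has a w:pPr -> w:numPr subnode)
    • Get the number style Id and the level: the w:val attribute of the w:numId and w:ilvl subnodes (of the node from the previous bullet)
    • Match the 2 values (in word/numbering.xml) against:
      • The w:abstractNumId attribute of the w:abstractNum node
      • The w:ilvl attribute of the w:lvl subnode
      and get the w:val attribute of the corresponding w:numFmt and w:lvlText subnodes (note that bullets are included as well; they can be told apart by the bullet value of the w:numFmt attribute mentioned above)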
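The steps above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the code used in this answer: the helper names (`w_attr`, `load_numbering`, `list_items`) are made up for illustration. Per the WordprocessingML schema, a paragraph's w:numId refers to a w:num element in word/numbering.xml, whose w:abstractNumId child in turn points to the w:abstractNum, so the sketch resolves that indirection. It ignores w:lvlOverride and numbering inherited from paragraph styles.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w_attr(node, name):
    """Return the w:-namespaced attribute `name` of `node` (None if absent)."""
    return node.get("{%s}%s" % (W, name))


def load_numbering(numbering_xml):
    """Map (numId, ilvl) -> (numFmt, lvlText) from word/numbering.xml content."""
    root = ET.fromstring(numbering_xml)
    abstract = {}  # abstractNumId -> {ilvl: (numFmt, lvlText)}
    for an in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in an.findall("w:lvl", NS):
            fmt = lvl.find("w:numFmt", NS)
            txt = lvl.find("w:lvlText", NS)
            levels[w_attr(lvl, "ilvl")] = (
                w_attr(fmt, "val") if fmt is not None else None,
                w_attr(txt, "val") if txt is not None else None,
            )
        abstract[w_attr(an, "abstractNumId")] = levels
    table = {}
    # Each w:num links a numId (used by paragraphs) to an abstractNumId.
    for num in root.findall("w:num", NS):
        ref = num.find("w:abstractNumId", NS)
        if ref is None:
            continue
        for ilvl, info in abstract.get(w_attr(ref, "val"), {}).items():
            table[(w_attr(num, "numId"), ilvl)] = info
    return table


def list_items(document_xml, numbering):
    """Yield (text, level, numFmt, lvlText) for each w:p in word/document.xml.

    The last three values are None for paragraphs without w:numPr.
    """
    root = ET.fromstring(document_xml)
    for p in root.iter("{%s}p" % W):
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W))
        num_id = p.find("w:pPr/w:numPr/w:numId", NS)
        ilvl = p.find("w:pPr/w:numPr/w:ilvl", NS)
        if num_id is None:
            yield text, None, None, None
            continue
        # Assumption: a missing w:ilvl means the top level (0).
        level = w_attr(ilvl, "val") if ilvl is not None else "0"
        key = (w_attr(num_id, "val"), level)
        fmt, lvl_text = numbering.get(key, (None, None))
        yield text, int(level), fmt, lvl_text
```

With a real file, the two XML parts can be read through `zipfile.ZipFile(path).read("word/numbering.xml")` and `.read("word/document.xml")` and passed to these functions as bytes. Items whose numFmt is `bullet` are bullets rather than numbered entries.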

However, this seems pretty complex, so I came up with a workaround (gainarie) that takes advantage of docx2python's partial support.

Test document (sample.docx - created with LibreOffice):

Img0

code00.py:

#!/usr/bin/env python

import sys
import docx
from docx2python import docx2python as dx2py


def ns_tag_name(node, name):
    if node.nsmap and node.prefix:
        return "{{{:s}}}{:s}".format(node.nsmap[node.prefix], name)
    return name


def descendants(node, desc_strs):
    if node is None:
        return []
    if not desc_strs:
        return [node]
    ret = {}
    for child_str in desc_strs[0]:
        for child in node.iterchildren(ns_tag_name(node, child_str)):
            descs = descendants(child, desc_strs[1:])
            if not descs:
                continue
            cd = ret.setdefault(child_str, [])
            if isinstance(descs, list):
                cd.extend(descs)
            else:
                cd.append(descs)
    return ret


def simplified_descendants(desc_dict):
    ret = []
    for vs in desc_dict.values():
        for v in vs:
            if isinstance(v, dict):
                ret.extend(simplified_descendants(v))
            else:
                ret.append(v)
    return ret


def process_list_data(attrs, dx2py_elem):
    #print(simplified_descendants(attrs))
    desc = simplified_descendants(attrs)[0]
    level = int(desc.attrib[ns_tag_name(desc, "val")])
    elem = [i for i in dx2py_elem[0].split("\t") if i][0]#.rstrip(")")
    return "    " * level + elem + " "


def main(*argv):
    fname = r"./sample.docx"
    docd = docx.Document(fname)
    docdpy = dx2py(fname)
    dr = docdpy.docx_reader
    #print(dr.files)  # !!! Check word/numbering.xml !!!
    docdpy_runs = docdpy.document_runs[0][0][0]
    if len(docd.paragraphs) != len(docdpy_runs):
        print("Lengths don't match. Abort")
        return -1
    subnode_tags = (("pPr",), ("numPr",), ("ilvl",))  # (("pPr",), ("numPr",), ("ilvl", "numId"))  # numId is for matching elements from word/numbering.xml
    for idx, (par, l) in enumerate(zip(docd.paragraphs, docdpy_runs)):
        #print(par.text, l)
        numbered_attrs = descendants(par._element, subnode_tags)
        #print(numbered_attrs)
        if numbered_attrs:
            print(process_list_data(numbered_attrs, l) + par.text)
        else:
            print(par.text)


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

Output:

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q066374154]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32

Doc title
doc subtitle

heading1 text0

Paragr0 line0
Paragr0 line1
Paragr0 line2

space Paragr0 line3
a) aa (numbered)
heading1 text1
Paragrx line0
Paragrx line1
        a)      w tabs Paragrx line2 (NOT numbered – just to mimic 1ax below)

1) paragrx 1x (numbered)
    a) paragrx 1ax (numbered)
        I) paragrx 1aIx (numbered)
    b) paragrx 1bx (numbered)
2) paragrx 2x (numbered)
3) paragrx 3x (numbered)

-- paragrx bullet 0
    -- paragrx bullet 00

paragxx text

Done.

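Notes:

  • Only nodes from word/document.xml are processed (via the paragraph's _element (LXML node) attribute)
  • Some list attributes are not captured (due to docx2python's limitations)
  • This is far from robust
  • descendants and simplified_descendants could be greatly simplified, but I wanted to keep the former as generic as possible (in case the functionality needs to be extended)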
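If the number format and level text of each level are known (for instance, read from word/numbering.xml), labels such as "(1)" or "a)" can be rebuilt by keeping one running counter per list and level. The following is a rough sketch, not part of the workaround above. `ListCounter`, `format_number` and `to_roman` are hypothetical names, `levels` is an assumed mapping of level index to (numFmt, lvlText), every level is assumed to start at 1 (w:start is ignored), and only a few numFmt values are covered.

```python
import re


def to_roman(n):
    """Convert a positive integer to an upper-case Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def format_number(n, num_fmt):
    """Render counter `n` in one of a few common w:numFmt values."""
    if num_fmt in ("lowerLetter", "upperLetter"):
        # Assumption: past "z" the letter is repeated (aa, bb, ...).
        s = chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)
        return s if num_fmt == "lowerLetter" else s.upper()
    if num_fmt in ("lowerRoman", "upperRoman"):
        s = to_roman(n)
        return s.lower() if num_fmt == "lowerRoman" else s
    return str(n)  # "decimal" and any format not covered here


class ListCounter:
    """Keeps running counters per list (numId) and level."""

    def __init__(self):
        self.counters = {}  # numId -> list of counts, one per level

    def label(self, num_id, ilvl, levels):
        """Return the label of the next item of list `num_id` at level `ilvl`."""
        counts = self.counters.setdefault(num_id, [])
        while len(counts) <= ilvl:
            counts.append(0)
        counts[ilvl] += 1
        del counts[ilvl + 1:]  # deeper levels restart after a shallower item
        fmt, lvl_text = levels[ilvl]
        if fmt == "bullet":
            return lvl_text

        def repl(match):
            # %1 refers to level 0, %2 to level 1, and so on.
            k = int(match.group(1)) - 1
            return format_number(max(counts[k], 1), levels[k][0])

        return re.sub(r"%(\d)", repl, lvl_text)
```

Calling `label` once per numbered paragraph, in document order, yields the labels to prepend to each paragraph's text; the level index also gives the nesting depth.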