Extracting DOCX comments

Question


And*_*ltz 2 python xml docx google-docs openxml

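I'm a teacher. I want a list of all the students who commented on the essay I assigned, along with what they said. The Drive API stuff was too challenging for me, but I figured I could download the documents as a zip and parse the XML.

The comments are tagged in w:comment tags, with w:t for the comment text. This should be easy, but XML (etree) is killing me.

Following a tutorial (and the official Python docs):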

z = zipfile.ZipFile('test.docx')
x = z.read('word/comments.xml')
tree = etree.XML(x)


Then I do this:

children = tree.getiterator()
for c in children:
    print(c.attrib)


The result is this:

{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}author': 'Joe Shmoe', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id': '1', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}date': '2017-11-17T16:58:27Z'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidDel': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidP': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr': '00000000'}
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}


After this, I'm completely stuck. I tried element.get() and element.findall() with no luck. Even when I copy/paste the value ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'), I get None in return.

Can anyone help?

Answer 1

Sha*_*kla 12

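Thanks to @kjhughes for the amazing answer on extracting all comments from a document file. Like others in this thread, I ran into the problem of getting the text that each comment refers to. I took @kjhughes's code as a base and tried to solve this with python-docx. Here is my take on it.

Sample document.

I will extract each comment together with the paragraph it references in the document.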

from docx import Document
from lxml import etree
import zipfile

ooXMLns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# Function to extract all the comments of a document (same as accepted answer)
# Returns a dictionary with comment id as key and comment string as value
def get_document_comments(docxFileName):
    comments_dict = {}
    with zipfile.ZipFile(docxFileName) as docxZip:
        commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment', namespaces=ooXMLns)
    for c in comments:
        comment = c.xpath('string(.)', namespaces=ooXMLns)
        comment_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
        comments_dict[comment_id] = comment
    return comments_dict

# Function to fetch all the comments in a paragraph
def paragraph_comments(paragraph, comments_dict):
    comments = []
    for run in paragraph.runs:
        # python-docx elements already know the w: prefix in xpath()
        comment_reference = run._r.xpath("./w:commentReference")
        if comment_reference:
            # read w:id via its fully qualified name; this works on any lxml element
            comment_id = comment_reference[0].get('{%s}id' % ooXMLns['w'])
            comment = comments_dict[comment_id]
            comments.append(comment)
    return comments

# Function to fetch all comments with their referenced paragraph
# This will return a list like this: [{'Paragraph text': [comment 1, comment 2]}]
def comments_with_reference_paragraph(docxFileName):
    document = Document(docxFileName)
    comments_dict = get_document_comments(docxFileName)
    comments_with_their_reference_paragraph = []
    for paragraph in document.paragraphs:
        if comments_dict:
            comments = paragraph_comments(paragraph, comments_dict)
            if comments:
                comments_with_their_reference_paragraph.append({paragraph.text: comments})
    return comments_with_their_reference_paragraph

if __name__ == "__main__":
    document = "test.docx"  # filepath for the input document
    print(comments_with_reference_paragraph(document))


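The output for the sample document looks like this.

I have done this at the paragraph level. It could also be done at the python-docx run level. Hope this helps.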
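For a finer-grained result than whole paragraphs, here is a minimal sketch that collects only the text each comment is anchored to. It is a different technique from the python-docx code in this answer: it uses only the standard library and assumes the file marks each anchor with w:commentRangeStart and w:commentRangeEnd elements in word/document.xml, which is how Word writes them (files exported from other tools may differ). The names commented_text and w are hypothetical helpers introduced for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def w(name):
    """Return a w:-namespaced name in Clark notation, e.g. '{...}id'."""
    return '{%s}%s' % (W_NS, name)

def commented_text(docx):
    """Map comment id -> document text between its range start and end markers.

    Assumption: anchors are marked with w:commentRangeStart / w:commentRangeEnd.
    """
    with zipfile.ZipFile(docx) as z:
        root = ET.fromstring(z.read('word/document.xml'))
    open_ids = set()  # ids of comments whose range is currently open
    anchored = {}     # comment id -> list of text fragments
    for el in root.iter():  # iter() walks the tree in document order
        if el.tag == w('commentRangeStart'):
            cid = el.get(w('id'))
            open_ids.add(cid)
            anchored.setdefault(cid, [])
        elif el.tag == w('commentRangeEnd'):
            open_ids.discard(el.get(w('id')))
        elif el.tag == w('t') and el.text:
            # a text node belongs to every comment range open at this point
            for cid in open_ids:
                anchored[cid].append(el.text)
    return {cid: ''.join(parts) for cid, parts in anchored.items()}
```

The returned ids are strings, so they can be used directly as keys into the dictionary returned by get_document_comments to pair each comment with the exact text it was attached to.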

Answer 2

kjh*_*hes 6

You've gotten remarkably far considering that OOXML is such a complex format.

Here is some sample Python code showing how to access the comments of a DOCX file via XPath:

from lxml import etree
import zipfile

ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def get_comments(docxFileName):
  docxZip = zipfile.ZipFile(docxFileName)
  commentsXML = docxZip.read('word/comments.xml')
  et = etree.XML(commentsXML)
  comments = et.xpath('//w:comment',namespaces=ooXMLns)
  for c in comments:
    # attributes:
    print(c.xpath('@w:author',namespaces=ooXMLns))
    print(c.xpath('@w:date',namespaces=ooXMLns))
    # string value of the comment:
    print(c.xpath('string(.)',namespaces=ooXMLns))
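
If lxml is not available, roughly the same result can be sketched with the standard library's xml.etree.ElementTree. The sketch below is an illustration rather than part of the code above: it assumes the comments live in word/comments.xml, and the names list_comments and w are hypothetical helpers. It reads the namespaced attributes by their fully qualified name, which is the form the attribute keys in the question's output take.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W_NS}

def w(name):
    """Return a w:-namespaced name in Clark notation, e.g. '{...}author'."""
    return '{%s}%s' % (W_NS, name)

def list_comments(docx):
    """Return [(author, date, text), ...] for a .docx path or file-like object."""
    with zipfile.ZipFile(docx) as z:
        if 'word/comments.xml' not in z.namelist():
            return []  # the archive has no comments part
        root = ET.fromstring(z.read('word/comments.xml'))
    rows = []
    for c in root.findall('w:comment', NS):
        # w:t elements hold the visible text; join them across runs
        text = ''.join(t.text or '' for t in c.iter(w('t')))
        rows.append((c.get(w('author')), c.get(w('date')), text))
    return rows
```

Calling list_comments('test.docx') then yields one tuple per comment, which is easy to print or write to a spreadsheet. Note that text from multiple paragraphs in one comment is joined without a separator in this sketch.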


Answer 3

Mai*_*Das 6

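I used the Word Object Model to extract comments and replies from a Word document. Documentation for the Comments object can be found here. That documentation uses Visual Basic for Applications (VBA), but I was able to use the same functions in Python with only minor modifications. The only catch with the Word Object Model is that I had to use the win32com package from pywin32, which works fine on a Windows PC, but I'm not sure whether it works on macOS.

Here is the sample code I used to extract comments and their associated replies: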

    import win32com.client as win32
    from win32com.client import constants

    word = win32.gencache.EnsureDispatch('Word.Application')
    word.Visible = False 
    filepath = r"path\to\file.docx"  # raw string, so \t and \f are not read as escape sequences

    def get_comments(filepath):
        doc = word.Documents.Open(filepath) 
        doc.Activate()
        activeDoc = word.ActiveDocument
        for c in activeDoc.Comments: 
            if c.Ancestor is None: #checking if this is a top-level comment
                print("Comment by: " + c.Author)
                print("Comment text: " + c.Range.Text) #text of the comment
                print("Regarding: " + c.Scope.Text) #text of the original document where the comment is anchored 
                if len(c.Replies)> 0: #if the comment has replies
                    print("Number of replies: " + str(len(c.Replies)))
                    for r in range(1, len(c.Replies)+1):
                        print("Reply by: " + c.Replies(r).Author)
                        print("Reply text: " + c.Replies(r).Range.Text) #text of the reply
        doc.Close()


But this is Windows-only (2 upvotes)

Archived:	8 years, 6 months ago
Viewed:	4972 times
Last activity:	5 years ago