从DOCX中提取表格

Ekz*_*Man 5 c# docx openxml

使用OpenXML(C#)解析*.docx文档时遇到一个问题.

所以,这是我的步骤:
1.加载*.docx文档
2.接收段落列表
3.在每个段落中查找文本,图像和表格元素
4.为每个文本和图像元素创建html标签
5.将输出保存为*. html文件

我已经找到了如何在文档中找到图像文件并将其解压缩.现在还有一步 - 找到文本中的表位置(段落).

如果有人知道如何使用OpenXML在*.docx文档中找到表格请帮忙.谢谢.

附加:好的,可能是我不清楚解释我的意思.如果我们获得段落的内容,您可以找到chield对象作为文本块,图片等.所以,如果段落包含Run包含Picture那就意味着在这个地方放置Word文档放置图像.

我的功能样本:

public static string ParseDocxDocument(string pathToFile)
    {
        StringBuilder result = new StringBuilder();
        WordprocessingDocument wordProcessingDoc = WordprocessingDocument.Open(pathToFile, true);
        List<ImagePart> imgPart = wordProcessingDoc.MainDocumentPart.ImageParts.ToList();
        IEnumerable<Paragraph> paragraphElement = wordProcessingDoc.MainDocumentPart.Document.Descendants<Paragraph>();
        int imgCounter = 0;


        foreach (Paragraph par in paragraphElement)
        {

                //Add new paragraph tag
                result.Append("<div style=\"width:100%; text-align:");

                //Append anchor style
                if (par.ParagraphProperties != null && par.ParagraphProperties.Justification != null)
                    switch (par.ParagraphProperties.Justification.Val.Value)
                    {
                        case JustificationValues.Left:
                            result.Append("left;");
                            break;
                        case JustificationValues.Center:
                            result.Append("center;");
                            break;
                        case JustificationValues.Both:
                            result.Append("justify;");
                            break;
                        case JustificationValues.Right:
                        default:
                            result.Append("right;");
                            break;
                    }
                else
                    result.Append("left;");

                //Append text decoration style
                if (par.ParagraphProperties != null && par.ParagraphProperties.ParagraphMarkRunProperties != null && par.ParagraphProperties.ParagraphMarkRunProperties.HasChildren)
                    foreach (OpenXmlElement chield in par.ParagraphProperties.ParagraphMarkRunProperties.ChildElements)
                    {
                        switch (chield.GetType().Name)
                        {
                            case "Bold":
                                result.Append("font-weight:bold;");
                                break;
                            case "Underline":
                                result.Append("text-decoration:underline;");
                                break;
                            case "Italic":
                                result.Append("font-style:italic;");
                                break;
                            case "FontSize":
                                result.Append("font-size:" + ((FontSize)chield).Val.Value + "px;");
                                break;
                            default: break;
                        }
                    }

                result.Append("\">");

                //Add image tag
                IEnumerable<Run> runs = par.Descendants<Run>();
                foreach (Run run in runs)
                {
                    if (run.HasChildren)
                    {
                        foreach (OpenXmlElement chield in run.ChildElements.Where(o => o.GetType().Name == "Picture"))
                        {
                            result.Append(string.Format("<img style=\"{1}\" src=\"data:image/jpeg;base64,{0}\" />", GetBase64Image(imgPart[imgCounter].GetStream()),
                                           ((DocumentFormat.OpenXml.Vml.Shape)chield.ChildElements.Where(o => o.GetType().Name == "Shape").FirstOrDefault()).Style
                                ));
                            imgCounter++;
                        }
                    }
                }

                //Append inner text
                IEnumerable<Text> textElement = par.Descendants<Text>();
                if (par.Descendants<Text>().Count() == 0)
                    result.Append("<br />");

                foreach (Text t in textElement)
                {
                    result.Append(t.Text);
                }


                result.Append("</div>");
                result.Append(Environment.NewLine);

        }

        wordProcessingDoc.Close();

        return result.ToString();
    }
Run Code Online (Sandbox Code Playgroud)

现在我想在文本中指定表格位置(因为它出现在Word中).

最后:

好的,大家,我发现了.在我的示例函数中有一个大错误.我列举了文件Body的Paragraph元素.表与Paragraph处于同一级别,因此函数忽略表.所以我们需要枚举文档Body的元素.

这是我的测试函数从docx生成正确的HTML(它只是测试代码,所以它不干净)

public static string ParseDocxDocument(string pathToFile)
    {
        StringBuilder result = new StringBuilder();
        WordprocessingDocument wordProcessingDoc = WordprocessingDocument.Open(pathToFile, true);
        List<ImagePart> imgPart = wordProcessingDoc.MainDocumentPart.ImageParts.ToList();
        List<string> tableCellContent = new List<string>();
        IEnumerable<Paragraph> paragraphElement = wordProcessingDoc.MainDocumentPart.Document.Descendants<Paragraph>();
        int imgCounter = 0;

        foreach (OpenXmlElement section in wordProcessingDoc.MainDocumentPart.Document.Body.Elements<OpenXmlElement>())
        {
            if(section.GetType().Name == "Paragraph")
            {
              Paragraph par = (Paragraph)section;
                //Add new paragraph tag
                result.Append("<div style=\"width:100%; text-align:");

                //Append anchor style
                if (par.ParagraphProperties != null && par.ParagraphProperties.Justification != null)
                    switch (par.ParagraphProperties.Justification.Val.Value)
                    {
                        case JustificationValues.Left:
                            result.Append("left;");
                            break;
                        case JustificationValues.Center:
                            result.Append("center;");
                            break;
                        case JustificationValues.Both:
                            result.Append("justify;");
                            break;
                        case JustificationValues.Right:
                        default:
                            result.Append("right;");
                            break;
                    }
                else
                    result.Append("left;");

                //Append text decoration style
                if (par.ParagraphProperties != null && par.ParagraphProperties.ParagraphMarkRunProperties != null && par.ParagraphProperties.ParagraphMarkRunProperties.HasChildren)
                    foreach (OpenXmlElement chield in par.ParagraphProperties.ParagraphMarkRunProperties.ChildElements)
                    {
                        switch (chield.GetType().Name)
                        {
                            case "Bold":
                                result.Append("font-weight:bold;");
                                break;
                            case "Underline":
                                result.Append("text-decoration:underline;");
                                break;
                            case "Italic":
                                result.Append("font-style:italic;");
                                break;
                            case "FontSize":
                                result.Append("font-size:" + ((FontSize)chield).Val.Value + "px;");
                                break;
                            default: break;
                        }
                    }

                result.Append("\">");

                //Add image tag
                IEnumerable<Run> runs = par.Descendants<Run>();
                foreach (Run run in runs)
                {
                    if (run.HasChildren)
                    {
                        foreach (OpenXmlElement chield in run.ChildElements.Where(o => o.GetType().Name == "Picture"))
                        {
                            result.Append(string.Format("<img style=\"{1}\" src=\"data:image/jpeg;base64,{0}\" />", GetBase64Image(imgPart[imgCounter].GetStream()),
                                           ((DocumentFormat.OpenXml.Vml.Shape)chield.ChildElements.Where(o => o.GetType().Name == "Shape").FirstOrDefault()).Style
                                ));
                            imgCounter++;
                        }
                        foreach (OpenXmlElement table in run.ChildElements.Where(o => o.GetType().Name == "Table"))
                        {
                            result.Append("<strong>HERE'S TABLE</strong>");
                        }
                    }
                }

                //Append inner text
                IEnumerable<Text> textElement = par.Descendants<Text>();
                if (par.Descendants<Text>().Count() == 0)
                    result.Append("<br />");

                foreach (Text t in textElement.Where(o=>!tableCellContent.Contains(o.Text.Trim())))
                {
                    result.Append(t.Text);
                }


                result.Append("</div>");
                result.Append(Environment.NewLine);

            }
            else if (section.GetType().Name=="Table")
            {
                result.Append("<table>");
                Table tab = (Table)section;
                foreach (TableRow row in tab.Descendants<TableRow>())
                {
                    result.Append("<tr>");
                    foreach (TableCell cell in row.Descendants<TableCell>())
                    {
                        result.Append("<td>");
                        result.Append(cell.InnerText);
                        tableCellContent.Add(cell.InnerText.Trim());
                        result.Append("</td>");
                    }
                    result.Append("</tr>");
                }
                result.Append("</table>");
            }                
        }


        wordProcessingDoc.Close();

        return result.ToString();
    }

    private static string GetBase64Image(Stream inputData)
    {
        byte[] data = new byte[inputData.Length];
        inputData.Read(data, 0, data.Length);
        return Convert.ToBase64String(data);
    }
Run Code Online (Sandbox Code Playgroud)

mrd*_*mrd 2

尝试按照以下步骤查找文档中的第一个表格。

Table table = doc.MainDocumentPart.Document.Body.Elements<Table>().First();
Run Code Online (Sandbox Code Playgroud)