Mixed content and string manipulation cleanup

Ten*_*nch 5 xml xslt xslt-2.0

I'm in the very painful process of converting Word-based documents to XML. I've run into the following problem:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a
            quote</hi>?” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a
            quote</hi>” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is
            definitely a quote</hi>!” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a
            first quote</hi>” (Source). „<hi rend="italics">Sometimes there is a second quote as
            well</hi>!?” (Source). </p>

</root>

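The <p> nodes have mixed content. I've already taken care of <element> in a previous iteration. The problem now is with the quotes and sources, which appear partly inside <hi rend="italics"/> and partly in text nodes.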

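How can I use XSLT 2.0 to:

  1. match all <hi rend="italics"> nodes that are immediately preceded by a text node whose last character is „?
  2. output the contents of <hi rend="italics"> as <quote>...</quote>, dropping the quotation marks („ and ”) but including inside <quote/> any question marks and exclamation marks that appear in the immediately following sibling of <hi rend="italics">?
  3. convert the text between "(" and ")" in the text node following the <hi rend="italics"> node into <source>...</source>, without the brackets?
  4. include the final full stop?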

In other words, my output should look like this:

<root>
<p>
<element>This one is taken care of.</element> Some more text. <quote>Is this a quote?</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a quote</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is definitely a quote!</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as well!?</quote> <source>Source</source>. 
</p>

</root>

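I've never dealt with mixed content and string manipulation like this before, and the whole thing is really throwing me off. I'd be very grateful for any tips.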

Sea*_*kin 1

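Here is an alternative solution. It allows for a more narrative-style input document (quotes within quotes, multiple (source) fragments within one text node, '„' treated as data when it is not followed by a hi element).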

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:so="http://stackoverflow.com/questions/12690177"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xsl xs so">
<xsl:output omit-xml-declaration="yes" indent="yes" />
<xsl:strip-space elements="*" />

<!-- Copy attributes, comments and processing instructions unchanged. -->
<xsl:template match="@*|comment()|processing-instruction()">
  <xsl:copy />
</xsl:template>

<!-- Copy elements and recurse into their attributes and children. -->
<xsl:template match="*">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
</xsl:template>

<!-- Drop the last character, i.e. the opening „ that ends the text node. -->
<xsl:function name="so:clip-start" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring($in-text,1,string-length($in-text)-1)" />
</xsl:function>

<!-- Keep only what follows the first closing ”. -->
<xsl:function name="so:clip-end" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring-after($in-text,'”')" />
</xsl:function>

<!-- True when the text ends with „ and an italic hi follows as a sibling. -->
<xsl:function name="so:matches-start" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/following-sibling::node()/self::hi[@rend='italics'] and
                        ends-with($text-node, '„')" />
</xsl:function>

<xsl:template match="text()[so:matches-start(.)]" priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-start(.)" />
  </xsl:call-template>
</xsl:template>

<!-- True when an italic hi precedes as a sibling and the text starts with
     optional ! or ? characters followed by ”. -->
<xsl:function name="so:matches-end" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/preceding-sibling::node()/self::hi[@rend='italics'] and
                        matches($text-node,'^[!?]*”')" />
</xsl:function>

<xsl:template match="text()[so:matches-end(.)]" priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(.)" />
  </xsl:call-template>
</xsl:template>

<!-- Text node that both closes one quote and opens the next: clip both ends. -->
<xsl:template match="text()[so:matches-start(.)][so:matches-end(.)]" priority="3">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(so:clip-start(.))" />
  </xsl:call-template>
</xsl:template>

<!-- Default text handling: wrap each parenthesised run in <source>,
     dropping the brackets, and copy everything else through. -->
<xsl:template match="text()" name="parse-text" priority="1">
  <xsl:param name="text" select="." />
  <xsl:analyze-string select="$text" regex="\(([^)]*)\)">
    <xsl:matching-substring>
      <source>
        <xsl:value-of select="regex-group(1)" />
      </source>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

<!-- Turn the italic run into <quote>, pulling in any leading ! or ?
     from the text node that immediately follows. -->
<xsl:template match="hi[@rend='italics']">
  <quote>
    <xsl:apply-templates select="(@* except @rend) | node()" />
    <xsl:for-each select="following-sibling::node()[1]/self::text()[matches(.,'^[!?]')]">
      <xsl:value-of select="replace(., '^([!?]+).*$', '$1')" />
    </xsl:for-each>
  </quote>
</xsl:template>

</xsl:stylesheet>
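If the (Source) handling is the part that is hard to follow, it may help to look at the xsl:analyze-string step on its own. The following is only a minimal sketch, not part of the solution above: the template name main and the sample string are made up for illustration, and the regex is the same one the parse-text template uses.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" />

  <!-- Hypothetical entry point; needs no source document. -->
  <xsl:template name="main">
    <!-- Made-up sample text with two parenthesised fragments. -->
    <xsl:variable name="text"
                  select="'Some text (First). More text (Second).'" />
    <p>
      <xsl:analyze-string select="$text" regex="\(([^)]*)\)">
        <!-- Each "(...)" run becomes <source>, without the brackets. -->
        <xsl:matching-substring>
          <source><xsl:value-of select="regex-group(1)" /></source>
        </xsl:matching-substring>
        <!-- Everything between the matches is copied through as-is. -->
        <xsl:non-matching-substring>
          <xsl:value-of select="." />
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

Run with main as the initial template in an XSLT 2.0 processor, this should give <p>Some text <source>First</source>. More text <source>Second</source>.</p>. Because the regex is applied to the whole string, several (source) fragments inside one text node are all picked up, which is what the main stylesheet relies on.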