Removing tags from XML files

Var*_*wda 1 xml text-processing

My file contains data that cannot be clearly identified. It looks something like this:

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namspace/Service/1.0"><Record>
.
.</Record></ns0:collection>

I have to merge N such files into a single file, so I need to do the following:

  1. Remove only the closing tag </ns0:collection> from the first file.
  2. Remove both <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> and </ns0:collection> from the next (n-1) files.
  3. Remove only <?xml version="1.0" encoding="UTF-8" ?><ns0:collection xmlns:ns0="http://namspace/Service/1.0"> from the last file, then merge them all together.

I tried using sed on the first file, but it produced no output: "merged.xml" is empty.

sed '/<\/ns0:collection>/d' $file1 > merged.xml

Any suggestions?

Eri*_*ikF 5

You didn't say that you can only use sed, so if you have access to xml_grep (see the second answer to "Merge multiple XML files from command line"), I would recommend it: it does most of the heavy lifting for you, and a simple merge job like this takes a single command:

xml_grep --cond Record --wrap "ns0:collection" --descr 'xmlns:ns0="http://namespace/Service/1.0"' --encoding "UTF-8" *.xml

Test files:

test.xml

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namespace/Service/1.0"><Record>
Test
</Record></ns0:collection>

test1.xml

<?xml version="1.0" encoding="UTF-8" ?><ns0:collection
xmlns:ns0="http://namespace/Service/1.0"><Record>
Test 1<a>a</a><b c="c">d</b>
</Record></ns0:collection>

Result

<?xml version="1.0" encoding="UTF-8" ?>
<ns0:collection xmlns:ns0="http://namespace/Service/1.0">
<Record>
Test 1<a>a</a><b c="c">d</b></Record><Record>
Test
</Record>
</ns0:collection>

I prefer to use XML-aware tools when dealing with XML files, because the chance of messing up the structure with sed and friends is quite high and you can easily end up with a malformed XML document!
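
If xml_grep is not available, a text-based fallback is possible, with all the caveats above. One likely reason the sed attempt in the question came out empty is that the d command deletes the whole matching line, so if the closing tag shares a line with the data (or the file is one long line), the data is deleted too. The sketch below uses s/// to cut out only the tags instead. It is a rough sketch, not a robust solution: the function names strip_wrapper and merge_collections and the NS_URI variable are made up for this example, and it assumes each file contains exactly one XML declaration and one ns0:collection wrapper, with no comments or CDATA sections that contain those strings.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not from xml_grep or sed.

# Namespace URI for the new wrapper. The default is the one used in the
# test files above; override it if your files use a different URI.
NS_URI=${NS_URI:-http://namespace/Service/1.0}

# Print FILE without its XML declaration and its ns0:collection wrapper tags.
strip_wrapper() {
    # Slurp the whole file into the pattern space first, because the opening
    # tag may span two lines; then remove the three pieces with s///, which
    # leaves the rest of each line intact (unlike d).
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
        -e 's/<?xml[^>]*?>//' \
        -e 's/<ns0:collection[^>]*>//' \
        -e 's/<\/ns0:collection>//' "$1"
}

# Print one merged document: a fresh declaration and wrapper around the
# stripped contents of every file given as an argument.
merge_collections() {
    printf '%s\n' '<?xml version="1.0" encoding="UTF-8" ?>'
    printf '<ns0:collection xmlns:ns0="%s">\n' "$NS_URI"
    for f in "$@"; do
        strip_wrapper "$f"
    done
    printf '%s\n' '</ns0:collection>'
}
```

With these functions defined, something like merge_collections test.xml test1.xml > merged.xml should produce a single wrapped document. Writing a fresh wrapper around every stripped file also avoids treating the first, middle, and last files differently. It is still worth checking the result with an XML-aware tool afterwards.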