The simplest way to remove html/xml <tags> from a single line of output

use*_*360 3 html xml sed

I have output from grep that I am trying to clean up. It looks like this:

<words>Http://www.path.com/words</words>

I tried using...

sed 's/<.*>//' 

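...to remove the tags, but that just destroys the whole line. I'm not sure why this happens, since every "<" is closed by a ">" before the content is reached.

What is the simplest way to do this?

Thanks!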

小智 8

Try this sed expression:

sed 's/<.*>\(.*\)<\/.*>/\1/'

A quick breakdown of the expression:

<.*>   - Match the first tag
\(.*\) - Match and save the text between the tags   
<\/.*> - Match the end tag making sure to escape the / character  
\1     - Output the result of the first saved match 
       -   (the text that is matched between \( and \))
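
To see why the original attempt wipes the line, it may help to run the expressions side by side. The sketch below uses the sample line from the question; the variable names are only illustrative, and the third expression (`<[^>]*>` with the `g` flag) is a different technique from the one in this answer, shown as an assumption about what "simplest" could mean here.

```shell
# Sample line from the question.
line='<words>Http://www.path.com/words</words>'

# Greedy: <.*> stretches from the first '<' to the last '>',
# so the whole line is matched and deleted.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//')

# The expression from this answer: keep only the text between the two tags.
between=$(printf '%s\n' "$line" | sed 's/<.*>\(.*\)<\/.*>/\1/')

# Alternative: [^>]* cannot cross a '>', so each tag is matched on its own,
# and the g flag removes every tag on the line.
stripped=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf 'greedy:   [%s]\n' "$greedy"
printf 'between:  [%s]\n' "$between"
printf 'stripped: [%s]\n' "$stripped"
```

On this single-element line the last two expressions leave the same text. They can differ on lines with more than one element, because the `.*` pieces in the answer's expression are greedy as well.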

More on back-references

A question came up in the comments that should probably be addressed for completeness.

The \( and \) are sed's back-reference markers. They save part of the matched expression for later use.
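
As a minimal sketch of that saving behaviour, the following swaps two words by capturing each in its own group and emitting them in reverse order. The input string and variable name are made up for illustration.

```shell
# Each \( \) pair saves what it matched: \1 is "hello", \2 is "world".
swapped=$(printf '%s\n' 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')

printf '%s\n' "$swapped"
```

The groups are numbered by the order of their opening \( in the pattern.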

For example, suppose we have the input string:

There are (parens) here. Also, we can use back-references with parenslike thisparens.

We develop an expression:

sed 's/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/'

This gives us:

parens like this

How does this work? Let's break down the expression to find out.

Expression breakdown:

sed s/ - This is the opening tag to a sed expression.
.*     - Match any character to start (as well as nothing).
(      - Match a literal left parenthesis character.
\(.*\) - Match any characters and save them as a back-reference. In this sample it captures the text between the literal parentheses, `parens`.
)      - Match a literal right parenthesis character.
.*     - Same as above.
\1     - Match the first saved back-reference. In the case of our sample this is filled in with `parens`
\(.*\) - Same as above.
\1     - Same as above.
/      - End of the match expression. Signals transition to the output expression.
\1 \2  - Print our two back-references.
/      - End of output expression.

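As we can see, the text captured between the \( and \) markers is substituted back into the match expression as \1, which lets the expression match the string parens later in the line.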
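
The walkthrough can be checked with a short sketch. It assumes the intended expression has a plain `\(.*\)` as its second group (matching the breakdown) and is single-quoted so the shell does not interpret the parentheses. The exact wording of the input is also an assumption; only the `(parens)` and `parenslike thisparens` pieces matter.

```shell
# Sample input (wording assumed; the parens pieces are what the pattern needs).
input='There are (parens) here. Also, we can use back-references with parenslike thisparens.'

# \1 captures the text inside the literal parentheses. It is then reused inside
# the pattern to locate two later copies of that text; \2 is what lies between them.
result=$(printf '%s\n' "$input" | sed 's/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/')

printf '%s\n' "$result"
```

Because the pattern starts and ends with `.*`, the whole line is replaced by the two saved pieces.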