Error when getting a string between two patterns

Question

I want to extract the string between two patterns. The pattern is the first environment (block) in an HTML file:

<p>Sorcery, 
          R (1)
          </p>
        <p class="ctext"><b>As an additional cost to cast Goblin Grenade, sacrifice a Goblin.<br><br>Goblin Grenade deals 5 damage to target creature or player.</b></p>


      <p><i>Don't underestimate the aerodynamic qualities of the common goblin.</i></p>
      <p>Illus. Kev Walker</p>


This environment is the first one in the file, so I discard everything up to the match, and I want to remove the surrounding tag.

name="goblin grenade"
wget -O- http://magiccards.info/query?q="$name" | grep -oP '<p>\K[^<]+'


I don't know why it doesn't work properly. I get:

Sorcery, 
Illus. Kev Walker


Answer 1

Gil*_*not 5

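Don't parse HTML with regular expressions; use a proper HTML parser instead.

Theory:

According to compiler theory, HTML cannot be parsed with regular expressions based on finite state machines. Because HTML is hierarchical, you need a pushdown automaton and an LALR grammar, handled with a tool such as YACC.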
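
A small illustration of the nesting problem, as a sketch rather than a proof. It assumes GNU grep built with PCRE support (`-P`), and the nested `<div>` input is made up for the example. A plain, non-recursive pattern has no notion of depth, so it pairs the outer opening tag with the inner closing tag:

```shell
#!/bin/sh
# Hypothetical nested input: one <div> inside another.
html='<div>a<div>b</div>c</div>'

# A lazy match stops at the first closing tag it sees, regardless of depth.
match=$(printf '%s\n' "$html" | grep -oP '<div>.*?</div>')

printf '%s\n' "$match"
# prints: <div>a<div>b</div>
```

The outer element is cut off at the inner `</div>`. Keeping track of that depth is exactly what a real parser does.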

realLife©®™ everyday tools:

Instead, you should use the right tool for the right job.

...and this is a job for xmllint:

By string matching:

string="Sorcery"
xmllint --html --xpath "//p[contains(text(), '$string')]/text()" file_or_URL


By Nth node, where N is 1:

xmllint --html --xpath "//p[1]/text()" file_or_URL
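
To try both queries offline, here is a minimal sketch. It assumes `xmllint` from libxml2 with `--xpath` support is installed. The helper name `p_text`, the temporary file, and the whitespace cleanup with `tr` and `sed` are additions for illustration, not part of the commands above. It uses `(//p)[1]` so that only the first `<p>` of the whole document is selected, since `//p[1]` would also match the first `<p>` under every other parent.

```shell
#!/bin/sh
# p_text FILE XPATH: print the selected text nodes with whitespace collapsed.
# (Hypothetical helper; the name and the cleanup step are assumptions.)
p_text() {
    xmllint --html --xpath "$2" "$1" 2>/dev/null |
        tr -s '[:space:]' ' ' |
        sed 's/^ //; s/ $//'
}

# Save a trimmed copy of the snippet from the question to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><body>
<p>Sorcery, 
          R (1)
          </p>
<p class="ctext"><b>Goblin Grenade deals 5 damage to target creature or player.</b></p>
<p>Illus. Kev Walker</p>
</body></html>
EOF

# By string match, then by position (first <p> in the document).
by_string=$(p_text "$sample" "//p[contains(text(), 'Sorcery')]/text()")
by_index=$(p_text "$sample" "(//p)[1]/text()")

printf '%s\n' "$by_string" "$by_index"
rm -f "$sample"
```

With the sample above, both queries should select the same paragraph, giving the card type and cost on a single line.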


See /sf/ask/121264391/

Archived:	10 years, 8 months ago
Views:	238
Last recorded:	10 years, 8 months ago
