Parsing HTML with CURL in a shell script

Question

I am trying to parse specific content from a web page in a shell script.

I need to grep the content inside the <div> tag.

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>

If I use grep -E -m 1 -o '<div class="tracklistInfo">', the result is only <div class="tracklistInfo">.

How can I access the artist (Diplo - Justin Bieber - Skrillex) and the title (Where Are U Now)?

Answer 1

Cas*_*yte 5

Using xmllint:

a='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>'

# The trailing "-" tells xmllint to read the document from standard input.
xmllint --html --xpath 'concat(//div[@class="tracklistInfo"]/p[1]/text(), "#", //div[@class="tracklistInfo"]/p[2]/text())' - <<<"$a"

You get:

Diplo - Justin Bieber - Skrillex#Where Are U Now

This can then be split easily on the # separator.
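A minimal sketch of that split, using POSIX parameter expansion. The xmllint output is hard-coded here so the snippet runs on its own, and the variable names result, artist and title are only illustrative:

```shell
# Output of the xmllint command, hard-coded so this runs without xmllint.
result='Diplo - Justin Bieber - Skrillex#Where Are U Now'

# Everything before the first "#" is the artist.
artist=${result%%#*}
# Everything after the first "#" is the title.
title=${result#*#}

printf 'Artist: %s\n' "$artist"
printf 'Title: %s\n' "$title"
```

This assumes the artist text itself never contains a #. If it might, a separator that cannot occur in the page (a tab, for instance) would be a safer choice in the concat() call.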

Answer 2

Mar*_*oij 1

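Don't. Use an HTML parser. For example, BeautifulSoup for Python is easy to use and can do this very easily.

That being said, remember that grep works on lines. The pattern is matched against each line separately, not against the whole string.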
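To illustrate the line-by-line behaviour, here is a small sketch that feeds the snippet from the question to grep. The variable names html and matches are just for this example:

```shell
html='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>'

# grep -c counts matching lines; only the opening <div> line contains the class name.
matches=$(printf '%s\n' "$html" | grep -c 'tracklistInfo')
echo "lines matching: $matches"

# A pattern that would have to span the <div> line and the following <p> line never matches,
# because each line is tested on its own.
if printf '%s\n' "$html" | grep -q 'tracklistInfo">.*<p'; then
    echo "matched across lines"
else
    echo "no cross-line match"
fi
```

This is why the -o attempt in the question only ever returns the <div> tag itself: the text of interest sits on later lines that the pattern never sees together with the tag.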

What you can use is -A, which prints lines after the match:

grep -A2 -E -m 1 '<div class="tracklistInfo">'

This should output:

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>

You can then pipe this to tail (and head) to get the last or second-to-last line:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1
<p>Where Are U Now</p>

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n2 | head -n1
<p class="artist">Diplo - Justin Bieber - Skrillex</p>

And strip the HTML tags with sed:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1 | sed 's/<[^>]*>//g'
Where Are U Now

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n2 | head -n1 | sed 's/<[^>]*>//g'
Diplo - Justin Bieber - Skrillex

But as said, this is fragile, may break, and isn't very pretty. For comparison, here is the same thing with BeautifulSoup:

html = '''<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for track in soup.find_all(class_='tracklistInfo'):
    print(track.find_all('p')[0].text)
    print(track.find_all('p')[1].text)

This also works with multiple tracklistInfo blocks; adding that to the shell command would take more work ;-)
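As a rough idea of what that extra shell work could look like, one option is awk, which can remember whether it is currently inside a tracklistInfo block. This is only a sketch: the second track in the sample input is made up for illustration, and like the grep approach it assumes every tag sits on its own line and that the block contains no nested <div>.

```shell
html='<body>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
<div class="tracklistInfo">
<p class="artist">Example Artist</p>
<p>Example Title</p>
</div>
</body>'

tracks=$(printf '%s\n' "$html" | awk '
  # Opening tag of a track block: start collecting and reset the field counter.
  /<div class="tracklistInfo">/ { inside = 1; n = 0; next }
  # Closing tag: print the two collected fields joined by "#".
  inside && /<\/div>/ { inside = 0; print field[1] "#" field[2]; next }
  # Inside a block: strip the tags and remember the remaining text.
  inside { gsub(/<[^>]*>/, ""); field[++n] = $0 }
')

printf '%s\n' "$tracks"
```

Each output line holds one track as artist#title. It remains as brittle as any line-based approach, so a real parser is still the better choice for pages you do not control.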

Beyond "this sucks" and "nothing works", I can't offer any meaningful input ;-) (4 upvotes)

Archived:	9 years, 8 months ago
Views:	9556
Last activity:	5 years, 3 months ago