Using sed to extract a URL from text containing multiple URLs

Question

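I have a string that contains text and several URLs. How can I use sed to extract one specific URL (the one on a particular domain)? For example, given this: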

Text foo bar Text foo bar <br /><br /> http://www.this.file <br />http://another.file <br />http://mine.com/this.html <br />http://myURL.net/files/IWANTthis <br />http://www.google.com/thisnot

sed should return: http://myURL.net/files/IWANTthis

Answer 1

r0b*_*rts 10

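Using sed for this can get awkward in edge cases. As is suggested in many places, it is better not to use regular expressions on HTML and to use an HTML parser instead. One readily available parser is built into the text-mode browser lynx (packaged for most Linux distributions). You then only need grep to pick out the URL you want.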

lynx -dump -listonly myhtmlfile.html | grep IWANTthis | sort -u

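
Because the question asks for a URL on a specific domain, the keyword filter can be tightened to match the host instead of an arbitrary substring. The following is a minimal sketch, not part of the original answer: the helper name `filter_domain` is hypothetical, and the sample list only imitates what a numbered link list might look like, so treat the input format as an assumption.

```shell
# filter_domain is a hypothetical helper: keep only lines containing
# "://DOMAIN/", where DOMAIN is passed as $1.
# Limitation: URLs with a port, or with no path after the host, are skipped.
filter_domain() {
    # -F treats the pattern as a fixed string, so dots in the domain are literal.
    grep -F "://$1/" | sort -u
}

# Simulated link list (assumed format, standing in for real lynx output).
links='   1. http://www.this.file/
   2. http://mine.com/this.html
   3. http://myURL.net/files/IWANTthis
   4. http://www.google.com/thisnot'

filtered=$(printf '%s\n' "$links" | filter_domain 'myURL.net')
printf '%s\n' "$filtered"
```

With real input, the same helper could be placed after `lynx -dump -listonly myhtmlfile.html` in place of `grep IWANTthis`.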

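However, this does not work for broken HTML files (ones that cannot be parsed properly) or for text fragments that merely contain links. Another simple approach is to chain commands in a pipeline. If you have a text fragment like yours in a file named st3.txt, you can do the following: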

grep http ./st3.txt | sed 's/http/\nhttp/g' | grep ^http | sed 's/\(^http[^ <]*\)\(.*\)/\1/g' | grep IWANTthis | sort -u

Explanation:

grep http ./st3.txt      => will catch lines with http from text file
sed 's/http/\nhttp/g'    => will insert newline before each http
grep ^http               => will take only lines starting with http
sed 's/\(^http[^ <]*\)\(.*\)/\1/g'   
                         => will preserve string from ^http until first space or <
grep IWANTthis           => will take only urls containing your text of interest
sort -u                  => will sort and remove duplicates from your list

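
For the single-line sample in the question, the same idea can also be sketched as one sed command, which is closer to what was originally asked. This is an illustration under stated assumptions rather than the answer's own method: the helper name `extract_with_sed` and the variable names are hypothetical, and the pattern returns at most one URL per input line (the last one on the requested domain).

```shell
# Sample text from the question (the variable name is illustrative).
text='Text foo bar Text foo bar <br /><br /> http://www.this.file <br />http://another.file <br />http://mine.com/this.html <br />http://myURL.net/files/IWANTthis <br />http://www.google.com/thisnot'

# extract_with_sed is a hypothetical helper. $1 is the domain written as a
# basic regular expression, so dots should be escaped (e.g. myURL\.net).
extract_with_sed() {
    # -n with the p flag prints only lines where the substitution succeeded.
    # '#' is the delimiter, so the slashes in the URL need no escaping.
    # The URL is taken to end at the first space or '<', as in the pipeline above.
    sed -n "s#.*\(https\{0,1\}://$1/[^ <]*\).*#\1#p"
}

url=$(printf '%s\n' "$text" | extract_with_sed 'myURL\.net')
printf '%s\n' "$url"
```

Because the leading `.*` is greedy, a line holding several URLs on the same domain yields only the last of them; the multi-step pipeline is the safer choice when every match is needed.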
