In a Linux shell or awk, how do I replace a URL in a line with its domain?

Cau*_*ity 1 regex url bash shell awk

For example, given this input:

line1 col1-1 http://www.google.com/index.html col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://user:pwd@www.facebook.com/pp/index.html col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8

the result should be:

line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8

Is it possible to do this with an awk one-liner (a sub-regex, perhaps)? If not, how would you implement it in bash?

Ste*_*eve 5

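I think you would be better off using a URL parser. Python, for example, has urlparse, which splits a URL into its components. Here is some sample code. Run it like this: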

python3 script.py file

Contents of script.py:

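import csv
import sys
from urllib.parse import urlparse

# Read the space-delimited file named on the command line.
with open(sys.argv[1], 'r') as csvfile:
    r = csv.reader(csvfile, delimiter=' ')
    for row in r:
        # Only touch the third column when it exists (skips blank or short lines).
        if len(row) > 2:
            url = urlparse(row[2])
            # hostname drops the path and any user:pwd@ credentials.
            if url.scheme and url.hostname:
                row[2] = url.scheme + "://" + url.hostname
        print(' '.join(row))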

Result:

line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
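
To see why the script reads `hostname` rather than `netloc`, here is a minimal sketch that runs `urlparse` on the three sample values from the question. The helper name `scheme_and_host` is hypothetical and only for illustration; it is not part of the original script.

```python
from urllib.parse import urlparse


def scheme_and_host(field):
    """Return 'scheme://hostname' for a URL, or the field unchanged otherwise.

    Hypothetical helper, mirroring the check used in script.py.
    """
    parts = urlparse(field)
    # hostname excludes credentials and port; netloc would keep "user:pwd@".
    if parts.scheme and parts.hostname:
        return f"{parts.scheme}://{parts.hostname}"
    return field


if __name__ == "__main__":
    samples = (
        "http://www.google.com/index.html",
        "https://user:pwd@www.facebook.com/pp/index.html",
        "badColumn",
    )
    for sample in samples:
        parts = urlparse(sample)
        # Show the raw netloc next to the cleaned-up result for comparison.
        print(repr(parts.netloc), "->", scheme_and_host(sample))
```

For the second sample, `netloc` is `user:pwd@www.facebook.com` while `hostname` is `www.facebook.com`. A value such as `badColumn` has no scheme or hostname, so it passes through unchanged. Note that `hostname` also drops any port number and lowercases the host, which may or may not be what you want.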