For example, given the input:
line1 col1-1 http://www.google.com/index.html col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://user:pwd@www.facebook.com/pp/index.html col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
the expected result should be:
line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
Is it possible to do this with an awk one-liner (sub and a regex?)? If not, how would you implement it in bash?
I think it is better to use a URL parser. For example, Python has urlparse, which splits a URL into its components. Here is some example code; run it as follows:
python3 script.py file
Contents of script.py:
import sys
import csv
from urllib.parse import urlparse

with open(sys.argv[1], 'r') as csvfile:
    # Treat the input file as space-delimited columns
    r = csv.reader(csvfile, delimiter=' ')
    for row in r:
        # Parse the third column as a URL
        url = urlparse(row[2])
        # Rewrite the column only if it actually looks like a URL
        if url.scheme and url.hostname:
            row[2] = url.scheme + "://" + url.hostname
        print(' '.join(row))
Result:
line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
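
As a side note (not part of the script above), a minimal interactive sketch shows why this handles all three example lines: urlparse exposes scheme and hostname separately, hostname omits any user:password prefix, and a value without a scheme yields an empty scheme, so that row is printed unchanged:

from urllib.parse import urlparse

# URL with embedded credentials: hostname drops the user:pwd@ part and the path
u = urlparse("https://user:pwd@www.facebook.com/pp/index.html")
print(u.scheme)    # https
print(u.hostname)  # www.facebook.com

# Non-URL column: empty scheme and no hostname, so the if-branch is skipped
print(urlparse("badColumn").scheme)    # '' (empty string)
print(urlparse("badColumn").hostname)  # None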