tar*_*kan 1 regex linux awk text-processing
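I am going through some reviews and trying to determine the best company to buy apples from (as an example). I copied and pasted the text below, and I would like to do some text processing on it with Linux commands. From what I have read online, awk is a good fit, but I cannot get it to work.
I am trying to take each line that holds a rating and append it to the line above, separated by a comma. For example, Abes Apples\n 4.1 would become Abes Apples, 4.1, and this would repeat for every record. The awk command I tested is awk 'BEGIN {RS=""}{gsub(/\n[0-9]/, ", ", $0); print $0}' test.text. It gives the result below, but it also replaces the leading digit: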
Abes Apples, .1,
(138) · apple company, + years in business (123) 456-7890
Adams Apples, .9,
(105) · apple company, 0+ years in business (234) 567-8901
Apples are Amazing, .9,
(13) apple company, 0+ years in business (345) 678-9012

The text file follows the pattern shown below, repeated for every record in the file:
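My goal is to convert this text file into a CSV-like format with column headers for company name, rating, number of reviews (ignoring the "apple company" text), years in business, and phone number. Is this something that can be done with awk and other Linux commands?
Current input: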
Abes Apples
4.1,
(138) · apple company
7+ years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10+ years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10+ years in business (345) 678-9012

Desired output:
Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012
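With GNU awk, using RS in paragraph mode, you can try the following awk code. It was written and tested only against the samples shown. It uses GNU awk's match function with the regex (^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\+?[^(]*([^\n]*) (explained further down in this answer); match fills an array named arr with elements 1, 2, 3 and so on, one per capturing group in the regex.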
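awk -v RS= -v OFS=", " '
{
  # Consume the record one match at a time; arr[n] holds the nth capturing group.
  while (match($0, /(^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\+?[^(]*([^\n]*)/, arr)) {
    # arr[3] already ends with a comma, so it is joined to arr[5] without OFS.
    print arr[2], arr[3] arr[5], arr[6], arr[7]
    # Drop everything up to the end of this match, then search again.
    $0 = substr($0, RSTART + RLENGTH)
  }
}
' Input_file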
The output is as follows:
Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9,(105), 10, (234) 567-8901
Apples are Amazing, 3.9,(13), 10, (345) 678-9012
Explanation: a detailed breakdown of the regex used above.
(^|\n)                  ## 1st capturing group: either the start of the value or a newline.
([^\n]*)                ## 2nd capturing group: everything up to the next newline.
\n                      ## A newline.
([0-9]+(\.[0-9]+)?,)    ## 3rd capturing group: one or more digits, then an optional 4th capturing group (a dot followed by one or more digits), then a comma.
\n                      ## A newline.
(\([0-9]+\))            ## 5th capturing group: ( followed by digits followed by ).
[^\n]*\n                ## Everything up to the next newline, then the newline itself.
([0-9]+)                ## 6th capturing group: one or more digits.
\+?[^(]*                ## An optional literal +, then everything up to the next (.
([^\n]*)                ## 7th capturing group: everything up to the next newline.
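To see how the groups described above map onto real data, the following sketch runs the same regex against the first record from the question and prints each element of arr. It assumes GNU awk is installed under the name gawk, since the three-argument form of match() is a GNU extension; the variable names record and captured are illustrative only.

```shell
# Sample record copied from the question.
record='Abes Apples
4.1,
(138) · apple company
7+ years in business (123) 456-7890'

captured=""
if command -v gawk >/dev/null 2>&1; then
    # Three-argument match() is a GNU awk extension, hence gawk.
    captured=$(printf '%s\n' "$record" | gawk -v RS= '
    {
        # On success, arr[1] through arr[7] hold the seven capturing groups.
        if (match($0, /(^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\+?[^(]*([^\n]*)/, arr))
            for (i = 1; i <= 7; i++)
                printf "arr[%d] = [%s]\n", i, arr[i]
    }')
    printf '%s\n' "$captured"
else
    echo "gawk not found; skipping the demonstration" >&2
fi
```

For this record, arr[1] is empty because the match begins at the start of the record, arr[3] keeps its trailing comma (4.1,), and arr[4] holds only the fractional part (.1), which is why the print statement in the answer uses arr[2], arr[3], arr[5], arr[6] and arr[7].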
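Where only a POSIX awk is available, a line-oriented variant is one possible alternative to the regex approach. This is a different technique from the answer above, and it is only a sketch: it assumes (the question does not guarantee this) that every record is exactly four lines in the order name, rating, review count, then years plus phone, with no blank lines or trailing whitespace. The temporary file stands in for the real input file.

```shell
# Hypothetical input file, filled with the sample from the question.
input=$(mktemp)
cat > "$input" <<'EOF'
Abes Apples
4.1,
(138) · apple company
7+ years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10+ years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10+ years in business (345) 678-9012
EOF

out=$(awk -v OFS=", " '
NR % 4 == 1 { name = $0; next }                           # line 1: company name
NR % 4 == 2 { rating = $0; sub(/,$/, "", rating); next }  # line 2: rating, trailing comma dropped
NR % 4 == 3 { reviews = $1; next }                        # line 3: first field is the review count
{                                                         # line 4: years in business, then phone
    years = $1; sub(/\+$/, "", years)                     # "7+" becomes "7"
    phone = $0; sub(/^[^(]*/, "", phone)                  # keep the text from the first "(" onward
    print name, rating, reviews, years, phone
}' "$input")
printf '%s\n' "$out"
rm -f "$input"
```

Unlike the answer's output, this variant separates every column with ", ", so the first line reads Abes Apples, 4.1, (138), 7, (123) 456-7890. It would misalign columns on any record that does not have exactly four lines, which is the trade-off for avoiding the GNU-only match() array.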