Using awk to convert multiline text to CSV

tar*_*kan 1 regex linux awk text-processing

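I am looking at some reviews and trying to determine the best company to buy apples from (for example). I copied and pasted the text below, and I would like to do some text processing on it with Linux commands. From what I have read online, awk is a good choice, but I cannot get it to work.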


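I am trying to take each line that holds a rating and append it to the line above, separated by a comma. For example, Abes Apples\n4.1 would become Abes Apples, 4.1, and this would repeat for every company. The awk command I tested is awk 'BEGIN {RS=""}{gsub(/\n[0-9]/, ", ", $0); print $0}' test.text, which gives the result below, but it is also replacing the first digit: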

Abes Apples, .1,
(138) · apple company, + years in business (123) 456-7890
Adams Apples, .9,
(105) · apple company, 0+ years in business (234) 567-8901
Apples are Amazing, .9,
(13) apple company, 0+ years in business (345) 678-9012

The text file follows the pattern below, which repeats for all lines in the file:

  1. Company name
  2. Rating
  3. Number of reviews and company type
  4. Years in business and phone number

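My goal is to convert this text file into a CSV-like format with column headers for company name, rating, number of reviews (ignoring the "apple company" text), years in business, and phone number. Is this something that can be done with awk and other Linux commands?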


Current input:

Abes Apples
4.1,
(138) · apple company
7+ years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10+ years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10+ years in business (345) 678-9012

Expected output:

Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012

Rav*_*h13 6

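With GNU awk, using RS in paragraph mode, you could try the following awk code. It was written and tested only against the samples you have shown. It uses the match function of GNU awk with the regex (^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\+?[^(]*([^\n]*) (explained further down in this answer); this creates an array named arr with indexes 1, 2, 3, and so on, one per capturing group.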

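awk -v RS= -v OFS=", " '
{
  # Consume one four-line company block per iteration (3-argument match() is GNU awk only).
  while(match($0,/(^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\+?[^(]*([^\n]*)/,arr)){
     # arr[3] already ends with a comma, so it is joined directly to arr[5].
     print arr[2],arr[3]arr[5],arr[6],arr[7]
     # Drop the matched block and continue with the rest of the record.
     $0=substr($0,RSTART+RLENGTH)
  }
}
'  Input_file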

The output will be as follows:

Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9,(105), 10, (234) 567-8901
Apples are Amazing, 3.9,(13), 10, (345) 678-9012
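As for why the attempt in the question loses a digit: gsub replaces the whole match, and /\n[0-9]/ matches the newline together with the digit that follows it, so both are overwritten by the comma and space. The following is only a sketch of one workaround that needs no GNU extensions: find each match with the 2-argument match() and rebuild the record so that only the newline is replaced. The variable names sample and joined are placeholders chosen for illustration, and the sample is a single record from the question with the middle dot left out.

```shell
# One record in the question's layout (assumed: four lines per company).
sample='Abes Apples
4.1,
(138) apple company
7+ years in business (123) 456-7890'

# Replace a newline with ", " only when a digit follows it, keeping the digit.
joined=$(printf '%s\n' "$sample" | awk 'BEGIN { RS = "" }
{
  # match() sets RSTART to the position of the newline; the digit is at RSTART + 1.
  while (match($0, /\n[0-9]/))
    $0 = substr($0, 1, RSTART - 1) ", " substr($0, RSTART + 1)
  print
}')

printf '%s\n' "$joined"
```

This only repairs the joining step from the question; it does not yet drop the "apple company" text or split out the years, which the match() approach in this answer handles.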

Explanation: a detailed breakdown of the regex used in the match() call.

(^|\n)         ##1st capturing group: either the start of the record or a newline.
([^\n]*)       ##2nd capturing group: everything up to the next newline (the company name).
\n             ##Match a newline.
([0-9]+(\.[0-9]+)?,) ##3rd capturing group: one or more digits, an optional decimal part (the 4th capturing group: a dot followed by one or more digits), then a literal comma.
\n             ##Match a newline.
(\([0-9]+\))   ##5th capturing group: ( followed by one or more digits followed by ).
[^\n]*\n       ##Match the rest of the line, then the newline.
([0-9]+)       ##6th capturing group: one or more digits (the years in business).
\+?[^(]*       ##Match an optional literal +, then everything up to the next (.
([^\n]*)       ##7th capturing group: everything up to the next newline (the phone number).
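If GNU awk is not available, the same four-line pattern can also be handled by counting lines instead of using one large regex. The following is a rough sketch of that alternative, not the method used above. It assumes that every company occupies exactly four lines with no blank lines in between, that the review count is the first whitespace-separated field of the third line, and that the phone number starts at the first ( on the fourth line. The variable names input and csv are placeholders.

```shell
# Sample input copied from the question: four lines per company.
input='Abes Apples
4.1,
(138) · apple company
7+ years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10+ years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10+ years in business (345) 678-9012'

csv=$(printf '%s\n' "$input" | awk -v OFS=", " '
NR % 4 == 1 { name = $0 }                           # line 1: company name
NR % 4 == 2 { rating = $0; sub(/,$/, "", rating) }  # line 2: rating, trailing comma dropped
NR % 4 == 3 { reviews = $1 }                        # line 3: first field, e.g. (138)
NR % 4 == 0 {                                       # line 4: years and phone number
  years = $1; sub(/\+.*/, "", years)                # 7+ becomes 7
  phone = substr($0, index($0, "("))                # from the first ( to the end of the line
  print name, rating, reviews, years, phone
}')

printf '%s\n' "$csv"
```

Because the trailing comma is stripped from the rating here, every field is separated by the same ", ", whereas the match() version prints 4.1,(138) with no space. A header row could be printed from a BEGIN block if the column titles are needed.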