Convert multi-line text to CSV using awk

Question


tar*_*kan 1 regex linux awk text-processing

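I'm going through some reviews and trying to work out the best company to buy apples from (as an example). I copied and pasted the text below, and I'd like to do some text processing on it with Linux commands. From what I've read online, awk is a good choice, but I can't get it to work.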


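I'm trying to take each line that holds a rating and append it to the line above, separated by a comma. For example, Abes Apples\n4.1 would become Abes Apples, 4.1, and this would repeat for every record. The awk command I tested is awk 'BEGIN {RS=""}{gsub(/\n[0-9]/, ", ", $0); print $0}' test.text, which gives the result below, but it also replaces the digit that follows each newline.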


Abes Apples, .1,
(138) · apple company, + years in business (123) 456-7890
Adams Apples, .9,
(105) · apple company, 0+ years in business (234) 567-8901
Apples are Amazing, .9,
(13) apple company, 0+ years in business (345) 678-9012


The text file follows the pattern below, which repeats for every record in the file:


Company name
Rating
Number of reviews and company type
Years in business and phone number


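My goal is to convert this text file into a CSV-like format with column headers for company name, rating, number of reviews (ignoring the "apple company" text), years in business, and phone number. Is this something that can be done with awk and other Linux commands?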


Current input:


Abes Apples
4.1,
(138) · apple company
7+ years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10+ years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10+ years in business (345) 678-9012


Desired output:


Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012


Answer 1

Rav*_*h13 6

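Using RS in paragraph mode in GNU awk, you could try the following awk code. It was written and tested only with the samples you showed. It uses GNU awk's match function with the regex (^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\+?[^(]*([^\n]*) (explained further down in this answer); this creates an array named arr with indices 1, 2, 3, and so on, one per capturing group.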

awk -v RS= -v OFS=", " '
{
  while(match($0,/(^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\+?[^(]*([^\n]*)/,arr)){
     print arr[2],arr[3]arr[5],arr[6],arr[7]
     $0=substr($0,RSTART+RLENGTH)
  }
}
'  Input_file


The output will be as follows:

Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9,(105), 10, (234) 567-8901
Apples are Amazing, 3.9,(13), 10, (345) 678-9012
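
The command above depends on GNU awk's three-argument match. As a rough portable alternative, the sketch below works line by line in POSIX awk. It assumes every record is exactly four lines in the order shown in the question, with no blank lines in between. The function name to_csv and the header names are placeholders chosen for this illustration, not something from the question or the answer.

```shell
# Portable sketch: assumes each record is exactly four lines
# (name / rating / reviews + type / years + phone) and no blank lines.
# to_csv and the header names are hypothetical, chosen for this example.
to_csv() {
  awk '
    BEGIN { OFS = ", "; print "name", "rating", "reviews", "years", "phone" }
    NR % 4 == 1 { name = $0 }                                   # company name
    NR % 4 == 2 { rating = $0; sub(/,[ \t]*$/, "", rating) }    # drop trailing comma
    NR % 4 == 3 { reviews = $1 }                                # "(138)" is the first field
    NR % 4 == 0 {
      years = $1; sub(/\+$/, "", years)                         # "7+" becomes "7"
      phone = $0; sub(/^[^(]*/, "", phone)                      # keep text from "(" onward
      print name, rating, reviews, years, phone
    }
  ' "$@"
}
```

Calling to_csv Input_file prints a header row followed by one line per company, for example Abes Apples, 4.1, (138), 7, (123) 456-7890. Unlike the GNU awk version, every field is separated by ", ", including the review count.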


Explanation: a detailed explanation of the regex used above.

(^|\n)               ##1st capturing group: matches either the start of the record or a new line.
([^\n]*)             ##2nd capturing group: everything up to the next new line (the company name).
\n                   ##Matches a new line.
([0-9]+(\.[0-9]+)?,) ##3rd capturing group: 1 or more digits, then the optional 4th capturing group (a dot followed by 1 or more digits), then a literal comma.
\n                   ##Matches a new line.
(\([0-9]+\))         ##5th capturing group: a literal ( followed by digits followed by a literal ).
[^\n]*\n             ##Matches the rest of the line, followed by a new line.
([0-9]+)             ##6th capturing group: 1 or more digits (the years in business).
\+?[^(]*             ##Matches an optional literal +, followed by everything up to the next (.
([^\n]*)             ##7th capturing group: everything up to the next new line (the phone number).
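
To try one piece of the regex on its own, a small POSIX awk sketch can apply just the 5th group's pattern to individual lines. The function name review_count is hypothetical; it uses the two-argument match with RSTART and RLENGTH, so it does not need GNU awk.

```shell
# Sketch: print the first "(digits)" token found on each input line.
# Lines without such a token produce no output.
# review_count is a hypothetical name used only for this illustration.
review_count() {
  awk 'match($0, /\([0-9]+\)/) { print substr($0, RSTART, RLENGTH) }' "$@"
}
```

Note that this pattern alone also matches the area code on the phone line, e.g. (123). That is why the full regex ties each group to its position in the four-line record rather than searching for each piece independently.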


Archived:	3 years, 1 month ago
Views:	156
Last activity:	3 years, 1 month ago