Sort an entire .csv based on the values in a specific column

pwr*_*ler 5 text-processing sort csv

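I have a csv file that contains different revenues. I want to sort the csv file by revenue from high to low. I could not find out how to do this in the terminal without using Python.

I do not want to use Python.

I want to use something simple, like mlr/sed/awk.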

Input:

name,location,capital,profit-lost,revenue,employees,year
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021

Output:

name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019

Revenue ranges from empty to billions.

Hoping someone can help me with this.

ann*_*hri 12

Using sort

cat input.csv | (sed -u 1q; sort -t, -r -n -k5) 

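The sed -u 1q is needed to keep the header out of sort. It basically means: process the first line and quit, leaving the remaining lines to be passed to sort. -u is short for --unbuffered, which tells sed not to buffer its input, so it does not consume lines beyond the first.

Flags for sort:

  • -t, sets the field separator to a comma.
  • -r sorts the output in descending order. The default is ascending.
  • -n sorts numerically.
  • -k5 sorts on the fifth key/column.

Demo: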

$ cat input.csv | (sed -u 1q; sort -t, -r -n -k5)
name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company5,location5,368506.18,11997.04,,,2019
company4,location4,1050987.60,426317.61,,24.90,2021
company2,location2,-155921.70,-146.03,,,2020
company1,location1,35527.19,-33226.25,,0.70,2020

  • With GNU sort, you can get the exact order the OP shows by ending the sort key at field 5 (`-k5,5`) and adding the `-s` or `--stable` option (7 upvotes)
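
The suggestion in the comment above could be sketched roughly as follows. This is only an illustration, not the answerer's command: the function name `sort_csv_stable` is made up for this example, and it assumes the data is in a regular file whose fifth column is the revenue. It reads the file twice (once for the header, once for the data rows) rather than splitting one stream between two commands.

```shell
# Hypothetical helper (name chosen for this sketch only).
# $1 is the path to a CSV file whose first line is the header.
sort_csv_stable() {
    # Print the header line unchanged.
    head -n 1 "$1"
    # Sort the remaining lines on field 5 only (-k5,5), numerically (n),
    # in descending order (r). -s keeps rows with equal keys, including
    # empty revenues, in their original input order.
    tail -n +2 "$1" | sort -t, -s -k5,5nr
}
```

Called as `sort_csv_stable input.csv`, this should leave the rows with an empty revenue in the order they appear in the input, after all rows that have a positive revenue.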

ste*_*ver 12

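So you want a (stable) sort by revenue, numerically descending, which sounds like it should be easy in Miller, except that its rules for handling empty values say:

records with one or more empty sort-field values sort after records with all sort-field values present

which means that in a descending sort they come first: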

$ mlr --csv sort -nr revenue file.csv
name,location,capital,profit-lost,revenue,employees,year
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020

However, using then-chaining, it is straightforward to do a decorate-sort-undecorate with a key that assigns numeric 0 to empty revenues:

$ mlr --csv put '$key = is_empty($revenue) ? 0 : $revenue' \
    then sort -nr key then cut -x -f key file.csv
name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019


Ed *_*ton 8

Using mandatory POSIX tools available on every Unix system:

$ { head -n 1; sort -t, -k5,5rn; } < file
name,location,capital,profit-lost,revenue,employees,year
company8,location8,6161574.62,906591.96,124804038.64,51.30,2021
company6,location6,7965648.89,369947.14,64413602.44,103.30,2019
company3,location3,1873134.74,778424.56,13320152.32,16.90,2020
company7,location7,1531534.27,125750.94,3054307.36,12.10,2020
company1,location1,35527.19,-33226.25,,0.70,2020
company2,location2,-155921.70,-146.03,,,2020
company4,location4,1050987.60,426317.61,,24.90,2021
company5,location5,368506.18,11997.04,,,2019

See the comments below and "Can head read more lines of input than it outputs?" for additional important information about the above script.

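  • Is it really guaranteed that [`head`](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/head.html#tag_20_57) will not read ahead past the specified number of lines? As far as I can tell, only the output is specified, and `head` could use buffered I/O if it wanted to and consume more input. (3 upvotes)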
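  • @EdMorton, if the input is seekable, POSIX requires `head` to leave the file position just after the last line it output, but leaves the behavior unspecified if the input is not seekable (such as a pipe). To honor that on a pipe, it would need to read one byte at a time (as `read` does) or peek at the contents of the pipe without consuming them, to know where to stop before actually reading. IIRC, that is what the `head` builtin of ksh93 does. Few systems have peekable pipes these days. I have seen `head` implementations that do not meet the POSIX requirement for seekable files. (3 upvotes)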
  • @roaima Done, see https://unix.stackexchange.com/q/750523/133219 (3 upvotes)
  • I can see the problem. `( echo a; echo b; echo c ) | ( head -n1; cat )` returns only one line, `a`, here. I have Termux on my phone, and a Pi. FWIW (2 upvotes)
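
Given the concern in the comments that `head` may consume more of a pipe than it prints, one way to sidestep the question is to avoid splitting the stream at all. The following is a minimal sketch of a decorate-sort-undecorate pipeline using awk, sort and cut; it is not taken from any of the answers. The function name `sort_by_revenue` is hypothetical, and the sketch assumes revenue is the fifth column and that the `sort` in use supports `-s` (as mentioned for GNU sort in an earlier comment).

```shell
# Hypothetical helper (name chosen for this sketch only).
# Reads CSV with a header line on stdin, writes the sorted CSV to stdout.
sort_by_revenue() {
    # Decorate: prefix the header with 0 and every data row with 1,
    # so the header always sorts first. (NR > 1) is parenthesized so
    # that ">" is not taken as an output redirection.
    awk '{ printf "%d,%s\n", (NR > 1), $0 }' |
        # Sort on the prefix first, then on revenue, which is now field 6,
        # numerically and descending. -s keeps ties in input order.
        sort -t, -s -k1,1n -k6,6nr |
        # Undecorate: drop the prefix field again.
        cut -d, -f2-
}
```

Because every command here reads its whole input, this works the same on a pipe (`cat input.csv | sort_by_revenue`) as on a redirected file (`sort_by_revenue < input.csv`).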