Performing a highly customized sort based on multiple columns of a CSV file?

Vil*_*age 3 csv sorting bash

I have a four-column CSV file that uses @ as the delimiter, e.g.:

0001 @ fish @ animal @ eats worms

The first column is the only column guaranteed to be unique.

I need to perform four sort operations on columns 2, 3, and 4.

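First, column 2 is sorted alphanumerically. The important feature of this sort is that it must guarantee that any duplicate entries in column 2 end up next to each other, e.g.: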
@ a @ @
@ a @ @
@ a @ @
@ a @ @
@ a @ @
@ b @ @
@ b @ @
@ c @ @
@ c @ @
@ c @ @
@ c @ @
@ c @ @
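
The adjacency requirement of this first step can be sketched in Python. A plain sort on column 2 already places equal values next to each other, and because Python's sort is stable, lines that share a column 2 value keep their input order. The helper name `parse_line` and the sample lines are illustrative assumptions, not part of the question.

```python
def parse_line(line):
    """Split one '@'-delimited line into four whitespace-stripped fields."""
    return [field.strip() for field in line.split("@", 3)]

# Hypothetical sample lines in the question's format.
lines = [
    "0001 @ b @ fish @ eats worms",
    "0002 @ a @ bear @ eats fish",
    "0003 @ c @ tiger @ eats meat",
    "0004 @ a @ fish @ eats worms",
    "0005 @ b @ bear @ eats fish",
]

# sorted() is stable: equal column-2 values stay in their input order,
# and all lines with the same column-2 value end up adjacent.
by_col2 = sorted(lines, key=lambda line: parse_line(line)[1])

for line in by_col2:
    print(line)
```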

Next, within the first sort, divide the lines into two categories. The first category contains the lines that do not have the words "arch.", "var.", "ver.", "anci.", or "fam." anywhere in column 4. The second category (sorted after the first) contains the lines that do have one of those words, e.g.:
@ a @ @ Does not have one of those words.
@ a @ @ Does not have one of those words.
@ a @ @ Does not have one of those words.
@ a @ @ Does not have one of those words.
@ a @ @ This sentence contains arch.
@ b @ @ Does not have one of those words.
@ b @ @ Has the word ver.
@ c @ @ Does not have one of those words.
@ c @ @ Does not have one of those words.
@ c @ @ Does not have one of those words.
@ c @ @ This sentence contains var.
@ c @ @ This sentence contains fam.
@ c @ @ This sentence contains fam.
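
One way to express this two-category split is as a boolean sort key, since `False` sorts before `True` in Python. The sketch below assumes the keywords appear as whole whitespace-separated tokens in column 4 (so "monarch." would not count as "arch."); the names `SPECIAL_WORDS` and `has_special_word` are hypothetical.

```python
SPECIAL_WORDS = {"arch.", "var.", "ver.", "anci.", "fam."}

def has_special_word(col4):
    """True if any whitespace-separated token of column 4 is a keyword."""
    return any(token in SPECIAL_WORDS for token in col4.split())

# (column 2, column 4) pairs; the other columns are omitted for brevity.
rows = [
    ("a", "This sentence contains arch."),
    ("a", "Does not have one of those words."),
    ("b", "Has the word ver."),
    ("b", "Does not have one of those words."),
]

# Sort by column 2 first, then put keyword-free lines (False) before
# keyword lines (True) inside each column-2 group.
ordered = sorted(rows, key=lambda r: (r[0], has_special_word(r[1])))

for row in ordered:
    print(" @ ".join(row))
```

If substring matching is wanted instead, the token test could be swapped for `any(word in col4 for word in SPECIAL_WORDS)`, at the cost of false positives on longer words that end in a keyword.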

Finally, sorting only within the separate categories of the second sort, order the lines from "contains the most duplicate entries in column 3" to "contains the fewest duplicate entries in column 3", e.g.:
@ a @ fish @ Does not have one of those words.
@ a @ fish @ Does not have one of those words.
@ a @ fish @ Does not have one of those words.
@ a @ tiger @ Does not have one of those words.
@ a @ bear @ This sentence contains arch.
@ b @ fish @ Does not have one of those words.
@ b @ fish @ Has the word ver.
@ c @ bear @ Does not have one of those words.
@ c @ bear @ Does not have one of those words.
@ c @ fish @ Does not have one of those words.
@ c @ tiger @ This sentence contains var.
@ c @ tiger @ This sentence contains fam.
@ c @ bear @ This sentence contains fam.
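
The "most duplicates first" ordering inside a single category can be sketched with `collections.Counter`: count each column 3 value, then sort by descending count. Breaking ties between equally frequent values alphabetically is an assumption added here so that identical values stay adjacent; the question does not specify it. The function name `order_by_frequency` and the sample data are hypothetical.

```python
from collections import Counter

def order_by_frequency(rows):
    """Order (column 3, id) pairs from most to least frequent column 3."""
    counts = Counter(col3 for col3, _ in rows)
    # Negative count gives descending frequency; the column-3 value itself
    # is the tiebreaker, which keeps equal values next to each other.
    return sorted(rows, key=lambda r: (-counts[r[0]], r[0]))

# One category of one column-2 group, as (column 3, id) pairs.
category = [
    ("tiger", "0001"),
    ("fish", "0002"),
    ("bear", "0003"),
    ("fish", "0004"),
    ("fish", "0005"),
    ("bear", "0006"),
]

for col3, row_id in order_by_frequency(category):
    print(row_id, col3)
```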

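How can I sort the file alphanumerically by column 2, then by the occurrence of certain keywords in column 4, and then from the most common to the least common duplicates in column 3?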

Kaz*_*Kaz 5

TXR (http://www.nongnu.org/txr):

@(bind special-words ("arch." "var." "ver." "anci." "fam."))
@(bind ahash @(hash :equal-based))
@(repeat)
@id @@ @alpha @@ @animal @@ @words
@  (rebind words @(split-str words " "))
@  (bind record (id alpha animal words))
@  (do (push record [ahash alpha]))
@(end)
@(bind sorted-rec-groups nil)
@(do
   (defun popularity-sort (recs)
     (let ((histogram [group-reduce (hash)
                                    third (do inc @1)
                                    recs 0]))
      [sort recs > [chain third histogram]]))

   (dohash (key records ahash)
     (let (contains does-not combined)
       (each* ((r records)
               (w [mapcar fourth r]))
         (if (isec w special-words)
           (push r contains)
           (push r does-not)))
       (push (append (popularity-sort does-not)
                     (popularity-sort contains))
             sorted-rec-groups)))
   (set sorted-rec-groups [sort sorted-rec-groups :
                                [chain first second]]))
@(output)
@  (repeat)
@    (repeat)
@(rep)@{sorted-rec-groups} @@ @(last)@{sorted-rec-groups " "}@(end)
@    (end)
@  (end)
@(end)

Data:

0001 @ b @ fish @ Does not have one of those words.
0002 @ a @ bear @ Does not have one of those words.
0003 @ b @ bear @ Has the word ver.
0004 @ a @ fish @ Does not have one of those words.
0005 @ c @ bear @ Does not have one of those words.
0006 @ c @ bear @ Does not have one of those words.
0007 @ a @ fish @ Does not have one of those words.
0008 @ c @ fish @ Does not have one of those words.
0009 @ a @ fish @ Does not have one of those words.
0010 @ c @ tiger @ This sentence contains var.
0011 @ c @ bear @ This sentence contains fam.
0012 @ a @ fish @ Does not have one of those words.
0013 @ c @ tiger @ This sentence contains fam.

Run:

$ txr sort.txr data.txt 
0004 @ a @ fish @ Does not have one of those words.
0007 @ a @ fish @ Does not have one of those words.
0009 @ a @ fish @ Does not have one of those words.
0012 @ a @ fish @ Does not have one of those words.
0002 @ a @ bear @ Does not have one of those words.
0001 @ b @ fish @ Does not have one of those words.
0003 @ b @ bear @ Has the word ver.
0005 @ c @ bear @ Does not have one of those words.
0006 @ c @ bear @ Does not have one of those words.
0008 @ c @ fish @ Does not have one of those words.
0010 @ c @ tiger @ This sentence contains var.
0013 @ c @ tiger @ This sentence contains fam.
0011 @ c @ bear @ This sentence contains fam.
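
For readers without TXR, roughly the same ordering can be sketched in Python as a single composite sort key. This is not a translation of the TXR script: the names `parse` and `custom_sort` are hypothetical, keywords are matched as whole whitespace-separated tokens, lines are assumed to be well formed with four fields, and ties between equally frequent column 3 values are broken alphabetically (an assumption that does not affect this data set).

```python
from collections import Counter

SPECIAL_WORDS = {"arch.", "var.", "ver.", "anci.", "fam."}

def parse(line):
    """Split an '@'-delimited line into four whitespace-stripped fields."""
    return [field.strip() for field in line.split("@", 3)]

def custom_sort(lines):
    keyed = []
    for line in lines:
        _id, col2, col3, col4 = parse(line)
        special = any(tok in SPECIAL_WORDS for tok in col4.split())
        keyed.append((col2, special, col3, line))
    # Count column-3 values separately per (column 2, category) pair.
    counts = Counter((c2, sp, c3) for c2, sp, c3, _ in keyed)
    # Key: column 2, keyword-free lines first, most frequent column 3
    # first, then column 3 itself as a tiebreaker. The sort is stable,
    # so fully tied lines keep their input order.
    keyed.sort(key=lambda k: (k[0], k[1], -counts[k[:3]], k[2]))
    return [k[3] for k in keyed]

DATA = """\
0001 @ b @ fish @ Does not have one of those words.
0002 @ a @ bear @ Does not have one of those words.
0003 @ b @ bear @ Has the word ver.
0004 @ a @ fish @ Does not have one of those words.
0005 @ c @ bear @ Does not have one of those words.
0006 @ c @ bear @ Does not have one of those words.
0007 @ a @ fish @ Does not have one of those words.
0008 @ c @ fish @ Does not have one of those words.
0009 @ a @ fish @ Does not have one of those words.
0010 @ c @ tiger @ This sentence contains var.
0011 @ c @ bear @ This sentence contains fam.
0012 @ a @ fish @ Does not have one of those words.
0013 @ c @ tiger @ This sentence contains fam.
"""

result = custom_sort(DATA.splitlines())
print("\n".join(result))
```

On the sample data above, this sketch yields the same line order as the TXR run. To process a file instead, pass `open("data.txt").read().splitlines()` to `custom_sort`.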