How to split a file by percentage of the number of lines?

alv*_*vas 9 bash awk split sed file

How can I split a file by percentage of the number of lines?

Let's say I want to split my file into 3 portions (60%/20%/20%). I could do this manually, -_-:

$ wc -l brown.txt 
57339 brown.txt

$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
$ bc <<< "34398 + 11466 + 11475"
57339

$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
   34398 part1.txt
   11466 part2.txt
   11475 part3.txt
   57339 total

But I'm sure there's a better way!

Ed *_*ton 9

$ cat file
a
b
c
d
e

$ cat tst.awk
BEGIN {
    split(pcts,p)
    nrs[1]
    for (i=1; i in p; i++) {
        pct += p[i]
        nrs[int(size * pct / 100) + 1]
    }
}
NR in nrs{ close(out); out = "part" ++fileNr ".txt" }
{ print $0 " > " out }

$ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
a > part1.txt
b > part1.txt
c > part1.txt
d > part2.txt
e > part3.txt

Change " > " to just > to actually write to the output files.
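As a sketch of that change: the program below is tst.awk from above with only the last rule altered, inlined into a shell function. The function name split_by_pct is made up for illustration and is not part of the original answer.

```shell
# Hypothetical wrapper; usage: split_by_pct FILE "PCT1 PCT2 ..."
split_by_pct() {
    awk -v size="$(wc -l < "$1")" -v pcts="$2" '
    BEGIN {
        split(pcts, p)
        nrs[1]                                  # part1 always starts at line 1
        for (i = 1; i in p; i++) {
            pct += p[i]
            nrs[int(size * pct / 100) + 1]      # first line of the next part
        }
    }
    NR in nrs { close(out); out = "part" ++fileNr ".txt" }
    { print > out }                             # redirect instead of printing " > "
    ' "$1"
}
```

With the five-line file shown earlier, split_by_pct file "60 20 20" should leave a, b and c in part1.txt, d in part2.txt and e in part3.txt, matching the dry-run output.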


Ben*_* W. 9

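There is a utility that takes as arguments the line numbers that should become the first line of each respective new file: csplit. This is a wrapper around its POSIX version: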

#!/bin/bash

usage () {
    printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
}

# Collect csplit options
while getopts "ksf:n:" opt; do
    case "$opt" in
        k|s) args+=(-"$opt") ;;           # k: no remove on error, s: silent
        f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
        *) usage; exit 1 ;;
    esac
done
shift $(( OPTIND - 1 ))

fname=$1
shift
ratios=("$@")

len=$(wc -l < "$fname")

# Sum of ratios and array of cumulative ratios
for ratio in "${ratios[@]}"; do
    (( total += ratio ))
    cumsums+=("$total")
done

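# Don't need the last element (negative subscripts in unset need Bash 4.3+;
# quoted so the shell can't treat the brackets as a glob pattern)
unset 'cumsums[-1]'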

# Array of numbers of first line in each split file
for sum in "${cumsums[@]}"; do
    linenums+=( $(( sum * len / total + 1 )) )
done

csplit "${args[@]}" "$fname" "${linenums[@]}"

After the name of the file to split, it takes the ratios of the sizes of the split files relative to their sum, i.e.,

percsplit brown.txt 60 20 20
percsplit brown.txt 6 2 2
percsplit brown.txt 3 1 1

are all equivalent.
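The equivalence follows from the integer arithmetic in the script: only the cumulative ratio divided by the total matters. The sketch below repeats just that computation and prints the line numbers that would be handed to csplit. The helper name first_lines is hypothetical and not part of the wrapper, and it is written in plain POSIX shell, so its variables are global.

```shell
# Hypothetical helper; usage: first_lines LEN RATIO...
first_lines() {
    len=$1; shift
    total=0
    for ratio in "$@"; do total=$(( total + ratio )); done
    cum=0 out= n=$# i=0
    for ratio in "$@"; do
        i=$(( i + 1 ))
        [ "$i" -lt "$n" ] || break       # the last ratio adds no split point
        cum=$(( cum + ratio ))
        out="$out${out:+ }$(( cum * len / total + 1 ))"
    done
    printf '%s\n' "$out"
}

first_lines 57339 60 20 20   # prints: 34404 45872
first_lines 57339 6 2 2      # prints: 34404 45872
first_lines 57339 3 1 1      # prints: 34404 45872
```

All three calls give the same split points because 60/100, 6/10 and 3/5 are the same fraction, so the truncated quotient is identical.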

Usage similar to the case in the question looks like this:

$ percsplit -s -f part -n 1 brown.txt 60 20 20
$ wc -l part*
 34403 part0
 11468 part1
 11468 part2
 57339 total

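Numbering starts at zero, though, and there is no txt extension. The GNU version supports a --suffix-format option that would allow a .txt extension, and it could be added to the accepted arguments, but that would require something more elaborate than getopts to parse them.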
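For reference, calling GNU csplit directly with that option might look like the sketch below. It assumes GNU coreutils (-b is the short form of --suffix-format and is not in POSIX). The ten-line file demo.txt, the temporary directory and the split points 7 and 9 are chosen purely for illustration.

```shell
# Assumes GNU csplit; demo.txt and the split points are illustrative only
cd "$(mktemp -d)"
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > demo.txt

# -b '%d.txt' replaces the default two-digit suffix with N.txt
csplit -s -f part -b '%d.txt' demo.txt 7 9

wc -l part*.txt
```

This should produce part0.txt, part1.txt and part2.txt with 6, 2 and 2 lines. The numbering still starts at zero.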

This solution plays nicely with very short files (splitting a two-line file into two), and the heavy lifting is done by csplit itself.